Relational Data Tools: SQL, Awk, Pig

a side-by-side reference sheet

grammar and invocation | variables and expression | arithmetic and logic | strings | regexes | dates and time | arrays | dictionaries | tables | relational algebra | aggregation | functions | execution control | files | libraries and namespaces | reflection

version used
PostgreSQL 9.0200705010.9
show version
> SELECT version();$ awk --version$ pig --version
grammar and invocation
$ psql -f foo.sql$ awk -f foo.awk bar.txt$ pig -f foo.pig
$ psqlnone$ pig -x local
input data formatmultiple tables defined by create table statementssingle field and record delimited file. By default fields are delimited by whitespace and records by newlinesmultiple files. Default loading function is PigStorage and default delimiter is tab
statement separator;; or newline; and newline when using REPL
block delimitersnone in SQL; PL/SQL uses keywords to delimit blocks{ }
end-of-line comment-- comment# comment-- comment
multiple line commentnonenone/* comment
another comment */
variables and expressions
case sensitive?noyesfunctions and aliases are case sensitive; commands and operators are not
quoted identifierCREATE TABLE "select" (
  "foo bar" INT

CREATE TABLE `select` (
   `foo bar` INT
null test
foo IS NULLfoo == ""in filter clause:
is null
coalesce(foo, 0)foo is null ? 0 : foo
nullifnullif(foo, -999)
conditional expressionCASE WHEN x > 0 THEN x ELSE -x ENDx > 0 ? x : -xx > 0 ? x : -x
arithmetic and logic
boolean type
BOOLEANvalues are untypednone
true and false

the predicate of a where clause evaluates as false if it contains a null
0 ""none; the first operand of a conditional expression must be a comparison operator expression
logical operatorsAND OR NOT&& || !in filter clause:
and or not
logical expression
relational operators= != > < >= <=== != > < >= <=== != > < >= <=
integer typesmallint 2 bytes
int 4 bytes
bigint 8 bytes
variables are untypedint 4 bytes
long 8 bytes
float typefloat 4 bytes
double 8 bytes
variables are untypedfloat 4 bytes
double 8 bytes
fixed type
numeric(precision, scale)variables are untypednone
arithmetic operators
addition, subtraction, multiplication, division, remainder
+ - * / %+ - * / %+ - * / %
integer division
13 / 5int(13 / 5)13 / 5
integer division by zeroerrorerrornull
float division
cast(13 as float) / 513 / 51.0 * 13 / 5
float division by zeroerrorerrornull
power2 ^ 322 ^ 32
2 ** 32
sqrt -1errorgenerates a warning:
transcendental functionsexp ln sin cos tan asin acos atan atan2exp log sin cos none none none none atan2EXP LOG SIN COS TAN ASIN ACOS ATAN none
float truncationcast(2.7 as int)
absolute value
random number
bit operators<< >> & | # ~gawk:
lshift(val, cnt)
rshift(val, cnt)
and(val1, val2)
or(val1, val2)
xor(val1, val2)
variables are untypedchararray
'don''t say "no"'

PostgreSQL escape literal:
E'don\'t say "no"'
"don't say \"no\""'don\'t say "no"'
length('lorem ipsum')length("lorem ipsum")SIZE('lorem ipsum')
escapesno backslash escapes in SQL standard string literals

in PostgreSQL escape literals:
\b \f \n \r \t \ooo \xhh \uhhhh \uhhhhhhhh
\\ \" \a \b \f \n \r \t \v \ooo \xhh\n \t \uhhhh
'Hello' || ', ' || 'World!'("Hello" " " "World!")CONCAT(CONCAT('Hello', ', '), 'World!')
  'do re mi', ' '
split("do re mi", a, " ")STRSPLIT('do re mi', ' ')
case manipulationupper('lorem')
striptrim(' lorem ')
ltrim(' lorem')
rtrim('lorem ')
TRIM(' lorem ')
index of substringindex starts from 1; returns 0 if not found:
strpos('lorem ipsum', 'ipsum')
index starts from 1; returns 0 if not found:
index("lorem ipsum", "ipsum")
index starts from 0; returns -1 if not found:
INDEXOF('lorem ipsum', 'ipsum', 0)
extract substringsubstr('lorem ipsum', 7, 5)substr("lorem ipsum", 7, 5)SUBSTRING('lorem ipsum', 6, 11)
sprintfselect format('%s %s %s', 'foo', 7, 13.2);sprintf("%s %d %f", "foo", 7, 13.2)REGISTER /PATH/TO/piggybank-0.3-amzn.jar;
DEFINE FORMAT org.apache.pig.piggybank.

  FORMAT('%s %d %f', 'foo', 7, 13.2);
regular expressions
matchselect *
from pwt
where name similar to 'r[a-z]+';
matching inside pattern:
$1 ~ /^r[a-z]+/ { print $0 }

matching inside action:
{ if (match($1, /^r[a-z]+$/)) print $0 }
root_pwf = filter pwf
  by SIZE(REGEX_EXTRACT(name, '^(root)$',1)) > 0;
select regexp_replace('foo bar', 'bar$', 'baz');s = "foo bar"
sub(/bar$/, "baz", s)
extract subgroupselect (regexp_matches('foobar', '(f..)bar'))[1];properties = FOREACH urls GENERATE FLATTEN(EXTRACT($0, '^https?://([^/]+)')) as host:chararray;
dates and time
current date and timenow()gawk:
REGISTER /PATH/TO/piggybank-0.3-amzn.jar;
DEFINE DATE_TIME org.apache.pig.piggybank.

bar = FOREACH foo GENERATE DATE_TIME('-00:00');
datetime to stringto_char(now(), 'YYYY-MM-DD HH24:MI:SS')gawk:
strftime("%Y-%m-%d %H:%M:%S", systime())
REGISTER /PATH/TO/piggybank-0.3-amzn.jar;
DEFINE FORMAT_DT org.apache.pig.piggybank.

baz = FOREACH bar GENERATE FORMAT_DT('yyyy-MM-dd HH::mm:ss', $0);
string to datetimeto_timestamp('2011-09-26 00:00:47', 'YYYY-MM-DD HH24:MI:SS')gawk:
mktime("2011 09 26 00 00 47")
create temp table foo ( a int[] );
insert into foo values ( '{1,2,3}' );
noneone_row = LOAD '/tmp/one_line.txt';
foo = FOREACH one_row GENERATE (1,2,3);
select array_upper(a, 1) from foo;
none, but split function returns lengthbar = FOREACH foo GENERATE SIZE(a);
select a[1] from foo;
a[0]bar = FOREACH foo GENERATE a.$0;
iterationfor (i in a) print i, a[i]
literal['t'#1, 'f'#0]
order rows by columnordering depends on type of column
select name
from pwt
order by name;
ordering is lexical unless -n flag is used
$ sort -k1 -t: /etc/passwd
names = foreach pwf generate name;
ordered_names = order names by name;
order rows by multiple columnsselect group, name
from pwt
order by group, name;
order rows in descending orderselect name
from pwt
order by name;
$ sort -k1 -r -t: /etc/passwd
limitselect name from pwt order by name limit 10;first_ten = limit ordered_names 10;
offsetselect name from pwt order by name limit 10 offset 10;
relational algebra
project columns by nameselect name, pw from pwt;nonename_pw = foreach pwf generate name, pw;
project columns by positionnone{ print $1, $2 }name_pw = foreach pwf generate $0, $1;
project expression
project all columnsselect * from pwt;# prints input line:
{ print $0 }
pwf2 = foreach pwf generate *;
rename columnsselect uid as userid, gid as groupid
from pwf;
noneusergroups = foreach pwf generate uid as userid, gid as groupid;
select rowsselect * from pwt where name = 'root';$1 == "root" { print $0 }pwf3 = filter pwf by name == 'root';
select distinct rowsselect distinct gid from pwt;$ cat > gid.awk
BEGIN { FS=":" }
!/^#/ { print $4 }
$ awk -f gid.awk /etc/passwd > /tmp/x
$ sort -u /tmp/x
gids1 = foreach pwf generate gid;
gids2 = distinct gids1;
split rowssplit pwf into rootpwf if name == 'root', otherpwf if name != 'root';
inner joincreate temp table gt (
  name text, pw text,
  gid int, members text

copy gt from '/etc/group'
with delimiter ':';

select * from pwt
join gt on pwt.gid = gt.gid;
$ awk '!/^#/' /etc/passwd > /tmp/x
$ sort -k4,4 -t: < /tmp/x > /tmp/pw

$ awk '!/^#/' /etc/group > /tmp/y
$ sort -k3,3 -t: < /tmp/y > /tmp/g

$ join -t: -14 -23 /tmp/pw /tmp/g
gf = load '/etc/group' using PigStorage(':') as (name:chararray, pw:chararray, gid:int, members:chararray);

pwgf = join pwf by gid, gf by gid;
nulls as join valuesinput tuples with nulls for join values are omitted from inner joinsno null value; empty strings are joinable valuesinput tuples with nulls for join values are omitted from inner joins
self joinselect,
from pwt as pwt1, pwt as pwt2
where pwt1.gid = pwt2.gid;
$ join -t: -14 -24 /tmp/pw /tmp/pwpwf2 = foreach pwf generate *;
joined_by_gid = join pwf by gid, pwf2 by gid;
name_pairs = foreach by_group generate pwf::name, pwf2::name;
left joinselect *
from customers c
left join orders o
  on = o.customer_id;
$ join -t: -a1 -11 -22 /tmp/c /tmp/oj = join customers by id left, orders by customer_id;
full joinselect *
from customers c
full join orders o
  on = o.customer_id;
$ join -t: -a1 -a2 -11 -22 /tmp/c /tmp/oj = join customers by id full, orders by customer_id;
cross joincreate table files ( file text );
insert into files values ('a'), ('b'), ('c'), ('d'), ('e'), ('f'), ('g'), ('h');

create table ranks ( rank int );
insert into ranks values (1), (2), (3), (4), (5), (6), (7), (8);

create table chessboard ( file text, rank int );
insert into chessboard
select *
from files, ranks;
specify nonexistent join fields:

$ join -12 -22 /tmp/f /tmp/r
files = load '/tmp/files.txt' as (file:chararray);

ranks = load '/tmp/ranks.txt' as (rank:int);

chessboard = cross files, ranks;
group byselect gid, count(*)
from pwf
group by gid;
BEGIN { FS=":" }
!/^#/ { cnt[$4]++ }
END { for (i in cnt) print i, cnt[i] }
by_group = group pwf by gid;
cnts = foreach by_group generate $0, COUNT($1);
group by multiple columnsselect gid, sh, count(*)
from pwf
group by gid, sh;
by_group = group pwf by (gid, sh);
cnts = foreach by_group generate FLATTEN($0), COUNT($1);
aggregation functionscount sum min max avg stddevnoneCOUNT SUM MIN MAX AVG none
define functioncreate or replace function
  add ( x int, y int )
  returns int as $$
  return x + y;
end; $$ language plpgsql;
can be defined at position of pattern-action statement:

function add(x, y) { return x+y }
in org/hpglot/
package org.hpglot;

import org.apache.pig.EvalFunc;

public class Add extends EvalFunc<Integer> {
  @Override public Integer
  exec(Tuple input) throws IOException {
    Integer x, y;
    if (input == null @@
invoke functionselect add(x, y) from foo;{ print add($1, $2) }REGISTER Add.jar
DEFINE ADD org.hpglot.Add();

foo = LOAD 'foo.txt' AS (x:int, y:int);
bar = FOREACH foo GENERATE ADD(x, y);
drop function
drop function add(integer, integer);nonenone
execution control
ifif (!0) print "foo"; else print "bar"none
whilei = 0; while (i<5) print i++none
forfor (i=0; i<5; i++) print inone
set field delimiter$ grep -v '^#' /etc/passwd > /tmp/pw

create table pwt ( name text, pw text, uid int, gid int, gecos text, home text, sh text );

copy pwt from '/tmp/pw' with delimiter ':';
BEGIN { FS=":" }pwf = LOAD '/etc/passwd' USING PigStorage(':') AS (name:chararray, pw:chararray, uid:int, gid:int, gecos:chararray, home:chararray, sh:chararray);
write table to fileSTORE foo INTO '/tmp/';
libraries and namespaces
table schemaDESCRIBE foo;


versions used

The versions used for testing code in the reference sheet.

show version

How to get the version.

Grammar and Invocation


How to run the interpreter on a script.


How to invoke the REPL.

statement separator

The statement separator.

block delimiters

The delimiters used for blocks.

end-of-line comment

How to create a comment that ends at the next newline.

multiple line comment

How to comment out multiple lines.

Variables and Expressions

case sensitive?

Are identifiers which differ only by case treated as distinct identifiers?

quoted identifier

How to quote an identifier.

Quoting an identifier is a way to include characters which aren't normally permitted in an identifier.

In SQL quoting is also a way to refer to an identifier that would otherwise be interpreted as a reserved word


The null literal.


PigStorage, the default function for loading and persisting relations, represents a null value with an empty string. Null is distinct from an empty string since null == '' evaluates as false. Thus PigStorage cannot load or store an empty string.

null test

How to test whether an expression is null.


The expression null = null evaluates as null, which is a ternary boolean value distinct from true and false. Expressions built up from arithmetic operators or comparison operators which contain a null evaluate as null. When logical operators are involved, null behaves like the unknown value of Kleene logic.


How to use the value of an expression, replacing it with an alternate value if it is null.


How to use the value of an expression, replacing a specific value with null.

conditional expression

The syntax for a conditional expression.

Arithmetic and Logic

true and false

Literals for true and false.


Values which evaluate as false in a boolean context.

logical operators

The logical operators. Logical operators impose a boolean context on their arguments and return a boolean value.

relational operators

The comparison operators, also known as the relational operators.

integer type

Integer types.


Datatypes are database specific, but the mentioned types are provided by both PostgreSQL and MySQL.


Variables are untyped and implicit conversions are performed between numeric and string types.

The numeric literal for zero, 0, evaluates as false, but the string "0" evaluates as true. Hence we can infer that awk has at least two distinct data types.

float type

Floating point decimal types.


Datatypes are database specific, but the mentioned types are provided by both PostgreSQL and MySQL.

fixed type

Fixed precision decimal types.


Datatypes are database specific, but the mentioned types are provided by both PostgreSQL and MySQL.

arithmetic operations

The arithmetic operators: addition, subtraction, multiplication, division, modulus, and exponentiation.

integer division

How to compute the quotient of two numbers. The quotient is always an integer.

integer division by zero

What happens when an integer is divided by zero.


Division by zero evaluates to null. Recall that PigStorage stores nulls in files as empty strings.

float division

How to perform floating point division, even if the operands are integers.

float division by zero

The result of dividing a float by zero.


Division by zero evaluates to null. Recall that PigStorage stores nulls in files as empty strings.


How to raise a number to a power.


How to get the square root of a number.

sqrt -1

The result of taking the square root of negative one.

transcendental functions

The standard transcendental functions of mathematics.

float truncation

How to truncate floats to integers. The functions (1) round towards zero, (2) round to the nearest integer, (3) round towards positive infinity, and (4) round towards negative infinity. How to get the absolute value of a number is also illustrated.

absolute value

The absolute value of a number.

random number

How to create a unit random float.



The available string types.


A chararray is a string of Unicode characters. Like in Java the characters are UTF-16 encoded.

A bytearray is a string of bytes. Data imported into Pig is of type bytearray unless declared otherwise.


The syntax for string literals.


MySQL also has double quoted string literals. PostgreSQL and most other database use double quotes for identifiers.

In a MySQL double quoted string double quote characters must be escaped with reduplication but single quote characters do not need to be escaped.


Single quoted string literals are of type chararray. There is no syntax for a bytearray literal.


How to get the length of a string.


Escape sequences which are available in string literals.


Here is a portable way to include a newline character in a SQL string:

select 'foo' || chr(10) || 'bar';

MySQL double and single quoted strings support C-style backslash escapes. Backslash escapes are not part of the SQL standard. Their interpretation can be disabled at the session level with



How to concatenate strings.


How to split a string into an array of substrings.


How to split a string into multiple rows of data:

=> create temp table foo ( bar text );

=> insert into foo select regexp_split_to_table('do re mi', ' ');

case manipulation

How to uppercase a string; how to lower case a string; how to capitalize the first character.


How to remove whitesapce from the edges of a string.

index of substring

How to get the leftmost index of a substring in a string.

extract substring

How to extract a substring from a string.


How to create a string from a format.

Regular Expressions



Pig does not directly support a regex match test. The technique illustrated is to extract a subgroup and see whether the resulting tuple has anything in it. This can be done in the by clause of a filter statement, but not as the first operand of a conditional expression.


How to perform substitution on a string.


sub and the global variant gsub return the number of substitutions performed.

extract subgroup

Date and Time



The syntax for an array literal.


The syntax for arrays is specific to PostgreSQL. MySQL does not support arrays.

Defining a column to be an array violates first normal form.


Pig tuples can be used to store a sequence of data in a field. Pig tuples are heterogeneous; the components do not need to be of the same type.


How to get the number of elements in an array.


How to get the value in an array by index.


How to change a value in an array.


How to iterate through the values of an array.




Dictionaries are called maps in Pig. The keys must be character arrays, but the values can be any type.


order by

How to sort the rows in table using the values in one of the columns.

order by multiple columns

How to sort the rows in a table using the values in multiple columns. If the values in the first column are the same, the values in the seconds column are used as a tie breaker.



Relational Algebra

In a mapping operation the output relation has the same number of rows as the input relation. A mapping operation can be specified with a function which accepts an input record and returns an output record.

In a filtering operation the output relation has a less than or equal number of rows as the input relation. A filtering operation can be specified with a function which accepts an input record and returns a boolean value.

input data format

The data formats the language can operate on.

set field delimiter

For languages which can operate on field and record delimited files, how to set the field delimiter.


The PostgreSQL copy command requires superuser privilege unless the input source is stdin. Here is an example of how to use the copy command without superuser privilege:

$ ( echo "copy pwt from stdin with delimiter ':';"; cat /tmp/pw ) | psql

The copy command is not part of the SQL standard. MySQL uses the following:

load data infile '/etc/passwd' into table pwt fields terminated by ':';

Both PostgreSQL and MySQL will use tab characters if a field separator is not specified. MySQL permits the record terminator to changed from the default newline, but PostgreSQL does not.

select column by name

How to select fields by name.

select column by position

select all columns

rename columns

filter rows

split rows

An aggregation operation is similar to a filtering operation in that it accepts an input relation and produces an output relation with less than or equal number of rows. An aggregation is defined by two functions: a partitioning function which accepts a record and produces a partition value, and a reduction function which accepts a set of records which share a partition value and produces an output record.

select distinct

How to remove duplicate rows from the output set.

Removing duplicates can be accomplished with an aggregation operation in which the partition value is the entire row and a reduction function which returns the first row in the set of rows sharing the partition value.

inner join

In an inner join, only tuples from the input relations which satisfy a join predicate are used in the output relation.

A special but common case is when the join predicate consists of an equality test or a conjunction of two or more equality tests. Such a join is called an equi-join.


If awk is available at the command line then chances are good that join is also available.

null treatment in joins

How rows which have a null value for the join column are handled.

Both SQL and Pig do not include such rows in the output relation unless an outer join (i.e. a left, right, or full join) is specified. Even in the case of an outer join the rows with a null join column value are not joined with any rows from the other relations, even if there are also rows in the other relation with null join column values. Instead the columns that derive from the other input relation will have null values in the output relation.

self join

A self join is when a relation is joined with itself.

If a relation contained a list of people and their parents, then a self join could be used to find a persons grandparents.


An alias be used in a JOIN statement more than once. Thus to join a relation with itself it must first be copied with a FOREACH statement.

left join

How to include rows from the input relation listed on the left (i.e. listed first) which have values in the join column which don't match any rows from the input relation listed on the right (i.e. listed second). The term in short for left outer join.

As an example, a left join between customers and orders would have a row for every order placed and the customer who placed it. In addition it would have rows for customers who haven't placed any orders. Such rows would have null values for the order information.


Here is a complete example with the schemas and data used in the left join:

create table customers ( id int, name text );

insert into customers values ( 1, 'John' ), ( 2, 'Mary' ), (3, 'Jane');

create table orders ( id int, customer_id int, amount numeric(9, 2));

insert into orders values ( 1, 2, 12.99 );
insert into orders values ( 2, 3, 5.99 );
insert into orders values ( 3, 3, 12.99 );

select * from customers c left join orders o on = o.customer_id;
 id | name | id | customer_id | amount 
  1 | John |    |             |       
  2 | Mary |  1 |           2 |  12.99
  3 | Jane |  2 |           3 |   5.99
  3 | Jane |  3 |           3 |  12.99
(4 rows)

left outer join is synonymous with left join. The following query is identical to the one above:

select *
from customers c
left outer join orders o
  on = o.customer_id;


For a complete example assume the following data is in /tmp/customers.txt:


and /tmp/orders.txt:


Here is the Pig session:

customers = LOAD '/tmp/customers.txt' USING PigStorage(':') AS (id:int, name:chararray);
orders = LOAD '/tmp/orders.txt' USING PigStorage(':') AS (id:int, customer_id:int, amount:float);
j = join customers by id left, orders by customer_id;
dump j;

Here is the output:


full join

A full join is a join in which rows with null values for the join condition from both input relations are included in the output relation.

Left joins, right joins, and full joins are collectively called outer joins.


We illustrate a full join by using the schema and data from the left join example and adding an order with a null customer_id:

insert into orders values ( 4, null, 7.99);

select * from customers c full join orders o on = o.customer_id;
 id | name | id | customer_id | amount 
  1 | John |    |             |       
  2 | Mary |  1 |           2 |  12.99
  3 | Jane |  2 |           3 |   5.99
  3 | Jane |  3 |           3 |  12.99
    |      |  4 |             |   7.99
(5 rows)


For a complete example assume the following data is in /tmp/customers.txt:


and /tmp/orders.txt:


Here is the Pig session:

customers = LOAD '/tmp/customers.txt' USING PigStorage(':') AS (id:int, name:chararray);
orders = LOAD '/tmp/orders.txt' USING PigStorage(':') AS (id:int, customer_id:int, amount:float);
j = join customers by id full, orders by customer_id;
dump j;

Here is the output:


cross join

A cross join is a join with no join predicate. It is also called a Cartesian product. If the input relations have N1, N2, …, Nm rows respectively, then the output relation has $$\prod_{i=1}^{m} N_i$$ rows.


group by


The columns in the select clause of a select statement with a group by clause must be expressions built up of columns listed in the group by clause and aggregation functions. Aggregation functions can contain expressions containing columns not in the group by clause as arguments.


The output relation of a GROUP BY operation is always a relation with two fields. The first is the partition value, and the second is a bag containing all the tuples in the input relation which have the partition value.

group by multiple columns

How to group by multiple columns. The output relation will have a row for each distinct tuple of column values.


Tuples must be used to group on multiple fields. Tuple syntax is used in the GROUP BY statement and the first field in the output relation will be tuple. The FLATTEN function can be used to replace the tuple field with multiple fields, one for each component of the tuple.

aggregation functions

The aggregation functions.


Rows for which the expression given as an argument of an aggregation function is null are excluded from a result. In particular, COUNT(foo) is the number of rows for which the foo column is not null. COUNT(*) always returns the number of rows in the input set, including even rows with columns that are all null.


The Pig aggregation functions operate on bags. The GROUP BY operator produces a relation of tuples in which the first field is the partition value and the second is a bag of all the tuples which have the partition value. Since $1 references the second component of a tuple, it is often the argument to the aggregation functions.

A join is an operation on m input relations. If the input relations have n1, n2, …, nm columns respectively, then the output relation has $$\sum_{i=1}^{m} n_i$$ columns.


define function

How to define a function.


To be able to write PL/pgSQL functions on a PostgreSQL database, someone with superuser privilege must run the following command:

create language plpgsql;

invoke function

How to invoke a function.

drop function

How to remove a function.


PL/pgSQL permits functions with the same name and different parameter types. Resolution happens at invocation using the types of the arguments.

When dropping the function the parameter types must be specified. There is no statement for dropping multiple functions with a common name.

Execution Control


How to execute code conditionally.


How to implement a while loop.


How to implement a C-style for loop.


Library and Namespaces



PostgreSQL 9.1: The SQL Language
MySQL 5.6 Reference Manual

SQL has been the leading query language for relational databases since the early 1980s. It received its first ISO standardization in 1986.

SQL statements are classified into three types: data manipulation language (DML), data definition language (DDL), and data control language (DCL). DDL defines and alters the database schema. DCL controls the privileges of database users. DML queries and modifies the data in the tables.

sql statementtype


awk - pattern-directed scanning and processing language
POSIX specification for awk
POSIX specification for join
POSIX specification for sort

Awk has been included on all Unix systems since 7th Edition Unix in 1979. It provides a concise language for performing transformations on files. An entire program can be provided to the awk interpreter as the first argument on the command line. Because awk string literals use double quotes, single quotes are usually used to quote the awk program for the benefit of the shell. Here's an example which prints the default shell used by root:

awk 'BEGIN{FS=":"} $1=="root" {print $7}' /etc/passwd

An awk script is sequence of pattern-action pairs. Awk will iterate through the lines of standard input or the lines of the specified input files, testing each pattern against the line and executing the corresponding action if the pattern matches.

Patterns are usually slash delimited regular expressions, e.g. /lorem/, and logical expressions built up from them using the logical operators &&, ||, and !. If an action is provided without an accompanying pattern, the action is executed once for every line of input. The keywords BEGIN and END are special patterns which cause the following action to be executed once at the start and end of execution, respectively.


Apache Pig docs

Pig is language for specifying Hadoop map reduce jobs. Pig scripts are shorter than equivalent Java source code, especially if joins are required.

There are products such as Hive which can convert an SQL statement to a map reduce job, but Pig has an advantage over Hive in that it can handle a greater variety of data formats in the input files.

Although Pig is intended to be used with a Hadoop grid, Hadoop is not required to run a Pig job. Running a Pig job locally is a convenient way to test a Pig job before running it on a grid.

In addition to some numeric and string data types, Pig provides three compound data types: bag, tuples, and map. A bag is an array of tuples. It is equivalent to a database table; it is the database type which Pig uses to hold data which it reads in from files.

Pig has a limited type of variable called an alias. The only data type which can be stored in an alias is a bag. When a bag is stored in an alias it is called a relation or an outer bag. A bag can also be stored in the field of a tuple, in which case it is called an inner bag.

pig relational operators:

Pig provides the following 15 operators for manipulating relations:

relational operatorinput relationsoutput relationsoutput rows
CROSS2+1N1 x N2 x …
DISTINCT11N or fewer
FILTER11N or fewer
GROUP1+1number of groups
JOIN2+1bounded by N1 x N2 x …
LIMIT11min(N, limit argument)
LOAD01lines in file
MAPREDUCE01depends on mapreduce job
SAMPLE11N * fraction argument
SPLIT1ncan exceed N if split conditions overlap
STREAM11depends on external script
UNION2+1N1 + N2 + … + Nn

Most of the above operators create a new relation from existing relations. Exceptions are LOAD and MAPREDUCE which create relations from external files, STORE which writes a relation to a file, and SPLIT which can create more than one relation.

piggybank UDFs:

It is easy to write user defined functions (UDFs) in Java and make them available to pig. piggybank.jar and piggybank-0.3-amzn.jar are two publicly available libraries of UDFs.

If the Piggybank jar is in the home directory when the Pig script is run, the functions can be made available with the following code at the top of the Pig script:

REGISTER /PATH/TO/piggybank.jar;
REGISTER /PATH/TO/piggybank-0.3-amzn.jar;

DEFINE DATE_TIME    org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();
issue tracker | content of this page licensed under creative commons attribution-sharealike 3.0