Apache Pig Grouping & Joining

Apache Pig Latin provides the below list of group and joins operators.

GROUP Operator
COGROUP Operator
JOIN Operator
CROSS Operator

Let us see each operator in detail.

1. GROUP Operator

The GROUP operator performs the grouping of the same group key tuples. After processing the generated output is a relation that has a tuple per group.

Syntax:

grunt> alias = GROUP alias { ALL | BY exp} [, alias ALL | BY exp …] [USING 'collected' | 'merge'] [PARTITION BY partit] [PARALLEL n];

To perform GROUP operation, first, we will load a file name “employee.txt” in HDFS “/pigexample/” location. The content of the file is mentioned below.



    1001,James,Butt,New Orleans,Orleans
    1002,Josephine,Darakjy,Brighton,Livingston
    1003,Art,Venere,Bridgeport,Gloucester
    1004,Lenna,Paprocki,Anchorage,Anchorage
    1005,Donette,Foller,Hamilton,Butler
    1006,Simona,Morasca,Ashland,Ashland
    1007,Mitsue,Tollner,Chicago,Cook
    1008,Leota,Dilliard,San Jose,Santa
    1009,Sage,Wieser,Sioux Falls,Minnehaha

Command:

grunt> employees = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );

a. GROUP by Single column

Will use the GROUP operator to perform grouping on schema “employees” by city.

After that, we will use the DUMP operator to print the records on the terminal.

Command:

grunt> groupdata = GROUP employees by city;
grunt> DUMP groupdata;

Output:

We can see from the output that two columns are generated.

The first column is a city that we used to group the data.
The second column is showing the city-wise data.

b. GROUP by Multi-Column

We will use the GROUP operator to perform grouping on relation “employees” by city and county.

After that, we will use the DUMP operator to print the records on the terminal.

Command:

grunt> groupdata = GROUP employees by (city,county);
grunt> DUMP groupdata;

Output:

2. COGROUP Operator

COGROUP operators are similar to the GROUP operator in which the GRPOUP operators works on one relation on the other hand the COGROUP operators work on the two and more relations.

Syntax:

grunt> alias = COGROUP alias { ALL | BY exp} [, alias ALL | BY exp …] [USING 'collected' | 'merge'] [PARTITION BY partit] [PARALLEL n];

To perform COGROUP operation we have taken two datasets “employee.txt” and “department.txt”. We will put both files in HDFS location “/pigexample/”.

Content of “employee.txt”:



    1001,James,Butt,New Orleans,Orleans
    1002,Josephine,Darakjy,Brighton,Livingston
    1003,Art,Venere,Bridgeport,Gloucester
    1004,Lenna,Paprocki,Anchorage,Anchorage
    1005,Donette,Foller,Hamilton,Butler
    1006,Simona,Morasca,Ashland,Ashland
    1007,Mitsue,Tollner,Chicago,Cook
    1008,Leota,Dilliard,San Jose,Santa
    1009,Sage,Wieser,Sioux Falls,Minnehaha

Content of “department.txt”:



    1001,James,Butt,New Orleans,Orleans
    1002,Josephine,Darakjy,Brighton,Livingston
    1003,Art,Venere,Bridgeport,Gloucester
    1004,Lenna,Paprocki,Anchorage,Anchorage
    1005,Donette,Foller,Hamilton,Butler
    1006,Simona,Morasca,Ashland,Ashland
    1007,Mitsue,Tollner,Chicago,Cook
    1008,Leota,Dilliard,San Jose,Santa
    1009,Sage,Wieser,Sioux Falls,Minnehaha

We will load both data files from the local filesystem into HDFS “/pigexample/” using the below commands.

Command:

$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/employee.txt /pigexample/
$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/department.txt /pigexample/

Now we will create relations for both files and load data from HDFS to Pig.

Command:

grunt> empdata = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );
grunt> deptdata = LOAD '/pigexample/department.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,state:chararray,zip:int );

We will group the records/tuples of the relation “empdata” and “deptdata” using city and state and print the result on the terminal using the DUMP operator.

Command:

grunt> cogroupdata = COGROUP empdata by city, deptdata by state;
grunt> DUMP cogroupdata;

Output:

COGROUP operator will group the tuples from each relation according to city and state and print the result.

3. JOIN Operator

The JOIN operators are used to JOIN two or more relations to get a desirable result. To perform the JOIN operation we declare one or more tuples from each relation.

Pig supports the below list of JOIN operators.

Self-Join
Inner-Join
Left Outer Join
Right Outer Join
Full Outer Join

Let us see each JOIN operator in detail.

To perform the JOIN operation we have taken two datasets “employee.txt” and “invoice.txt”. We will put both files in HDFS location “/pigexample/” from the local file system.

Content of “employee.txt”:



    1001,James,Butt,New Orleans,Orleans
    1002,Josephine,Darakjy,Brighton,Livingston
    1003,Art,Venere,Bridgeport,Gloucester
    1004,Lenna,Paprocki,Anchorage,Anchorage
    1005,Donette,Foller,Hamilton,Butler
    1006,Simona,Morasca,Ashland,Ashland
    1007,Mitsue,Tollner,Chicago,Cook
    1008,Leota,Dilliard,San Jose,Santa
    1009,Sage,Wieser,Sioux Falls,Minnehaha

Content of “invoice.txt”:



    1002,2019-10-08 00:00:00,23000
    1001,2012-10-08 00:00:00,23500
    1001,2012-11-20 00:00:00,13260
    1002,2011-05-20 00:00:00,56030
    1002,2011-10-08 00:00:00,34200
    1001,2010-10-08 00:00:00,32300
    1005,2001-11-20 00:00:00,12360
    1006,2001-05-20 00:00:00,32450

We will load both data files from the local filesystem into HDFS “/pigexample/” using the below commands.

Command:

$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/employee.txt /pigexample/
$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/invoice.txt /pigexample/

Now we will create relations for both files and load data from HDFS to Pig.

Command:

grunt> empdata = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );
grunt> invdata = LOAD '/pigexample/invoice.txt' USING PigStorage(',') as (custid:int,data:chararray,amount:int );

Now we are good to perform JOIN operations on these two datasets.

3.1 Self-Join

Self-JOIN performs the joining on a single table by creating multiple aliases. To perform Self-JOIN, we will load “employee.txt” data two times in different aliases.

Syntax:

grunt> alias = JOIN alias BY {expression|'('exp [, exp …]')'} (, alias BY {expression|'('exp [, exp …]')'} …) [USING 'replicated' | 'bloom' | 'skewed' | 'merge' | 'merge-sparse'] [PARTITION BY partit] [PARALLEL n];

Command:

grunt> empdata1 = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );
grunt> empdata2 = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );

Output:

Now we will perform the Self-JOIN operation by joining two relations (empdata1 and empdata2) and print the output on the terminal using the DUMP operator.

Command:

grunt> empdata3 = JOIN empdata1 BY empid, empdata2 BY empid;
grunt> DUMP empdata3;

Output:

3.2 Inner-Join

Inner-JOIN or Equi-JOIN is used to perform the joining of two or more relations based on common field values. Inner-JOIN ignores null keys.

Syntax:

grunt> alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING 'replicated' | 'bloom' | 'skewed' | 'merge' | 'merge-sparse'] [PARTITION BY partitioner] [PARALLEL n];

Now we will perform Inner-JOIN operation on two relations “empdata” and “invdata” and we will print output on-screen using the DUMP operator.

Command:

grunt> empdata = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );
grunt> invdata = LOAD '/pigexample/invoice.txt' USING PigStorage(',') as (custid:int,data:chararray,amount:int );
grunt> empinvoice = JOIN empdata BY empid, invdata BY custid;
grunt> DUMP empinvoice;

Output:

Apache Pig Grouping & Joining Pig - Diagnostic Operators Pig - Combining & Splitting

Pig - Diagnostic Operators

Pig - Combining & Splitting

1. GROUP Operator

Syntax:

Command:

a. GROUP by Single column

Command:

Output:

b. GROUP by Multi-Column

Command:

Output:

2. COGROUP Operator

Syntax:

Command:

Command:

Command:

Output:

3. JOIN Operator

Command:

Command:

3.1 Self-Join

Syntax:

Command:

Output:

Command:

Output:

3.2 Inner-Join

Syntax:

Command:

Output:

3.3 Left Outer-Join

Syntax:

Command:

Output:

3.4 Right Outer-Join

Syntax:

Command:

Output:

3.5 Full Outer-Join

Syntax:

Command:

Output:

4. CROSS Operator

Syntax:

Command:

Output:

Apache Pig Grouping & Joining