Apache Pig Sort

Apache Pig provides Order By and Limit operators to perform shorting and restricting the relations.

ORDER BY
LIMIT

We have used the “finance_data.txt” dataset to perform these operations. We will put “finance_data.txt” in the HDFS location “/pigexample/” from the local file system.

Content of “finance.txt”:



    1,Chanel,Shawnee,KS,9133882079
    2,Ezekiel,Easton,MD,4106691642
    3,Willow,New York,NY,2125824976
    4,Bernardo,Conroe,TX,9363363951
    5,Ammie,Columbus,OH,6148019788
    6,Francine,Las Cruces,NM,5059773911
    7,Ernie,Ridgefield Park,NJ,2017096245
    8,Albina,Dunellen,NJ,7329247882
    9,Alishia,New York,NY,2128601579
    10,Solange,Metairie,LA,5049799175

We will load “finance.txt” from the local filesystem into HDFS “/pigexample/” using the below commands.

Command:

$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/finance.txt /pigexample/

Now we will create a relation and load data from HDFS to Pig.

Command:

grunt> findata = LOAD '/pigexample/finance_data.txt' USING PigStorage(',') as (empid:int,empname:chararray,city:chararray,state:chararray,phone:int );

1. ORDER BY

ORDER BY operator is used to short the content of the relation based on one or more fields.

Syntax:

grunt> alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];

We will perform ORDER BY operation on relation “findata” using the state column.

Command:

grunt> orderdata = ORDER findata BY state DESC;

Output:

Now we will use the DUMP operator to print the output of relation “orderdata” on screen.

Command:

grunt> DUMP orderdata;

Output:

2. LIMIT

The LIMIT operator provides a limited number of tuples for a relation.

Syntax:

grunt> alias = LIMIT alias n;

We will use the LIMIT operation to restrict the output of the relation “findata” to ten rows and using the DUMP operator we will print records on the terminal.

Command:

grunt> limitdata = LIMIT findata 10;
grunt> DUMP limitdata;

Apache Pig Sort

Pig - Combining & Splitting

Pig - Filter

Command:

Command:

1. ORDER BY

Syntax:

Command:

Output:

Command:

Output:

2. LIMIT

Syntax:

Command:

Output: