Apache Pig Filter

Apache Pig provides Filter operators to select required data by putting a condition on the relation.

Apache Pig supports the following filtering operators:

Filter Operator
Distinct Operator
Foreach Operator

We have used the “department.txt” dataset to perform these operations. We will put “department.txt” in HDFS location “/pigexample/” from the local file system.

Content of “department.txt”:



    1001,Bette,Nicka,LA,70116
    1002,Veronika,Inouye,MI,48116
    1003,Willard,Kolmetz,NJ,8014
    1004,Maryann,Royster,AK,99501
    1005,Alisha,Slusarski,OH,45011
    1006,Allene,Iturbide,OH,44805
    1007,Chanel,Caudy,IL,60632
    1008,Ezekiel,Chui,CA,95111
    1009,Willow,Kusko,SD,57105
    1010,Bernardo,Figeroa,MD,21224

Copy “department.txt” from the local filesystem into the HDFS directory “/pigexample/” using the command below.

Command:

$ hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/department.txt /pigexample/

Now we will create a relation and load data from HDFS to Pig.

Command:

grunt> deptdata = LOAD '/pigexample/department.txt' USING PigStorage(',') as (deptid:int,empname:chararray,city:chararray,state:chararray,zip:int );
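Before filtering, it can help to confirm that the relation was loaded with the expected schema. The lines below are a minimal sketch, assuming the LOAD statement above has already run in the same Grunt session. The alias `firstrows` is a hypothetical name used only for illustration.

```pig
-- Print the schema that the AS clause attached to the relation.
DESCRIBE deptdata;

-- Look at a few tuples instead of dumping the whole relation.
-- 'firstrows' is a hypothetical alias, not part of the original tutorial.
firstrows = LIMIT deptdata 3;
DUMP firstrows;
```

Because `zip` is declared as `int`, a value such as 8014 is read as a number, not as a five-character string.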

1. Filter Operator

The FILTER operator selects tuples from a relation based on a condition.

Syntax:

grunt> alias = FILTER alias BY expression;

We will use a filter condition to select the tuples whose state is 'LA' from relation “deptdata”, and then print the matching records on the terminal with the DUMP operator.

Command:

grunt> filterdata = FILTER deptdata BY state == 'LA';
grunt> DUMP filterdata;

Output:

[Figure: filter operator example]
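A FILTER expression can also combine several comparisons or use a regular expression. The following is a small sketch against the same “deptdata” relation. The aliases `ohhighzip` and `wnames` are hypothetical names chosen for this illustration.

```pig
-- Combine two comparisons with AND: Ohio rows whose zip is above 45000.
-- 'ohhighzip' is a hypothetical alias.
ohhighzip = FILTER deptdata BY state == 'OH' AND zip > 45000;
DUMP ohhighzip;

-- MATCHES applies a Java regular expression to the whole field value:
-- keep rows whose empname starts with 'W'.
-- 'wnames' is a hypothetical alias.
wnames = FILTER deptdata BY empname MATCHES 'W.*';
DUMP wnames;
```

With the sample data, the first filter should keep only record 1005 (record 1006 is in OH but its zip is 44805), and the second should keep records 1003 and 1009.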

2. Distinct Operator

The DISTINCT operator removes duplicate tuples from a relation.

Syntax:

grunt> alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];

We will use the DISTINCT operator to remove duplicate tuples from relation “deptdata”, and then print the records on the terminal with the DUMP operator.

Command:

grunt> distinctdata = DISTINCT deptdata;
grunt> DUMP distinctdata;

Output:

[Figure: distinct operator example]
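DISTINCT compares whole tuples, and every row in “department.txt” has a unique id, so applying it to the full relation returns all ten records. Its effect is easier to see on a single projected column. The sketch below assumes the same “deptdata” relation. The aliases `states` and `uniquestates` are hypothetical.

```pig
-- Project only the state column; OH appears twice in the sample data.
-- 'states' is a hypothetical alias.
states = FOREACH deptdata GENERATE state;

-- DISTINCT collapses the two OH tuples into one.
-- 'uniquestates' is a hypothetical alias.
uniquestates = DISTINCT states;
DUMP uniquestates;
```

With the ten sample rows this should leave nine state values. The order of the output is not guaranteed.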

3. Foreach Operator

The FOREACH operator generates a new relation by applying a transformation to the columns of each tuple in an existing relation.

Syntax:

grunt> alias = FOREACH { block | nested_block };

We will use the FOREACH operator to select ‘empname’, ‘city’ and ‘state’ from relation “deptdata” and save the result into another relation, “foreachdata”. We will then print the records on the terminal with the DUMP operator.

Command:

grunt> foreachdata = FOREACH deptdata GENERATE empname,city,state;
grunt> DUMP foreachdata;

Output:

[Figure: foreach example]
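Besides selecting columns, GENERATE can rename fields, apply built-in functions, and refer to fields by position. The lines below are a minimal sketch on the same “deptdata” relation. The aliases `transformed` and `bypos`, and the field names `id` and `nameupper`, are hypothetical.

```pig
-- Rename a field with AS and apply the built-in UPPER function.
-- 'transformed', 'id' and 'nameupper' are hypothetical names.
transformed = FOREACH deptdata GENERATE deptid AS id, UPPER(empname) AS nameupper, state;
DUMP transformed;

-- Positional notation: $0 is the first field (deptid), $3 the fourth (state).
-- 'bypos' is a hypothetical alias.
bypos = FOREACH deptdata GENERATE $0, $3;
DUMP bypos;
```

Positional references are convenient when a relation was loaded without a schema. Named fields are usually easier to read.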