Apache Pig provides filter operators that select the required data by applying a condition to a relation.

Apache Pig supports the following filter operators.

  1. Filter Operator
  2. Distinct Operator
  3. Foreach Operator

The examples below use the “department.txt” dataset. We will copy “department.txt” from the local file system to the HDFS location “/pigexample/”.

Content of “department.txt”:


1001,Bette,Nicka,LA,70116
1002,Veronika,Inouye,MI,48116
1003,Willard,Kolmetz,NJ,8014
1004,Maryann,Royster,AK,99501
1005,Alisha,Slusarski,OH,45011
1006,Allene,Iturbide,OH,44805
1007,Chanel,Caudy,IL,60632
1008,Ezekiel,Chui,CA,95111
1009,Willow,Kusko,SD,57105
1010,Bernardo,Figeroa,MD,21224

We will copy “department.txt” from the local file system into the HDFS directory “/pigexample/” using the following command.

Command:
$ hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/department.txt /pigexample/

Now we will create a relation and load data from HDFS to Pig.

Command:
grunt> deptdata = LOAD '/pigexample/department.txt' USING PigStorage(',') AS (deptid:int, empname:chararray, city:chararray, state:chararray, zip:int);
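
Before running any operators, it can help to confirm that the relation has the intended schema and that the fields were parsed as expected. The following is a minimal sketch using Pig's DESCRIBE and LIMIT operators; the alias `firstrows` is a hypothetical name introduced only for this illustration. The statements can be entered one by one at the grunt> prompt.

```pig
-- Same LOAD statement as above, repeated so the snippet runs on its own.
deptdata = LOAD '/pigexample/department.txt' USING PigStorage(',')
           AS (deptid:int, empname:chararray, city:chararray, state:chararray, zip:int);

-- Print the schema of the relation (field names and types).
DESCRIBE deptdata;

-- Keep at most three tuples and print them to check how the fields were parsed.
-- 'firstrows' is a hypothetical alias used only in this example.
firstrows = LIMIT deptdata 3;
DUMP firstrows;
```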


1. Filter Operator

The FILTER operator is used to select tuples from a relation based on a condition.

Syntax:
grunt> alias = FILTER alias BY expression;

We will use a filter condition to select the tuples where state == 'LA' from the relation “deptdata” and print the matching records on the terminal using the DUMP operator.

Command:
grunt> filterdata = FILTER deptdata BY state == 'LA';
grunt> DUMP filterdata;

Output:
[Image: filter operator example]
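
A FILTER condition can also combine several comparisons with the boolean operators AND, OR, and NOT. As a sketch, assuming the same “department.txt” file and schema as above, the following selects the rows for the state OH whose zip code is greater than 45000. The alias `ohiodata` and the threshold are hypothetical choices made for illustration.

```pig
-- Same LOAD statement as above, repeated so the snippet runs on its own.
deptdata = LOAD '/pigexample/department.txt' USING PigStorage(',')
           AS (deptid:int, empname:chararray, city:chararray, state:chararray, zip:int);

-- Keep only tuples that satisfy both conditions.
-- 'ohiodata' is a hypothetical alias used only in this example.
ohiodata = FILTER deptdata BY (state == 'OH') AND (zip > 45000);
DUMP ohiodata;
```

With the sample data shown earlier, only the tuple with deptid 1005 satisfies both conditions, because the other OH row has zip 44805.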


2. Distinct Operator

The DISTINCT operator is used to remove the duplicate tuples in a relation.

Syntax:
grunt> alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];

We will use the DISTINCT operator to remove duplicate tuples from the relation “deptdata” and print the records on the terminal using the DUMP operator. Because every row in “department.txt” is unique, the result contains all ten tuples.

Command:
grunt> distinctdata = DISTINCT deptdata;
grunt> DUMP distinctdata;

Output:
[Image: distinct operator example]
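
DISTINCT compares whole tuples, so to remove duplicates on a single field the relation is usually projected to that field first with FOREACH (covered in the next section). A minimal sketch, assuming the same “department.txt” file and schema as above; `states` and `uniquestates` are hypothetical aliases.

```pig
-- Same LOAD statement as above, repeated so the snippet runs on its own.
deptdata = LOAD '/pigexample/department.txt' USING PigStorage(',')
           AS (deptid:int, empname:chararray, city:chararray, state:chararray, zip:int);

-- Project the relation down to the single field of interest.
states = FOREACH deptdata GENERATE state;

-- Remove duplicate single-field tuples.
uniquestates = DISTINCT states;
DUMP uniquestates;
```

In the sample data the state OH appears twice, so the ten input rows yield nine distinct states. The order of the output tuples should not be relied upon.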


3. Foreach Operator

The FOREACH operator generates a new relation by transforming the columns of an existing relation, for example by selecting a subset of columns or applying expressions to them.

Syntax:
grunt> alias = FOREACH { block | nested_block };

We will use the FOREACH operator to select ‘empname’, ‘city’, and ‘state’ from the relation “deptdata”, save the result in another relation, “foreachdata”, and print the records on the terminal using the DUMP operator.

Command:
grunt> foreachdata = FOREACH deptdata GENERATE empname,city,state;
grunt> DUMP foreachdata;

Output:
[Image: foreach example]
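
FOREACH ... GENERATE can also compute new fields from expressions and name them with AS, rather than only selecting existing columns. The sketch below assumes the same “department.txt” file and schema as above and uses Pig's built-in UPPER and CONCAT functions. The alias `labeldata` and the field names `empname_upper` and `location` are hypothetical.

```pig
-- Same LOAD statement as above, repeated so the snippet runs on its own.
deptdata = LOAD '/pigexample/department.txt' USING PigStorage(',')
           AS (deptid:int, empname:chararray, city:chararray, state:chararray, zip:int);

-- Build a new relation with one passthrough field and two computed fields.
-- 'labeldata', 'empname_upper', and 'location' are hypothetical names.
labeldata = FOREACH deptdata GENERATE
                deptid,
                UPPER(empname) AS empname_upper,                            -- uppercase copy of empname
                CONCAT(state, CONCAT('-', (chararray)zip)) AS location;     -- state and zip joined by '-'
DUMP labeldata;
```

The zip field is cast to chararray because CONCAT expects its arguments to have the same type. The nested two-argument CONCAT calls are used here so that the sketch does not depend on a Pig version that accepts more than two arguments.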