Apache Pig provides filtering operators that select the required data by applying a condition to a relation.
The operators described below are supported.
The examples use the “department.txt” dataset, which we copy from the local file system to the HDFS location “/pigexample/”.
Content of “department.txt”:
1001,Bette,Nicka,LA,70116
1002,Veronika,Inouye,MI,48116
1003,Willard,Kolmetz,NJ,8014
1004,Maryann,Royster,AK,99501
1005,Alisha,Slusarski,OH,45011
1006,Allene,Iturbide,OH,44805
1007,Chanel,Caudy,IL,60632
1008,Ezekiel,Chui,CA,95111
1009,Willow,Kusko,SD,57105
1010,Bernardo,Figeroa,MD,21224
Copy “department.txt” from the local file system into the HDFS directory “/pigexample/” using the command below.
Command:
$ hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/department.txt /pigexample/
Now we will create a relation and load data from HDFS to Pig.
Command:
grunt> deptdata = LOAD '/pigexample/department.txt' USING PigStorage(',') AS (deptid:int, empname:chararray, city:chararray, state:chararray, zip:int);
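Before applying any operator, it can help to confirm that the relation carries the expected schema. The following is a minimal sketch, assuming the LOAD statement above ran in the same Grunt session. The relation name `sampledata` is a placeholder chosen for illustration.

```pig
-- Print the field names and types declared in the LOAD statement.
grunt> DESCRIBE deptdata;
-- 'sampledata' is a hypothetical name; keep three tuples for a quick look.
grunt> sampledata = LIMIT deptdata 3;
grunt> DUMP sampledata;
```

DESCRIBE only reports the schema, while DUMP actually reads the file, so a wrong path or delimiter shows up at the DUMP step.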
1. Filter Operator
The FILTER operator selects tuples from a relation based on a condition.
Syntax:
grunt> alias = FILTER alias BY expression;
We will use a filter condition to select the tuples where state == 'LA' from the relation “deptdata” and print the records on the terminal using the DUMP operator.
Command:
grunt> filterdata = FILTER deptdata BY state == 'LA';
grunt> DUMP filterdata;
Output (the single matching tuple from the sample data):
(1001,Bette,Nicka,LA,70116)
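A filter expression can also combine several comparisons with AND, OR, and NOT. The following is a minimal sketch on the same relation. The relation name `ohdata` and the zip threshold are assumptions made for illustration.

```pig
-- 'ohdata' is a hypothetical relation name.
-- Keep only the tuples whose state is 'OH' and whose zip is above 45000.
grunt> ohdata = FILTER deptdata BY (state == 'OH') AND (zip > 45000);
grunt> DUMP ohdata;
```

With the sample data, two tuples have state 'OH' (1005 and 1006), but only 1005 has a zip above 45000, so it is the only tuple that satisfies both conditions.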
2. Distinct Operator
The DISTINCT operator is used to remove the duplicate tuples in a relation.
Syntax:
grunt> alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];
We will use the DISTINCT operator to remove duplicate tuples from the relation “deptdata” and print the records on the terminal using the DUMP operator.
Command:
grunt> distinctdata = DISTINCT deptdata;
grunt> DUMP distinctdata;
Output:
Because “department.txt” contains no duplicate tuples, the output contains all ten records.
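DISTINCT has a visible effect once a projection introduces repeated values. The following is a minimal sketch that first keeps only the state field and then removes the repeats. It relies on the FOREACH operator, and the relation names `statedata` and `distinctstates` are placeholders.

```pig
-- 'statedata' and 'distinctstates' are hypothetical relation names.
-- Project the state field, then drop repeated values.
grunt> statedata = FOREACH deptdata GENERATE state;
grunt> distinctstates = DISTINCT statedata;
grunt> DUMP distinctstates;
```

In the sample data the state 'OH' appears twice, so the ten input tuples reduce to nine distinct states.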
3. Foreach Operator
The FOREACH operator generates data transformations based on the columns of a relation.
Syntax:
grunt> alias = FOREACH { block | nested_block };
We will use the FOREACH operator to select ‘empname’, ‘city’, and ‘state’ from the relation “deptdata”, save the result into another relation, and print the records on the terminal using the DUMP operator.
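The projection described above can be sketched as follows. The name of the target relation is not given in the text, so `foreachdata` is a placeholder chosen for illustration.

```pig
-- 'foreachdata' is a hypothetical name; the text does not state the target relation.
-- Keep only the empname, city, and state fields of each tuple.
grunt> foreachdata = FOREACH deptdata GENERATE empname, city, state;
grunt> DUMP foreachdata;
```

Each output tuple then holds three fields instead of five; for the first record of the sample data this is (Bette,Nicka,LA).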