Apache Pig provides filter operators that select the required data by applying a condition to a relation.
This section covers three of them: FILTER, DISTINCT, and FOREACH.
The examples below use the “department.txt” dataset, which we copy from the local file system to the HDFS location “/pigexample/”.
Content of “department.txt”:
1001,Bette,Nicka,LA,70116
1002,Veronika,Inouye,MI,48116
1003,Willard,Kolmetz,NJ,8014
1004,Maryann,Royster,AK,99501
1005,Alisha,Slusarski,OH,45011
1006,Allene,Iturbide,OH,44805
1007,Chanel,Caudy,IL,60632
1008,Ezekiel,Chui,CA,95111
1009,Willow,Kusko,SD,57105
1010,Bernardo,Figeroa,MD,21224
We will copy “department.txt” from the local file system into the HDFS directory “/pigexample/” using the command below.
Command:
$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/department.txt /pigexample/
Now we will create a relation and load data from HDFS to Pig.
Command:
grunt> deptdata = LOAD '/pigexample/department.txt' USING PigStorage(',') as (deptid:int,empname:chararray,city:chararray,state:chararray,zip:int );
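Before filtering, it can help to confirm that the schema was applied and that the file was parsed as expected. The following is a minimal sketch using the built-in DESCRIBE and LIMIT operators; the alias "sampledata" is a hypothetical name introduced here for illustration.

```pig
-- Print the schema of the relation (field names and types)
grunt> DESCRIBE deptdata;
-- Take a few tuples and print them instead of dumping the whole relation
grunt> sampledata = LIMIT deptdata 3;
grunt> DUMP sampledata;
```

LIMIT does not guarantee which tuples are returned unless the relation has been ordered first, so this is only a quick sanity check.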
1. Filter Operator
The FILTER operator selects tuples from a relation based on a condition.
Syntax:
grunt> alias = FILTER alias BY expression;
We will use a filter condition to select the tuples whose state is LA from the relation “deptdata” and print the records on the terminal with the DUMP operator.
Command:
grunt> filterdata = FILTER deptdata BY state == 'LA';
grunt> DUMP filterdata;
Output:
[Image: filter operator example]
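A FILTER expression can also combine several conditions or match a pattern. The sketch below assumes the same “deptdata” relation; the aliases "ohiodata" and "wnames" are hypothetical names chosen for illustration.

```pig
-- Combine comparisons with AND (OR and NOT work the same way)
grunt> ohiodata = FILTER deptdata BY state == 'OH' AND zip > 45000;
grunt> DUMP ohiodata;
-- MATCHES tests a chararray against a regular expression that must match the whole value
grunt> wnames = FILTER deptdata BY empname MATCHES 'W.*';
grunt> DUMP wnames;
```

With the sample data, the first filter should keep only the tuple with deptid 1005, and the second should keep the tuples for Willard and Willow (deptid 1003 and 1009).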
2. Distinct Operator
The DISTINCT operator removes duplicate tuples from a relation.
Syntax:
grunt> alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];
We will use the DISTINCT operator to remove duplicate tuples from the relation “deptdata” and print the records on the terminal with the DUMP operator.
Command:
grunt> distinctdata = DISTINCT deptdata;
grunt> DUMP distinctdata;
Output:
[Image: distinct operator example]
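Every row in “department.txt” has a unique deptid, so DISTINCT on the full relation returns all ten tuples. DISTINCT compares whole tuples, so to deduplicate on a single field the usual approach is to project that field first. The following is a minimal sketch; the aliases "statedata" and "uniquestates" are hypothetical names chosen for illustration.

```pig
-- Keep only the state field, then remove repeated values
grunt> statedata = FOREACH deptdata GENERATE state;
grunt> uniquestates = DISTINCT statedata;
grunt> DUMP uniquestates;
```

Because OH appears twice in the sample data, this should yield nine tuples rather than ten.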
3. Foreach Operator
The FOREACH operator generates data transformations based on the columns of a relation.
Syntax:
grunt> alias = FOREACH { block | nested_block };
We will use the FOREACH operator to select ‘empname’, ‘city’ and ‘state’ from the relation “deptdata”, save the result into another relation “foreachdata”, and print the records on the terminal with the DUMP operator.
Command:
grunt> foreachdata = FOREACH deptdata GENERATE empname,city,state;
grunt> DUMP foreachdata;
Output:
[Image: foreach example]
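FOREACH ... GENERATE can also compute new fields from expressions, not just pick existing columns. The sketch below assumes the same “deptdata” schema and uses the built-in UPPER and CONCAT functions; the alias "exprdata" and the output field names "empname_upper" and "fullname" are hypothetical names chosen for illustration.

```pig
-- Derive new fields from existing ones and name them with AS
grunt> exprdata = FOREACH deptdata GENERATE deptid, UPPER(empname) AS empname_upper, CONCAT(empname, CONCAT(' ', city)) AS fullname;
grunt> DUMP exprdata;
```

The nested two-argument CONCAT is used here because it works across Pig versions; newer releases also accept more than two arguments in a single call.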