Apache Pig Diagnostic Operators

Diagnostic operators are used to verifying the execution flow, displaying data on the terminal, checking logical and physical execution plans of MapReduce, and so on.

Pig provides the below list of Diagnostic Operators.

DUMP Operator
DESCRIBE Operator
EXPLAIN Operator
ILLUSTRATE Operator

Let us see each Diagnostic operator in detail.

1. DUMP Operator

In Pig, the DUMP operator is used to projecting the result of Pig Latin on the terminal.

Syntax:

grunt> DUMP alias;

To perform the DUMP operation, first, we will load a file name “employee.txt” in HDFS “/pigexample/” location. The content of the file is mentioned below.



    1001,James,Butt,New Orleans,Orleans
    1002,Josephine,Darakjy,Brighton,Livingston
    1003,Art,Venere,Bridgeport,Gloucester
    1004,Lenna,Paprocki,Anchorage,Anchorage
    1005,Donette,Foller,Hamilton,Butler
    1006,Simona,Morasca,Ashland,Ashland
    1007,Mitsue,Tollner,Chicago,Cook
    1008,Leota,Dilliard,San Jose,Santa
    1009,Sage,Wieser,Sioux Falls,Minnehaha

Command:

grunt> employees = LOAD '/pigexample/employee.txt' USING PigStorage(',') as (empid:int,firstname:chararray,lastname:chararray,city:chararray,county:chararray );

Once data is loaded in Pig, we will use the Dump operator to print the content of the “employee.txt” file.

Command:

grunt> DUMP employees;

A MapReduce job will start when we will submit the Dump command. It will read data from HDFS and show the below output.

Output:

2. DESCRIBE Operator

DESCRIBE operator is used to viewing the structure of a schema.

Syntax:

grunt> DESCRIBE alias;

We will use DESCRIBE operator to see the structure of schema “employees” which was created in the DUMP section.

Command:

grunt> DESCRIBE employees;

Output:

3. EXPLAIN Operator

EXPLAIN operator is used to reviewing the logical, physical, and MapReduce execution plans that are used to compute the specified relationship.

Syntax:

grunt> EXPLAIN alias;

We will use EXPLAIN operator to see the logical, physical, and MapReduce execution plans of schema “employees” which was created in the DUMP section.

Command:

grunt> EXPLAIN employees;

Output:



    grunt> EXPLAIN employees;
    2020-07-11 03:19:35,289 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was
    not set... will not generate code.
    2020-07-11 03:19:35,446 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
    {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter,
    LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer,
    PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter,
    StreamTypeCastInserter]}
    #-----------------------------------------------
    # New Logical Plan:
    #-----------------------------------------------
    employees: (Name: LOStore Schema:
    empid#11:int,firstname#12:chararray,lastname#13:chararray,city#14:chararray,county#15:chararray)
    |
    |---employees: (Name: LOForEach Schema:
    empid#11:int,firstname#12:chararray,lastname#13:chararray,city#14:chararray,county#15:chararray)
    | |
    | (Name: LOGenerate[false,false,false,false,false] Schema:
    empid#11:int,firstname#12:chararray,lastname#13:chararray,city#14:chararray,county#15:chararray)ColumnPrune:OutputUids=[11,
    12, 13, 14, 15]ColumnPrune:InputUids=[11, 12, 13, 14, 15]
    | | |
    | | (Name: Cast Type: int Uid: 11)
    | | |
    | | |---empid:(Name: Project Type: bytearray Uid: 11 Input: 0 Column: (*))
    | | |
    | | (Name: Cast Type: chararray Uid: 12)
    | | |
    | | |---firstname:(Name: Project Type: bytearray Uid: 12 Input: 1 Column: (*))
    | | |
    | | (Name: Cast Type: chararray Uid: 13)
    | | |
    | | |---lastname:(Name: Project Type: bytearray Uid: 13 Input: 2 Column: (*))
    | | |
    | | (Name: Cast Type: chararray Uid: 14)
    | | |
    | | |---city:(Name: Project Type: bytearray Uid: 14 Input: 3 Column: (*))
    | | |
    | | (Name: Cast Type: chararray Uid: 15)
    | | |
    | | |---county:(Name: Project Type: bytearray Uid: 15 Input: 4 Column: (*))
    | |
    | |---(Name: LOInnerLoad[0] Schema: empid#11:bytearray)
    | |
    | |---(Name: LOInnerLoad[1] Schema: firstname#12:bytearray)
    | |
    | |---(Name: LOInnerLoad[2] Schema: lastname#13:bytearray)
    | |
    | |---(Name: LOInnerLoad[3] Schema: city#14:bytearray)
    | |
    | |---(Name: LOInnerLoad[4] Schema: county#15:bytearray)
    |
    |---employees: (Name: LOLoad Schema:
    empid#11:bytearray,firstname#12:bytearray,lastname#13:bytearray,city#14:bytearray,county#15:bytearray)RequiredFields:null
    2020-07-11 03:19:35,804 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap
    (Tenured Gen) of size 699072512 to monitor. collectionUsageThreshold = 489350752, usageThreshold =
    489350752
    #-----------------------------------------------
    # Physical Plan:
    #-----------------------------------------------
    employees: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-17
    |
    |---employees: New For Each(false,false,false,false,false)[bag] - scope-16
    | |
    | Cast[int] - scope-2
    | |
    | |---Project[bytearray][0] - scope-1
    | |
    | Cast[chararray] - scope-5
    | |
    | |---Project[bytearray][1] - scope-4
    | |
    | Cast[chararray] - scope-8
    | |
    | |---Project[bytearray][2] - scope-7
    | |
    | Cast[chararray] - scope-11
    | |
    | |---Project[bytearray][3] - scope-10
    | |
    | Cast[chararray] - scope-14
    | |
    | |---Project[bytearray][4] - scope-13
    |
    |---employees: Load(/pigexample/employee.txt:PigStorage(',')) - scope-0

    2020-07-11 03:19:36,167 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold:
    100 optimistic? false
    2020-07-11 03:19:36,298 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before
    optimization: 1
    2020-07-11 03:19:36,299 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after
    optimization: 1
    #--------------------------------------------------
    # Map Reduce Plan
    #--------------------------------------------------
    MapReduce node scope-18
    Map Plan
    employees: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-17
    |
    |---employees: New For Each(false,false,false,false,false)[bag] - scope-16
    | |
    | Cast[int] - scope-2
    | |
    | |---Project[bytearray][0] - scope-1
    | |
    | Cast[chararray] - scope-5
    | |
    | |---Project[bytearray][1] - scope-4
    | |
    | Cast[chararray] - scope-8
    | |
    | |---Project[bytearray][2] - scope-7
    | |
    | Cast[chararray] - scope-11
    | |
    | |---Project[bytearray][3] - scope-10
    | |
    | Cast[chararray] - scope-14
    | |
    | |---Project[bytearray][4] - scope-13
    |
    |---employees: Load(/pigexample/employee.txt:PigStorage(',')) - scope-0--------
    Global sort: false
    ----------------

4. ILLUSTRATE Operator

ILLUSTRATE operator is used to displaying step-by-step execution of a sequence of Pig Latin statements.

Syntax:

grunt> ILLUSTRATE alias;

We will use ILLUSTRATE operator to see the step-by-step execution of a sequence of schema “employees” which was created in the DUMP section.

Command:

grunt> ILLUSTRATE employees;

Output:

The following output is shown if we use ILLUSTRATE employees command.



    2020-07-11 03:28:11,993 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap
    (Tenured Gen) of size 699072512 to monitor. collectionUsageThreshold = 489350752, usageThreshold =
    489350752
    2020-07-11 03:28:11,994 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has
    already been initialized
    2020-07-11 03:28:11,995 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed
    per job phase (AliasName[line,offset]): M: employees[1,12] C: R:
    ---------------------------------------------------------------------------------------------------------------------------
    | employees | empid:int | firstname:chararray | lastname:chararray | city:chararray | county:chararray |
    ---------------------------------------------------------------------------------------------------------------------------
    | | 1008 | Leota | Dilliard | San Jose | Santa |
    ---------------------------------------------------------------------------------------------------------------------------

Again if we run “ILLUSTRATE employees;” we get another output because it is displaying execution step by step.



    2020-07-11 03:28:19,644 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap
    (Tenured Gen) of size 699072512 to monitor. collectionUsageThreshold = 489350752, usageThreshold =
    489350752
    2020-07-11 03:28:19,655 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has
    already been initialized
    2020-07-11 03:28:19,656 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed
    per job phase (AliasName[line,offset]): M: employees[1,12] C: R:
    ---------------------------------------------------------------------------------------------------------------------------
    | employees | empid:int | firstname:chararray | lastname:chararray | city:chararray | county:chararray |
    ---------------------------------------------------------------------------------------------------------------------------
    | | 1014 | Yuki | Whobrey | Rockford | Winnebago |
    ---------------------------------------------------------------------------------------------------------------------------

Apache Pig Diagnostic Operators

Pig - Load & Store Operators

Pig - Grouping & Joining

1. DUMP Operator

Syntax:

Command:

Command:

Output:

2. DESCRIBE Operator

Syntax:

Command:

Output:

3. EXPLAIN Operator

Syntax:

Command:

Output:

4. ILLUSTRATE Operator

Syntax:

Command:

Output: