What are Apache Oozie Action Nodes?
Apache Oozie action nodes define the jobs, the individual units of work that are chained together to make up an Oozie workflow. Through action nodes, a workflow triggers the execution of computation/processing tasks such as the various types of Hadoop jobs.
Action Definition
Actions are defined in the workflow XML using a set of elements that are specific and relevant to that action type. Some elements are common across action types, while others are type-specific; for example, the Pig action requires a "script" element, but the Java action does not. Because Oozie is a workflow system customized for Hadoop, defining these actions for the various Hadoop tools is easy and natural.
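Regardless of its type, every action node shares the same outer skeleton in the workflow XML: the type-specific element is wrapped in an action element with ok and error transitions. A minimal sketch (the action and transition names here are hypothetical):
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.4">
    ...
    <action name="my-action">
        <!-- type-specific element goes here, e.g. <map-reduce>, <pig>, <hive>, ... -->
        <ok to="next-node"/>
        <error to="fail-node"/>
    </action>
    ...
</workflow-app>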
Action Types
The following action nodes are supported by Apache Oozie.
- MapReduce Action
- Java Action
- Pig Action
- FS Action
- Sub-Workflow Action
- Hive Action
- DistCp Action
- Email Action
- Shell Action
- SSH Action
- Sqoop Action
Let us see each action type in detail.
1. MapReduce Action
The map-reduce action starts a Hadoop MapReduce job from a workflow. It can be configured to perform file system cleanup and directory creation before the MapReduce job starts. To run a Hadoop MapReduce job, all the required Hadoop JobConf properties must be configured.
The following elements should appear in this order when writing the action definition in a workflow (a sketch follows the list below).
- job-tracker (compulsory)
- name-node (compulsory)
- prepare
- streaming or pipes
- job-xml
- configuration
- file
- archive
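As a rough sketch of that ordering, a map-reduce action with a prepare step that deletes the output directory before the job starts might look like the following; the mapper/reducer classes and paths are illustrative, not taken from this tutorial:
<action name="mr-node">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <!-- remove any old output so the job can be safely re-run -->
            <delete path="${nameNode}/user/${wf:user()}/output-data"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.mapper.class</name>
                <value>org.myorg.WordCount.Map</value>
            </property>
            <property>
                <name>mapred.reducer.class</name>
                <value>org.myorg.WordCount.Reduce</value>
            </property>
            <property>
                <name>mapred.input.dir</name>
                <value>/user/${wf:user()}/input-data</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>/user/${wf:user()}/output-data</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>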
1.1 Streaming
Details of a streaming job are specified in the streaming element, since these jobs run executable binaries or scripts as the mapper and reducer. Some streaming jobs require files stored on HDFS to be made available locally to the mapper/reducer scripts; this is done with the file and archive elements.
Streaming jobs support the following elements.
- mapper
- reducer
- record-reader
- record-reader-mapping
- env
Let us see the example of streaming.
...
<map-reduce>
    <job-tracker>foo:8021</job-tracker>
    <name-node>bar:8020</name-node>
    <streaming>
        <mapper>/bin/bash testarchive/bin/mapper.sh testfile</mapper>
        <reducer>/bin/bash testarchive/bin/reducer.sh</reducer>
    </streaming>
    <configuration>
        <property>
            <name>mapred.input.dir</name>
            <value>${input}</value>
        </property>
        <property>
            <name>mapred.output.dir</name>
            <value>${output}</value>
        </property>
        <property>
            <name>stream.num.map.output.key.fields</name>
            <value>3</value>
        </property>
    </configuration>
    <file>/home/cloudduggu/testfile.sh#testfile</file>
    <archive>/home/cloudduggu/testarchive.jar#testarchive</archive>
</map-reduce>
...
1.2 Pipes
Pipes provide a more graceful way to run C++ MapReduce programs, although the feature is not widely used. The user-defined program must be bundled with the workflow application. Certain pipes jobs need files to be present on HDFS for the mapper/reducer, which is accomplished using the file and archive elements. Pipes properties can be overridden by specifying them in the job-xml file or in the configuration element.
Pipes jobs support the following elements.
- map
- reduce
- inputformat
- partitioner
- writer
- program
Let us see an example of pipes. (The program element below points at the bundled C++ executable; its path is illustrative.)
...
<map-reduce>
    <job-tracker>foo:8021</job-tracker>
    <name-node>bar:8020</name-node>
    <pipes>
        <program>testarchive/bin/pipes-example</program>
    </pipes>
    <configuration>
        <property>
            <name>mapred.input.dir</name>
            <value>${input}</value>
        </property>
        <property>
            <name>mapred.output.dir</name>
            <value>${output}</value>
        </property>
    </configuration>
    <archive>/home/cloudduggu/testarchive.jar#testarchive</archive>
</map-reduce>
...
2. Java Action
The Java action runs custom Java code on the Hadoop cluster. It executes the public static void main(String[] args) method of the specified main Java class. The Java application is executed on the Hadoop cluster as a MapReduce job with a single mapper task. Before starting the Java application, the Java action can be configured to perform HDFS file/directory cleanup or HCatalog partition cleanup, which allows Oozie to retry the application after a transient or non-transient failure.
Java action contains the following elements.
- job-tracker (compulsory)
- name-node (compulsory)
- prepare
- configuration
- main-class (compulsory)
- java-opts
- arg
- file
- archive
- capture-output
Let us see the example of Java action.
...
<java>
    <job-tracker>foo:8021</job-tracker>
    <name-node>bar:8020</name-node>
    <configuration>
        <property>
            <name>mapred.queue.name</name>
            <value>default</value>
        </property>
    </configuration>
    <main-class>org.apache.oozie.MyFirstMainClass</main-class>
    <java-opts>-Dblah</java-opts>
    <arg>argument1</arg>
    <arg>argument2</arg>
</java>
...
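If the capture-output element is added, the Java main class can pass values back to the workflow by writing a Java properties file to the path given by the oozie.action.output.properties system property; later nodes can then read those values with the wf:actionData() EL function. A minimal sketch (the node name "java-node" and the key "status" are hypothetical):
<action name="java-node">
    <java>
        ...
        <main-class>org.apache.oozie.MyFirstMainClass</main-class>
        <capture-output/>
    </java>
    <ok to="check-result"/>
    <error to="fail"/>
</action>
<decision name="check-result">
    <switch>
        <!-- reads the "status" property written by the Java main class -->
        <case to="next-step">${wf:actionData('java-node')['status'] eq 'OK'}</case>
        <default to="fail"/>
    </switch>
</decision>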
3. Pig Action
The Pig action executes a Pig job on Hadoop. Pig scripts are written in the Pig Latin language, and Pig translates them into MapReduce jobs for Hadoop. Before starting a Pig job, the Pig action can be configured to perform HDFS file/directory cleanup, which allows Oozie to restart the job after a transient failure.
Pig action contains the following elements.
- job-tracker (compulsory)
- name-node (compulsory)
- prepare
- job-xml
- configuration
- script (compulsory)
- param
- argument
- file
- archive
Let us see the example of Pig action for Oozie schema 0.2.
...
<pig>
    <job-tracker>foo:8021</job-tracker>
    <name-node>bar:8020</name-node>
    <configuration>
        <property>
            <name>mapred.compress.map.output</name>
            <value>true</value>
        </property>
        <property>
            <name>oozie.action.external.stats.write</name>
            <value>true</value>
        </property>
    </configuration>
    <script>/mypigscript.pig</script>   <!-- Pig script bundled with the workflow; name is illustrative -->
    <argument>-param</argument>
    <argument>INPUT=${inputDir}</argument>
    <argument>-param</argument>
    <argument>OUTPUT=${outputDir}/pig-output3</argument>
</pig>
...
4. FS Action
The FS action manipulates files and directories in HDFS from a workflow application. FS commands are executed synchronously within the FS action; only after the commands are completed does the workflow move on to the next action.
FS action supports the following commands.
- move
- delete
- mkdir
- chmod
- touchz
- chgrp
Let us see the example of FS action.
...
<fs>
    <name-node>hdfs://foo:8020</name-node>
    <job-xml>fs-info.xml</job-xml>
    <configuration>
        <property>
            <name>some.property</name>
            <value>some.value</value>
        </property>
    </configuration>
    ...
</fs>
...
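The elided part of the example above is where the file system commands themselves go; they are listed as child elements of fs and run in the order they appear. A sketch with illustrative paths:
<fs>
    <delete path='hdfs://foo:8020/usr/cloudduggu/temp-data'/>
    <mkdir path='archives/${wf:id()}'/>
    <move source='${jobInput}' target='archives/${wf:id()}/processed-input'/>
    <chmod path='${jobOutput}' permissions='-rwxrw-rw-' dir-files='true'/>
</fs>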
5. Sub-Workflow Action
The sub-workflow action runs a child workflow job as part of the parent workflow. The child workflow job can run in the same Oozie system or in another Oozie system. The parent workflow completes only after the child workflow completes.
Sub-Workflow action contains the following elements.
- app-path (compulsory)
- propagate-configuration
- configuration
Let us see the example of Sub-Workflow action.
...
<sub-workflow>
    <app-path>child-wf</app-path>
    <configuration>
        <property>
            <name>input.dir</name>
            <value>${wf:id()}/second-mr-output</value>
        </property>
    </configuration>
</sub-workflow>
...
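If the propagate-configuration flag element is included, the parent workflow job's configuration is also passed down to the child workflow. A sketch (the app-path value is illustrative):
<sub-workflow>
    <app-path>hdfs://foo:8020/user/cloudduggu/child-wf</app-path>
    <propagate-configuration/>
    <configuration>
        <property>
            <name>input.dir</name>
            <value>${wf:id()}/second-mr-output</value>
        </property>
    </configuration>
</sub-workflow>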
6. Hive Action
The Hive action runs Hive queries on the cluster. Hive provides a SQL-like interface for Hadoop and is a very popular tool for working with Hadoop data. The Hive query and the related configuration, libraries, and code for user-defined functions have to be packaged as part of the workflow bundle and deployed to HDFS.
Hive action contains the following elements.
- job-tracker (compulsory)
- name-node (compulsory)
- prepare
- job-xml
- configuration
- script (compulsory)
- param
- argument
- file
- archive
Let us see the example of Hive action.
...
<hive xmlns="uri:oozie:hive-action:0.2">
    ...
    <script>myscript.q</script>   <!-- Hive script bundled with the workflow; name is illustrative -->
    <argument>-hivevar</argument>
    <argument>InputDir=/home/cloudduggu/input-data</argument>
    <argument>-hivevar</argument>
    <argument>OutputDir=${jobOutput}</argument>
</hive>
...
7. DistCp Action
The DistCp action runs Hadoop's distributed copy (DistCp) tool, which copies data across Hadoop clusters. It can also be used to copy data within the same cluster and to move data between Amazon S3 and a Hadoop cluster.
DistCp action contains the following elements.
- job-tracker (compulsory)
- name-node (compulsory)
- prepare
- configuration
- java-opts
- arg
Let us see the example of DistCp action.
...
<distcp xmlns="uri:oozie:distcp-action:0.2">
    ...
    <arg>hdfs://localhost:8020/path/to/input.txt</arg>
    <arg>${nameNode2}/path/to/output.txt</arg>
</distcp>
8. Email Action
The Email action sends email notifications from a workflow application. It takes the usual email parameters: to, cc, subject, and the body of the email.
Email action contains the following elements.
- to (compulsory)
- cc
- subject (compulsory)
- body (compulsory)
In addition, the following SMTP server configuration has to be defined in the oozie-site.xml file for this action to work (a sample snippet follows the list).
- oozie.email.smtp.host (default: localhost)
- oozie.email.smtp.port (default: 25)
- oozie.email.from.address (default: oozie@localhost)
- oozie.email.smtp.auth (default: false)
- oozie.email.smtp.username (default: empty)
- oozie.email.smtp.password (default: empty)
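As a sketch, these properties are set in oozie-site.xml like any other Oozie configuration property (the host and from-address values below are illustrative):
<property>
    <name>oozie.email.smtp.host</name>
    <value>smtp.example.com</value>
</property>
<property>
    <name>oozie.email.smtp.port</name>
    <value>25</value>
</property>
<property>
    <name>oozie.email.from.address</name>
    <value>oozie@example.com</value>
</property>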
Let us see the example of Email action.
<email xmlns="uri:oozie:email-action:0.1">
    <to>support@cloudduggu.com</to>
    <cc>support@cloudduggu.com</cc>
    <subject>Email notifications for ${wf:id()}</subject>
    <body>The wf ${wf:id()} successfully completed.</body>
</email>
9. Shell Action
The Shell action runs a shell command or script. The command runs on an arbitrary Hadoop cluster node, so the command, and any files it needs, must be available locally on that node.
Shell action contains the following elements.
- job-tracker (compulsory)
- name-node (compulsory)
- prepare
- job-xml
- configuration
- exec (compulsory)
- argument
- env-var
- file
- archive
- capture-output
Let us see the example of Shell action.
...
<shell xmlns="uri:oozie:shell-action:0.2">
    ...
    <exec>${EXEC}</exec>
    <argument>A</argument>
    <argument>B</argument>
    <file>${EXEC}#${EXEC}</file>
</shell>
...
10. SSH Action
The SSH action runs a shell command on a specified remote host. The command is executed on the remote machine from the user's home directory.
SSH action contains the following elements.
- host (compulsory)
- command (compulsory)
- args
- arg
- capture-output
Let us see the example of SSH action.
<ssh>
    <host>foo@bar.com</host>
    <command>uploaddata</command>
    <args>jdbc:derby://bar.com:1527/myDB</args>
    <args>hdfs://foobar.com:8020/usr/joe/myData</args>
</ssh>
11. Sqoop Action
The Sqoop action runs Sqoop jobs to import data from relational databases into Hadoop and to export data from Hadoop back to relational databases. Sqoop uses JDBC to talk to the external database systems.
Sqoop action contains the following elements.
- job-tracker (compulsory)
- name-node (compulsory)
- prepare
- job-xml
- configuration
- command
- arg
- file
- archive
Let us see the example of Sqoop action.
...
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
    ...
    <command>import --connect jdbc:hsqldb:file:db.hsqldb --table test_table --target-dir hdfs://localhost:8020/user/joe/sqoop_tbl -m 1</command>
</sqoop>
...
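Instead of a single command element, the same invocation can also be expressed with individual arg elements, one per command-line token (a sketch):
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
    ...
    <arg>import</arg>
    <arg>--connect</arg>
    <arg>jdbc:hsqldb:file:db.hsqldb</arg>
    <arg>--table</arg>
    <arg>test_table</arg>
    <arg>--target-dir</arg>
    <arg>hdfs://localhost:8020/user/joe/sqoop_tbl</arg>
    <arg>-m</arg>
    <arg>1</arg>
</sqoop>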