Top 30 Apache Oozie Questions and Answers
1. What is Apache Oozie?
Apache Oozie is a workflow scheduler used to schedule and run Hadoop jobs. It is a Java web application that supports Apache MapReduce, Hive, Pig, and Sqoop jobs. Oozie is a scalable and extensible system; it has been used in production at Yahoo! to run more than 200,000 jobs each day.
2. What are the important features of Apache Oozie?
- It provides a client API and a command-line interface to launch, control, and monitor jobs; the client API can be called from Java applications.
- It provides web service APIs to control jobs from anywhere.
- It can execute jobs that are scheduled to run periodically.
- It can send email notifications when jobs complete.
3. What is the use of Apache Oozie?
Apache Oozie provides a better way to handle multiple jobs. Some jobs must run at a particular time, and others depend on the completion of other jobs. Oozie makes this kind of execution easy. An administrator can use Oozie to run jobs back to back or in sequence, and to control them from anywhere.
4. What are the main components of the Apache Oozie workflow?
The Apache Oozie workflow contains two main components.
- Control flow nodes: These nodes define the start and end of the workflow and manage the execution path through the workflow.
- Action nodes: These nodes trigger the execution of a processing or computation task. Oozie supports actions such as Hadoop MapReduce, Pig, and file system (FS) actions, as well as system-specific actions such as HTTP, SSH, and email.
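As a rough illustration of the two node types, the sketch below writes a minimal workflow definition in which `start`, `kill`, and `end` are control flow nodes and a single `fs` action is the action node. The application name, node names, and the `/tmp/oozie-demo` path are made up for the example, `nameNode` is assumed to be defined in the job properties, and the schema version (0.5) is an assumption that should be matched to the Oozie release in use.

```shell
# Write a minimal workflow definition (all names and paths are illustrative).
mkdir -p demo-wf
cat > demo-wf/workflow.xml <<'EOF'
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <!-- Control flow node: entry point of the workflow -->
    <start to="make-dir"/>

    <!-- Action node: a file system action that creates a directory on HDFS -->
    <action name="make-dir">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie-demo"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <!-- Control flow node: finishes the job in the KILLED state -->
    <kill name="fail">
        <message>make-dir failed</message>
    </kill>

    <!-- Control flow node: finishes the job in the SUCCEEDED state -->
    <end name="end"/>
</workflow-app>
EOF
```

The quoted heredoc delimiter (`'EOF'`) keeps the shell from expanding `${nameNode}`, so the variable is left for Oozie to resolve.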
5. What is the use of Fork and Join nodes in Apache Oozie?
The fork and join nodes are used in pairs. The fork node splits one execution path into several concurrent paths, whereas the join node waits for two or more concurrent execution paths and merges them back into a single one.
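A fork/join pair might look like the following sketch, which writes a workflow where two `fs` actions run concurrently and the join waits for both. The node names (`split`, `merge`, `make-dir-a`, `make-dir-b`) and paths are hypothetical, and the schema version is an assumption.

```shell
# Write a workflow with one fork and its matching join (illustrative names).
mkdir -p fork-join-wf
cat > fork-join-wf/workflow.xml <<'EOF'
<workflow-app name="fork-join-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="split"/>

    <!-- Fork: start both actions concurrently -->
    <fork name="split">
        <path start="make-dir-a"/>
        <path start="make-dir-b"/>
    </fork>

    <action name="make-dir-a">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie-demo/a"/>
        </fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <action name="make-dir-b">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie-demo/b"/>
        </fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <!-- Join: continue only after every forked path has arrived -->
    <join name="merge" to="end"/>

    <kill name="fail">
        <message>A forked path failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
```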
6. What are the important EL functions present in the Oozie workflow?
The following are some important EL functions available in an Oozie workflow.
- wf:name() This function returns the workflow application name of the current workflow job.
- wf:id() This function returns the job ID of the current workflow job.
- wf:errorCode(String node) This function returns the error code of the specified action node, or an empty string if that action has not exited in an ERROR state.
- wf:lastErrorNode() This function returns the name of the last action node that exited in an ERROR state, or an empty string if no action has done so.
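These functions are typically used inside `${...}` expressions in the workflow definition. The sketch below is one possible use, with hypothetical node names and paths: it builds a per-job output directory from `wf:name()` and `wf:id()`, and reports the failing node and its error code in the kill message.

```shell
# Write a workflow that uses workflow EL functions (illustrative names).
mkdir -p el-demo-wf
cat > el-demo-wf/workflow.xml <<'EOF'
<workflow-app name="el-demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="make-dir"/>

    <action name="make-dir">
        <fs>
            <!-- One directory per workflow job, named after the job ID -->
            <mkdir path="${nameNode}/tmp/${wf:name()}/${wf:id()}"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <!-- Report which node failed and with which error code -->
        <message>Job ${wf:id()} failed at node ${wf:lastErrorNode()} with error code ${wf:errorCode(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
```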
7. What are the actions that can be performed in Oozie?
The following action nodes are supported by Apache Oozie.
- MapReduce Action
- Java Action
- Pig Action
- FS Action
- Sub-Workflow Action
- Hive Action
- DistCp Action
- Email Action
- Shell Action
- SSH Action
- Sqoop Action
8. What is Oozie Bundle?
An Oozie bundle lets you manage a set of coordinator jobs as a batch. Bundle jobs can be started, stopped, suspended, resumed, re-run, or killed as a unit, which gives better operational control.
9. What is the lifecycle of the Oozie workflow job?
The following are the states through which an Oozie workflow job transitions.
- PREP: The workflow job has been created but is not yet running.
- RUNNING: The job has been started and is executing.
- SUSPENDED: The job has been suspended and stays in this state until it is resumed or killed.
- SUCCEEDED: The job has reached the end node.
- KILLED: The job has been killed, either on request by its owner or an administrator, or by reaching a kill node.
- FAILED: The job has failed due to an unexpected error.
10. Why is there Oozie security?
Oozie provides security so that a user cannot alter another user's jobs. Because Hadoop does not, by default, authenticate end users, Oozie verifies users itself and then forwards their jobs to Hadoop.
11. Are cycles supported in Apache Oozie workflows?
No. An Apache Oozie workflow must be a directed acyclic graph (DAG), so cycles are not supported. If Oozie detects a cycle in the workflow definition when the workflow application is deployed, the deployment fails.
12. Do we have alternatives to Apache Oozie?
The following are alternatives to Apache Oozie.
- Apache NiFi
- Azkaban (developed at LinkedIn; not an Apache project)
- Apache Falcon
13. What files does a workflow application contain?
A workflow application contains the following files.
- Workflow definition – workflow.xml
- Configuration file – config-default.xml
- Pig scripts
- Application libraries – JAR and SO files in the lib/ directory
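To make the layout concrete, the sketch below creates an empty skeleton of a workflow application directory on the local file system. The directory name `demo-app` and the file names `script.pig` and `demo-udfs.jar` are placeholders, and the files are created empty purely to show where each one belongs.

```shell
# Create a placeholder workflow application layout (all names are illustrative).
mkdir -p demo-app/lib
touch demo-app/workflow.xml        # workflow definition
touch demo-app/config-default.xml  # default values for workflow parameters
touch demo-app/script.pig          # a Pig script referenced by a Pig action
touch demo-app/lib/demo-udfs.jar   # libraries are picked up from lib/

# Show the resulting tree.
find demo-app | sort

# The directory would then be copied to HDFS, for example:
#   hdfs dfs -put demo-app /user/<user-name>/apps/
```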
14. What are the main engines of Oozie?
Oozie supports job scheduling for Apache MapReduce, Hive, Sqoop, and Pig. It has two main parts.
- Workflow engine: The workflow engine stores and runs workflows composed of Hadoop jobs.
- Coordinator engine: The coordinator engine runs workflow jobs based on predefined schedules and data availability.
15. What are the important properties that need to be set in the .properties file?
The following properties are commonly set in the job .properties file.
- Name node address (for example, nameNode)
- Job tracker address (for example, jobTracker)
- oozie.wf.application.path
- Library path (for example, oozie.libpath)
- Jar path
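A job properties file along these lines might look like the sketch below. The host names, ports, and HDFS paths are assumptions for a single-node test setup, and `nameNode`, `jobTracker`, and `queueName` are conventional variable names that the workflow definition would reference rather than names required by Oozie.

```shell
# Write an example job.properties (hosts, ports, and paths are assumptions).
cat > job.properties <<'EOF'
# Cluster endpoints referenced from workflow.xml as ${nameNode} and ${jobTracker}
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
queueName=default

# Use the Oozie share lib, plus an extra library directory on HDFS
oozie.use.system.libpath=true
oozie.libpath=${nameNode}/user/${user.name}/apps/extra-lib

# HDFS directory that contains workflow.xml
oozie.wf.application.path=${nameNode}/user/${user.name}/apps/demo-wf
EOF
```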
16. What are the important files required while running a Hive action in Oozie?
The following files are required.
- hive.hql
- hive-site.xml
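The sketch below shows how these two files could be wired into a Hive action: `hive-site.xml` is passed through `<job-xml>` and `hive.hql` through `<script>`. The query, node names, and schema versions are assumptions, and `hive-site.xml` itself is expected to be copied from the cluster's Hive configuration into the application directory.

```shell
# Write a workflow with a single Hive action and a placeholder script.
mkdir -p hive-wf
cat > hive-wf/hive.hql <<'EOF'
-- Placeholder query for illustration only
SHOW TABLES;
EOF

cat > hive-wf/workflow.xml <<'EOF'
<workflow-app name="hive-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-node"/>

    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.5">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Hive client configuration shipped with the application -->
            <job-xml>hive-site.xml</job-xml>
            <!-- Hive script to execute -->
            <script>hive.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Hive action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
```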
17. What is an application pipeline in Oozie?
An application pipeline (or data application pipeline) is a chain of workflows in which the output of one or more workflows becomes the input to the next workflow.
18. What is the role of the decision node in Oozie?
A decision node behaves like a switch-case statement: it evaluates a list of predicate expressions and transitions to a different node depending on which expression is true.
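As a sketch of the switch-like behaviour, the workflow below checks whether an input directory exists and either processes it or skips straight to the end. The node names and the `inputDir` job property are hypothetical, and the `fs` action merely stands in for real processing.

```shell
# Write a workflow with a decision node (illustrative names).
mkdir -p decision-wf
cat > decision-wf/workflow.xml <<'EOF'
<workflow-app name="decision-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="check-input"/>

    <!-- Cases are evaluated in order; the first true predicate wins -->
    <decision name="check-input">
        <switch>
            <case to="process">${fs:exists(inputDir)}</case>
            <default to="end"/>
        </switch>
    </decision>

    <action name="process">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie-demo/processed"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>process failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
```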
19. What are the different types of Oozie jobs?
The following is a list of different types of Oozie jobs.
- Oozie Workflow
- Oozie Coordinator
- Oozie Bundle
20. What is the retention of the Oozie log file?
Oozie log files are retained for 30 days, or up to a total of 720 log files.
21. What is the command-line option to check the status of a workflow, coordinator, or bundle job in Oozie?
The following command checks the status of a workflow, coordinator, or bundle job in Oozie.
$ oozie job -oozie http://localhost:11000/oozie -info <job-id>
22. What is Workflow in Apache Oozie?
An Apache Oozie workflow is a collection of action nodes and control nodes arranged in a directed acyclic graph (DAG) that captures control dependencies. Each action typically represents a Hadoop job, such as a MapReduce, Pig, Hive, Sqoop, or DistCp job. Apart from Hadoop jobs, an action can also be a Java application, a shell script, or an email notification.
23. What is the Coordinator in Apache Oozie?
The Apache Oozie coordinator handles trigger-based workflow execution. It provides a simple framework for defining triggers or predicates and then schedules workflows based on them. It also helps administrators monitor and control workflow execution depending on cluster conditions and application-specific constraints.
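A purely time-based trigger could be expressed as in the sketch below, which writes a coordinator that runs one workflow once a day. The dates, the application name, and the `workflowAppUri` property are made up for the example; data-availability triggers would additionally need dataset and input-event definitions, which are left out here.

```shell
# Write a daily, time-triggered coordinator definition (illustrative values).
mkdir -p demo-coord
cat > demo-coord/coordinator.xml <<'EOF'
<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS path of the workflow application to run each day -->
            <app-path>${workflowAppUri}</app-path>
        </workflow>
    </action>
</coordinator-app>
EOF
```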
24. What is Bundle in Apache Oozie?
An Apache Oozie bundle is a collection of Oozie coordinator applications, together with a directive on when to start them. Users can start, stop, suspend, resume, and rerun jobs at the bundle level, which gives better operational control. Bundles are defined in an XML-based language called the Bundle Specification Language. This level of abstraction is useful in many large enterprises.
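In the Bundle Specification Language, a bundle that groups two coordinators might be written as in the sketch below. The bundle and coordinator names, the kick-off time, and the `coordAppPathA` and `coordAppPathB` properties are hypothetical, and the schema version is an assumption.

```shell
# Write a bundle definition that groups two coordinators (illustrative names).
mkdir -p demo-bundle
cat > demo-bundle/bundle.xml <<'EOF'
<bundle-app name="demo-bundle" xmlns="uri:oozie:bundle:0.2">
    <controls>
        <!-- When the bundle should start its coordinators -->
        <kick-off-time>2024-01-01T00:00Z</kick-off-time>
    </controls>

    <coordinator name="coord-a">
        <app-path>${coordAppPathA}</app-path>
    </coordinator>

    <coordinator name="coord-b">
        <app-path>${coordAppPathB}</app-path>
    </coordinator>
</bundle-app>
EOF
```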
25. What is the default database name which Oozie uses to store job ids and job status?
By default, Oozie uses an embedded Apache Derby database to store job IDs and job status.
26. What is the use of sub-workflow action in Oozie?
The sub-workflow action runs a child workflow job as one action of the parent workflow. The child workflow can reside either in the same Oozie system or in another Oozie system. The parent workflow job waits until the child workflow job has completed.
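A parent workflow that delegates to a child might look like the following sketch. The child application path and node names are hypothetical; `<propagate-configuration/>` is included on the assumption that the child should inherit the parent's job configuration.

```shell
# Write a parent workflow that runs a child workflow (illustrative names).
mkdir -p parent-wf
cat > parent-wf/workflow.xml <<'EOF'
<workflow-app name="parent-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="run-child"/>

    <action name="run-child">
        <sub-workflow>
            <!-- HDFS directory that holds the child's workflow.xml -->
            <app-path>${nameNode}/user/${wf:user()}/apps/child-wf</app-path>
            <!-- Pass the parent's job configuration down to the child -->
            <propagate-configuration/>
        </sub-workflow>
        <!-- The parent moves on only after the child job has finished -->
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Child workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
```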
27. What kind of application is Oozie?
Oozie is a server-based Java web application that runs in an embedded Tomcat server.
28. What is the command-line syntax to submit a workflow, coordinator, or bundle job?
The following command submits a workflow, coordinator, or bundle job. The job is created in the PREP state, and the command prints its job ID.
$ oozie job -oozie http://localhost:11000/oozie -config job.properties -submit
29. What is the command-line syntax to start a workflow, coordinator, or bundle job?
The following command starts a previously submitted workflow, coordinator, or bundle job that is in the PREP state.
$ oozie job -oozie http://localhost:11000/oozie -start <job-id>
30. What is the command to get the status of all running Oozie workflows?
The following command lists all running Oozie workflow jobs, up to 1000 of them.
$ oozie jobs -oozie http://localhost:11000/oozie -filter status=RUNNING -len 1000