Set up Oozie scheduling

Oozie is a workflow management tool provided when the Hadoop Eco cluster type is Core Hadoop.
With Oozie, you can view the list of bundles, workflows, and coordinators, along with their details and logs. Oozie can be accessed through the quick link on the cluster detail page, using the connection port shown below. The following outlines how to set up Oozie scheduling.

Cluster Type         Connection Port
Standard (Single)    Port 11000 of Master Node 1
HA                   Port 11000 of Master Node 3
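To confirm the Oozie server is reachable on that port before setting up any jobs, you can query its status with the Oozie CLI from a cluster node. The host name below is a placeholder for the master node listed above.

Check Oozie server status
# Replace <master-node-host> with the master node shown in the table above
oozie admin -oozie http://<master-node-host>:11000/oozie -status
# A healthy server typically responds with: System mode: NORMAL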

[Image: Oozie Workflow List]

[Image: Oozie Workflow Job Information]

[Image: Oozie Workflow Job Details]

Prepare

To execute Oozie workflow jobs, you need the workflow.xml file and additional executable files. Upload the prepared files to HDFS and specify the paths in the wf.properties file to execute them. An example for Hive jobs is as follows.
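In the example that follows, the three Hive job files have already been uploaded to hdfs:///wf_hive, and wf.properties points the workflow at that path (the same file appears in full in the Hive job example later). You can confirm the configured path before running the job:

$ cat wf.properties
oozie.use.system.libpath=true
oozie.wf.application.path=hdfs:///wf_hive
user.name=ubuntu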

Hive job execution example
$ hadoop fs -ls hdfs:///wf_hive/
Found 3 items
-rw-r--r-- 2 ubuntu hadoop 22762 2022-03-30 05:11 hdfs:///wf_hive/hive-site.xml
-rw-r--r-- 2 ubuntu hadoop 168 2022-03-30 05:11 hdfs:///wf_hive/sample.hql
-rw-r--r-- 2 ubuntu hadoop 978 2022-03-30 05:11 hdfs:///wf_hive/workflow.xml

$ oozie job -run -config wf.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/oozie-5.2.1/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/oozie-5.2.1/lib/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
job: 0000000-220330040805876-oozie-ubun-W

$ oozie job -info 0000000-220330040805876-oozie-ubun-W
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/oozie-5.2.1/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/oozie-5.2.1/lib/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
Job ID : 0000000-220330040805876-oozie-ubun-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : workflow_sample_job
App Path : hdfs:///wf_hive
Status : SUCCEEDED
Run : 0
User : ubuntu
Group : -
Created : 2022-03-30 05:12 GMT
Started : 2022-03-30 05:12 GMT
Last Modified : 2022-03-30 05:13 GMT
Ended : 2022-03-30 05:13 GMT
CoordAction ID: -

Actions
------------------------------------------------------------------------------------------------------------------------------------
ID Status Ext ID Ext Status Err Code
------------------------------------------------------------------------------------------------------------------------------------
0000000-220330040805876-oozie-ubun-W@:start: OK - OK -
------------------------------------------------------------------------------------------------------------------------------------
0000000-220330040805876-oozie-ubun-W@hive_action OK application_1648613240828_0002 SUCCEEDED -
------------------------------------------------------------------------------------------------------------------------------------
0000000-220330040805876-oozie-ubun-W@end OK - OK -
------------------------------------------------------------------------------------------------------------------------------------

Run Oozie workflow

You can set up Oozie scheduling by running an Oozie workflow.

  1. Prepare the workflow.xml and related files for execution.

  2. Create a folder in HDFS and upload the related files (a combined example follows this list).

    Upload files to HDFS
    hadoop fs -put <local files> <HDFS path>
  3. Set the execution path in the wf.properties file to the uploaded path.

    • Set oozie.wf.application.path
  4. Run the Oozie job.

    Run Oozie job
    oozie job -run -config wf.properties
  5. Check the results.

    Check Oozie results
    oozie job -info [workflow id]
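
Putting these steps together for the Hive example, a minimal end-to-end sequence looks like the following. The local directory wf_hive/ is an assumption about how you staged the files; the HDFS path matches the Hive job example below.

# 2. Create the HDFS folder and upload workflow.xml and the related files
hadoop fs -mkdir -p hdfs:///wf_hive
hadoop fs -put wf_hive/workflow.xml wf_hive/hive-site.xml wf_hive/sample.hql hdfs:///wf_hive/

# 3. wf.properties sets oozie.wf.application.path=hdfs:///wf_hive

# 4. Run the Oozie job
oozie job -run -config wf.properties

# 5. Check the results with the workflow ID printed by the previous command
oozie job -info [workflow id]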

Hive job example

Between the Standard (Single) and High Availability (HA) types, only the resource-manager and name-node values in the workflow differ; everything else is the same.

workflow.xml
<workflow-app xmlns="uri:oozie:workflow:1.0" name="workflow_sample_job">
<start to="hive_action" />

<action name="hive_action">
<hive xmlns="uri:oozie:hive-action:1.0">
<resource-manager>hadoopmst-hadoop-single-1:8050</resource-manager>
<name-node>hdfs://hadoopmst-hadoop-single-1</name-node>
<job-xml>hive-site.xml</job-xml>
<configuration>
<property>
<name>hive.tez.container.size</name>
<value>2048</value>
</property>
<property>
<name>hive.tez.java.opts</name>
<value>-Xmx1600m</value>
</property>
</configuration>
<script>sample.hql</script>
</hive>
<ok to="end" />
<error to="kill" />
</action>

<kill name="kill">
<message>Error!!</message>
</kill>

<end name="end" />

</workflow-app>
sample.hql
$ cat sample.hql
create table if not exists t1 (col1 string);
insert into table t1 values ('a'), ('b'), ('c');
select col1, count(*) from t1 group by col1;
show tables;
show databases;
wf.properties
oozie.use.system.libpath=true
oozie.wf.application.path=hdfs:///wf_hive
user.name=ubuntu
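
Once the workflow reports the SUCCEEDED status, you can verify what sample.hql produced. A minimal check, assuming the Hive CLI is available on the node you are connected to:

# Query the table created and populated by sample.hql
hive -e "select col1, count(*) from t1 group by col1;"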

Spark job example

Between the Standard (Single) and High Availability (HA) types, the resource-manager and name-node values in the workflow differ, as do the host names passed through spark-opts (for example in spark.hadoop.yarn.resourcemanager.address and spark.yarn.stagingDir). Keep these differences in mind when running the job.

workflow.xml
<workflow-app xmlns="uri:oozie:workflow:1.0" name="workflow_sample_job">
<start to="spark_action" />

<action name="spark_action">
<spark xmlns="uri:oozie:spark-action:1.0">
<resource-manager>hadoopmst-hadoop-single-1:8050</resource-manager>
<name-node>hdfs://hadoopmst-hadoop-single-1</name-node>
<master>yarn-client</master>
<name>Spark Example</name>
<class>org.apache.spark.examples.SparkPi</class>
<jar>/opt/spark/examples/jars/spark-examples_2.11-2.4.6.jar</jar>
<spark-opts>--executor-memory 2G --conf spark.hadoop.yarn.resourcemanager.address=hadoopmst-hadoop-single-1:8050 --conf spark.yarn.stagingDir=hdfs://hadoopmst-hadoop-single-1/user/ubuntu --conf spark.yarn.appMasterEnv.HADOOP_CONF_DIR=/etc/hadoop/conf --conf spark.io.compression.codec=snappy</spark-opts>
<arg>100</arg>
</spark>
<ok to="end" />
<error to="kill" />
</action>

<kill name="kill">
<message>Error!!</message>
</kill>

<end name="end" />

</workflow-app>
wf.properties
oozie.use.system.libpath=true
oozie.wf.application.path=hdfs:///wf_spark
user.name=ubuntu
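
To check the Spark job, query the workflow status as in the Hive example and, if needed, inspect the driver output in the YARN logs. The application ID below is a placeholder; use the Ext ID shown for the spark action in the oozie job -info output.

# Check the workflow status
oozie job -info [workflow id]

# Inspect the YARN application logs; with yarn-client mode the SparkPi driver output
# (a line such as "Pi is roughly ...") appears in the Oozie launcher's logs
yarn logs -applicationId <application id>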

Shell job example

Between the Standard (Single) and High Availability (HA) types, only the resource-manager and name-node values in the workflow differ; everything else is the same.

workflow.xml
<workflow-app xmlns='uri:oozie:workflow:1.0' name='shell-wf'>
    <start to='shell1' />

    <action name='shell1'>
        <shell xmlns="uri:oozie:shell-action:1.0">
            <resource-manager>hadoopmst-hadoop-single-1:8050</resource-manager>
            <name-node>hdfs://hadoopmst-hadoop-single-1</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <exec>echo.sh</exec>
            <argument>A</argument>
            <argument>B</argument>
            <file>echo.sh#echo.sh</file>
        </shell>
        <ok to="end" />
        <error to="fail" />
    </action>

    <kill name="fail">
        <message>Script failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name='end' />
</workflow-app>
wf.properties
oozie.use.system.libpath=true
oozie.wf.application.path=hdfs:///wf_shell
user.name=ubuntu
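
The workflow executes an echo.sh script that must be uploaded to hdfs:///wf_shell together with workflow.xml. The script itself is not included in this guide, so the version below is only an illustrative assumption: it prints the two arguments (A and B) passed in by the shell action.

#!/bin/bash
# echo.sh - print the arguments received from the Oozie shell action
echo "arg1=$1 arg2=$2"

Upload the script alongside the other files, for example with hadoop fs -put echo.sh hdfs:///wf_shell/, before running the workflow.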