
Compiling Hadoop Code

These instructions will get you to the point that you can compile a JAR on a Linux machine at the command line. You can then run it on your machine or on our Cluster.

Setup

See the Platform page for information on setting up your system.

Start by downloading a Hadoop release. Get the “binary” package for the version you want.

Unpack the Hadoop release somewhere; we'll refer to that location as HADOOP_HOME (adjust the commands below for wherever you put the files). Hadoop also needs JAVA_HOME set:

export HADOOP_HOME=/home/me/hadoop-3.3.4
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

CSIL Linux

On the lab computers, the right values are:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/shared/CMPT/big-data/hadoop-3.3.1

On the Cluster Gateway

If you need to compile code on cluster.cs.sfu.ca, you don't need to set those environment variables, but see the alternate versions of the commands below.

Compiling

Now you can compile your Java code (adding more javac calls as necessary) and create a JAR like this:

${JAVA_HOME}/bin/javac -classpath `${HADOOP_HOME}/bin/hadoop classpath` WordCount.java
${JAVA_HOME}/bin/jar cf wordcount.jar WordCount*.class
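
If you don't already have a WordCount.java to compile, here is a minimal sketch along the lines of the classic Hadoop MapReduce word-count example; the class and file names match the commands above, but treat it as an illustrative starting point:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emit (word, 1) for every whitespace-separated token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as combiner): sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}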

On the Cluster Gateway

It will be easier to compile code on your local machine, but it's possible to do it on cluster.cs.sfu.ca with commands like:

javac -classpath `hadoop classpath` WordCount.java
jar cf wordcount.jar WordCount*.class

Running Locally

You should be able to run the job on your computer (or one of the computers in the lab) with a command like:

${HADOOP_HOME}/bin/yarn jar wordcount.jar WordCount \
    wordcount-1 output-1
less output-1/part-*

Running on the Cluster

You can transfer this JAR file to the cluster like this (but see the Cluster instructions for more details):

scp wordcount.jar <USERID>@cluster.cs.sfu.ca:

Then run it on the cluster:

yarn jar wordcount.jar WordCount \
    /courses/732/wordcount-1 output-1
hdfs dfs -cat output-1/part-* | less

Adding JARs

If you have additional dependencies in .jar files, you can tell the Hadoop tools about them:

export HADOOP_CLASSPATH=/path/to/jar1.jar:/path/to/jar2.jar

And then compile as above. (The hadoop classpath command checks that variable and includes those JARs in the compilation classpath.)

When you run your code on the cluster, you need to tell YARN that the extra .jar files must be distributed to the nodes during execution. That is done with the -libjars argument, like this:

yarn jar jarfile.jar ClassName -libjars /path/to/jar1.jar,/path/to/jar2.jar arg0 arg1
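
One caveat: -libjars is a Hadoop “generic option”, so it only takes effect if your main class lets Hadoop parse generic options, typically by implementing Tool and starting through ToolRunner. A minimal sketch of that pattern (ClassName is a placeholder for your own driver class):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ClassName extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // By this point ToolRunner has consumed -libjars and applied it to
        // getConf(); args contains only arg0, arg1, ...
        // ... build and submit your Job from getConf() here ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options (-libjars, -D, etc.) before run().
        System.exit(ToolRunner.run(new Configuration(), new ClassName(), args));
    }
}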