Compiling Hadoop Code
These instructions will get you to the point where you can compile a JAR on a Linux machine at the command line. You can then run it on your own machine or on our Cluster.
Setup
See the Platform page for information on setting up your system.
Start by downloading the Hadoop release. Get the “binary” package for the version you want.
Unpack the Hadoop release somewhere; we'll refer to that location as HADOOP_HOME (adjust the code below for the place you put the files). Hadoop also likes to have JAVA_HOME set:
export HADOOP_HOME=/home/me/hadoop-3.3.4
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
CSIL Linux
On the lab computers, the right values are:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/shared/CMPT/big-data/hadoop-3.3.1
On the Cluster Gateway
If you need to compile code on cluster.cs.sfu.ca, you don't need to set those environment variables, but see the alternate versions of the commands below.
Compiling
Now you can compile your Java code (adding more javac calls as necessary) and create a JAR like this:
${JAVA_HOME}/bin/javac -classpath `${HADOOP_HOME}/bin/hadoop classpath` WordCount.java
${JAVA_HOME}/bin/jar cf wordcount.jar WordCount*.class
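If you don't already have a WordCount.java to compile, here is a minimal sketch based on the classic word count example from the standard Hadoop MapReduce tutorial (the version used in the course may differ):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emit (word, 1) for every whitespace-separated token in the input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also usable as a combiner): sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the input directory; args[1] is the (not-yet-existing) output directory.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}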
On the Cluster Gateway
It will be easier to compile code on your local machine, but it's possible to do it on cluster.cs.sfu.ca with commands like:
javac -classpath `hadoop classpath` WordCount.java
jar cf wordcount.jar WordCount*.class
Running Locally
You should be able to run the job on your computer (or one of the computers in the lab) with commands like:
${HADOOP_HOME}/bin/yarn jar wordcount.jar WordCount \
wordcount-1 output-1
less output-1/part-*
Running on the Cluster
You can transfer this JAR file to the cluster like this (but see the Cluster instructions for more details):
scp wordcount.jar <USERID>@cluster.cs.sfu.ca:
Then, on the cluster, run it:
yarn jar wordcount.jar WordCount \
/courses/732/wordcount-1 output-1
hdfs dfs -cat output-1/part-* | less
Adding JARs
If you have additional dependencies in .jar files, you can tell the Hadoop tools about them:
export HADOOP_CLASSPATH=/path/to/jar1.jar:/path/to/jar2.jar
And then compile as above. (The hadoop classpath command checks that variable and adds those JARs to your compilation classpath.)
When you run your code on the cluster, you need to tell YARN that the extra .jar file needs to be distributed to the nodes during execution. That is done with the -libjars argument, like this:
yarn jar jarfile.jar ClassName -libjars /path/to/jar1.jar,/path/to/jar2.jar arg0 arg1
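Note that -libjars (like the other generic options such as -D and -files) is handled by Hadoop's GenericOptionsParser, which is only applied if your main class runs through ToolRunner; it must also appear before your own arguments, as in the command above. A minimal sketch of a ToolRunner-based driver (ClassName here is a placeholder matching the command above; the job setup inside run() is the same as before):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ClassName extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects -libjars and the other generic options;
        // args contains only the remaining application arguments (arg0, arg1, ...).
        Job job = Job.getInstance(getConf(), "my job");
        job.setJarByClass(ClassName.class);
        // ... set mapper, reducer, and input/output paths as usual ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses and strips the generic options before calling run().
        System.exit(ToolRunner.run(new Configuration(), new ClassName(), args));
    }
}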