Compiling Hadoop Code with Eclipse

Setup

Start by downloading the Hadoop release. Get the “binary” package for the version you want. Our cluster is currently running 3.3.4, so it might be wise to match that. Unpack the Hadoop release somewhere (e.g. tar xzf hadoop-3.3.4.tar.gz). On the lab machines, the releases can be found in the /usr/shared/CMPT/big-data/ folder.

Create a new project in Eclipse for your work. Get to the classpath editor: Project → Properties → Java Build Path page → Libraries tab (or see more help on setting the classpath).

Click “Add External JARs...”. You will need to add at least:

  • share/hadoop/common/hadoop-common-3.3.4.jar
  • share/hadoop/mapreduce/hadoop-mapreduce-client-*.jar
  • share/hadoop/common/lib/guava-*.jar
  • share/hadoop/common/lib/commons-logging-*.jar
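
With those JARs on the build path, a job class along the lines of the standard WordCount example should compile. Here is a minimal sketch; the class name and the ca.sfu.whatever package are placeholders chosen to match the run commands below:

package ca.sfu.whatever; // placeholder package, matching the commands below

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // emits (word, 1) for every whitespace-separated token in a line
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                if (!w.isEmpty()) {
                    word.set(w);
                    context.write(word, one);
                }
            }
        }
    }

    // sums the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}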

Compiling and Running

Create a JAR file containing your project: File → Export... → JAR File.

Copy that to the cluster. The Unix-ish command to do that is something like this (but see the Cluster instructions for more details):

scp WordCount.jar <USERID>@cluster.cs.sfu.ca:

And then on the cluster login node, run the job as usual:

yarn jar ~/WordCount.jar WordCount wordcount-1 output-1 # no package name
yarn jar ~/WordCount.jar ca.sfu.whatever.WordCount wordcount-1 output-1 # with package name
hdfs dfs -cat output-1/part-* | less
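
One thing to watch: MapReduce refuses to overwrite an existing output directory, so remove it between runs (hdfs dfs -rm -r output-1) or use a fresh name each time.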

Creating and Copying JAR Automatically

Eclipse needs to be set up to know how to do an SCP transfer. You will need the JSch JAR saved somewhere that makes you happy. In Eclipse, select Window → Preferences → Ant → Runtime → Classpath → Global Entries → Add External JARs and select the JSch JAR.

Add a file build.xml to the root directory of your project with these contents:

<project name="My Project" default="createjar">
  <property name="projectHome" location="." />
  <target name="createjar">
    <!-- "bin" is Eclipse's default output folder for compiled classes -->
    <jar destfile="${projectHome}/WordCount.jar" basedir="${projectHome}/bin" />
    <!-- the localFile name must match the destfile above, case included -->
    <scp localFile="${projectHome}/WordCount.jar"
         todir="[USERID]@cluster.cs.sfu.ca:"
         keyfile="/home/[USERID]/.ssh/id_rsa" passphrase="[KEYFILE_PASSPHRASE]"
         trust="true" />
  </target>
</project>
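
You can test the target outside Eclipse by running ant createjar in the project directory (put the JSch JAR on Ant's classpath with ant -lib if needed). The trust="true" setting skips SSH host-key checking; if you would rather verify the host, point the task's knownhosts attribute at a known_hosts file instead.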

In the project properties window, select Builders → New... → Ant Builder. For the Buildfile, choose Browse Workspace and select the build.xml file you just created.

After this, Ctrl-B should build the project and upload the JAR file, ready to run the job.

Interactive debugging of Hadoop code

It is possible to interactively debug and step through your Hadoop code in Eclipse. To avoid NoClassDefFoundError problems, it is easiest to just put all of the Hadoop JARs on the build path. Below is a way to do that without polluting your original project:

  • Create a new Java Project without source code, but add the following to its build path: (a) under the Libraries tab, pick all the JAR files that the Hadoop release provides in its subfolders, and (b) in the Projects tab, pick the other project(s) in the workspace that you'd like to debug.
  • Create a new Debug Configuration for this combined project and choose org.apache.hadoop.util.RunJar as the main class. As Arguments, provide the command-line input that would usually follow yarn jar to start your Java class from your JAR file (see the example after this list).
  • Place a breakpoint in your code and use the debug task to start debugging.
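
For example, with the WordCount project above, the Arguments field would hold something like the following (placeholder names again, and local filesystem paths, since the job runs in-process under the debugger):

WordCount.jar ca.sfu.whatever.WordCount wordcount-local output-local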

The shell script below creates symbolic links to all of Hadoop's JAR files so they can be added to the build path more easily. It also prints some more detailed instructions when it finishes.

Note: on OS X these links did not work for some people, but simply adding all JARs by hand did, which should also work under Windows.

#!/bin/bash

# create symlinks for all .jar files that are found recursively in $source_path

if [ "$#" -lt 1 ]; then
    echo "Illegal number of parameters"
    echo "Provide source path, such as \$HADOOP_HOME/share/hadoop, as input argument"
    # source_path=$HADOOP_HOME/share/hadoop
    exit -1
else
    source_path=$1
    echo "source path for JARs: $source_path"
fi

if [ "$#" -gt 1 ]; then
    target_path=$2
else
    target_path=$(pwd)/hadoop_alljar
fi
echo "creating target path: $target_path"
mkdir -p "$target_path"
cd "$target_path" || exit 1
target_path=$(pwd)

cd "$source_path" || exit 1
source_path=$(pwd)
echo "linking JARs"
# note: this loop assumes there are no spaces in the JAR paths
for f in $(find . -name '*.jar'); do
    # -f ignores the "link already exists" errors for identical jars
    ln -sf "$source_path/$f" "$target_path/$(basename "$f")"
done
echo "done."

echo "
 To debug your Hadoop project in eclipse:
 * Add an empty hadoop-dbg Java Project to your eclipse workspace
 * Add all the hadoop jars in $target_path
   as External JARs to the build path of hadoop-dbg
 * Add your own project(s) of interest to the build path of hadoop-dbg
 * Create a new RunJar debug task (Run -> Debug Configurations -> Java Application) and set
   Project: hadoop-dbg
   Main class: org.apache.hadoop.util.RunJar
   (found this out by checking what yarn jar ... invokes)
   Arguments to the program same as for yarn jar:
     <your jar path> <your class name> <local input path> <local output path>

 * Place a breakpoint in your main function or wherever you like and start the Debug task locally

 * When you step through the call stack and are missing hadoop sources,
   attach them as External source folder in eclipse
   use /usr/shared/CMPT/big-data/hadoop-2.6.0-src in CSIL (adjust to the Hadoop version you are running)
   or download that folder for home use
   (e.g. tar/zip and scp -P 24 csil-cpu4.csil.sfu.ca:hadoopsrc.tgz .)
"