Compute Cluster

We have a relatively modest Hadoop cluster for this course: 4 nodes, 60 cores, 128GB memory, 16TB storage.

Connecting Remotely

The goal here is to connect to cluster.cs.sfu.ca by SSH.

To connect to the cluster, you need to either be on the campus network (SFUNET-secure) or be connected through the SFU VPN.

In most cases you just need to SSH to cluster.cs.sfu.ca on port 24 (substituting whatever SSH client you use on your computer):

[yourcomputer]$ ssh -p 24 <USERID>@cluster.cs.sfu.ca
[gateway]$

Once you're connected to the cluster gateway, you can start running spark-submit (and hdfs) commands.

With SSH Keys and Port Forwards

Once you have confirmed that you can connect, take a few minutes to set things up properly.

Create an SSH key (if you don't have one already) so you can log in without a password. Then add your public key to ~/.ssh/authorized_keys on the server (with ssh-copy-id, or by appending it to that file yourself).
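The key setup can be sketched as a small shell function. This is only an illustration: the function name setup_cluster_key, the DRY_RUN switch, and the default key path ~/.ssh/id_ed25519 are assumptions, not part of the course setup.

```shell
# Hypothetical helper: create a key if needed, then install it on the gateway.
# Set DRY_RUN=1 to print the commands instead of running them.
setup_cluster_key() {
    userid="$1"                          # your SFU user ID
    key="${2:-$HOME/.ssh/id_ed25519}"    # assumed key location

    # Only generate a key pair if this one does not already exist.
    if [ ! -f "$key" ]; then
        ${DRY_RUN:+echo} ssh-keygen -t ed25519 -f "$key"
    fi

    # Append the public key to ~/.ssh/authorized_keys on the gateway (port 24).
    ${DRY_RUN:+echo} ssh-copy-id -i "$key.pub" -p 24 "$userid@cluster.cs.sfu.ca"
}
```

Run it on your own computer as setup_cluster_key <USERID>. You will likely be asked for your password one last time while the key is copied.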

Create (or add to) the ~/.ssh/config file on your computer. With this config, you can simply run ssh cluster.cs.sfu.ca to connect. (Bonus: many shells will then tab-complete the host name.)

Host cluster.cs.sfu.ca
  User <USERID>
  Port 24
  LocalForward 8088 controller.local:8088
  LocalForward 9870 controller.local:9870
  LocalForward 18080 controller.local:18080

Then you should be able to just:

ssh cluster.cs.sfu.ca

With this configuration, the port forwards let you reach the cluster's web interfaces (in a limited, unauthenticated way) for as long as the SSH session is open:

HDFS namenode: http://localhost:9870/
YARN resource manager: http://localhost:8088/
Spark job history: http://localhost:18080/
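If you are on a machine where you would rather not edit ~/.ssh/config, the same forwards can be requested for a single session with ssh's -L option. A minimal sketch, where the function name cluster_ssh and the DRY_RUN switch are hypothetical:

```shell
# Hypothetical helper: one-off connection with the same three port forwards
# as the config file. Set DRY_RUN=1 to print the command instead of running it.
cluster_ssh() {
    userid="$1"    # your SFU user ID
    ${DRY_RUN:+echo} ssh -p 24 \
        -L 8088:controller.local:8088 \
        -L 9870:controller.local:9870 \
        -L 18080:controller.local:18080 \
        "$userid@cluster.cs.sfu.ca"
}
```

Each -L option is the command-line equivalent of one LocalForward line, so the localhost URLs above should work the same way while that session is open.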

Copying Files

You will also frequently need to copy files to the cluster:

[yourcomputer]$ scp code.py cluster.cs.sfu.ca:

Or whatever your preferred SCP/SFTP method is.
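The same idea extends to whole directories and to copying results back. The sketch below assumes the ~/.ssh/config entry above is in place (so no user or port is needed); the function names and the DRY_RUN switch are hypothetical.

```shell
# Hypothetical helpers. Set DRY_RUN=1 to print the commands instead of running them.

# Copy a file or a whole directory (-r) to your home directory on the gateway.
push_to_cluster() {
    ${DRY_RUN:+echo} scp -r "$1" cluster.cs.sfu.ca:
}

# Copy a remote file or directory back into the current local directory.
pull_from_cluster() {
    ${DRY_RUN:+echo} scp -r "cluster.cs.sfu.ca:$1" .
}
```

For example, push_to_cluster a2 would copy a local directory named a2, and pull_from_cluster results.txt would fetch a file of that name from your home directory on the gateway (both names are placeholders).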

Spark Applications

Once your code is on the gateway, you can start jobs as usual with spark-submit, and they will be sent to the cluster:

spark-submit code.py ...

Cleaning Up

If you have unnecessary files sitting around (especially large files created as part of an assignment), please clean them up with a command like this:

hdfs dfs -rm -r 'output*'
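To decide what is worth removing, it can help to see how much space each candidate takes up in HDFS. A minimal sketch using hdfs dfs -du; the function name hdfs_usage and the DRY_RUN switch are hypothetical:

```shell
# Hypothetical helper: summarize (-s) the size of each HDFS path matching a
# pattern, in human-readable units (-h). Defaults to everything in your HDFS
# home directory. Set DRY_RUN=1 to print the command instead of running it.
hdfs_usage() {
    pattern="${1:-*}"
    # The pattern is quoted so that HDFS expands it, not the local shell.
    ${DRY_RUN:+echo} hdfs dfs -du -s -h "$pattern"
}
```

On the gateway, hdfs_usage 'output*' should then report one size per matching path.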

It is possible that you have jobs running and consuming resources without realizing it: maybe you created an infinite loop or otherwise have a job burning memory or CPU. You can list jobs running on the cluster like this:

yarn application -list

And kill a specific job:

yarn application -kill <APPLICATION_ID>
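On a shared cluster the list can be long, so it may help to narrow it to your own applications. The sketch below defines a hypothetical helper, my_app_ids, that filters the output of yarn application -list. It assumes the report is tab-separated with the application ID in the first column and the user in the fourth, which may differ between Hadoop versions.

```shell
# Hypothetical helper: read `yarn application -list` output on stdin and
# print the IDs of applications owned by the given user.
# Assumes tab-separated columns with the ID first and the user fourth.
my_app_ids() {
    awk -F '\t' -v user="$1" '
        {
            id = $1; owner = $4
            # Strip the padding spaces around each field.
            gsub(/^ +| +$/, "", id)
            gsub(/^ +| +$/, "", owner)
            # Header and summary lines do not start with "application_".
            if (id ~ /^application_/ && owner == user) print id
        }'
}
```

For example, yarn application -list | my_app_ids "$USER" should print one application ID per line, and each of those can then be passed to yarn application -kill.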

Updated Tue Aug. 29 2023, 10:33 by ggbaker.
