Compute Cluster
We have a relatively modest Hadoop cluster for this course: 4 nodes, 60 cores, 128GB memory, 16TB storage.
Connecting Remotely
The goal here is to connect to cluster.cs.sfu.ca by SSH.
To connect to the cluster, you need to either be on the campus network (SFUNET-secure) or have the SFU VPN.
You generally just need to SSH to cluster.cs.sfu.ca port 24 (substituting whatever SSH method you use on your computer):
[yourcomputer]$ ssh -p24 <USERID>@cluster.cs.sfu.ca
[gateway]$
Once you're connected to the cluster gateway, you can start running spark-submit (and hdfs) commands.
With SSH Keys and Port Forwards
Once you have confirmed that you can connect, get things set up properly…
Create an SSH key (if you don't have one already) so you can log in without a password. Then add your public key to ~/.ssh/authorized_keys on the server (with ssh-copy-id, or by appending it to that file yourself).
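As a rough sketch of the key step above, the shell function below creates a key pair only if one is not already present. The function name ensure_ssh_key, the Ed25519 key type, the default path ~/.ssh/id_ed25519, and the empty passphrase are all assumptions made for illustration, not course requirements; a passphrase combined with ssh-agent is the more cautious choice.

```shell
# Hypothetical helper: create an Ed25519 key pair unless one already exists.
ensure_ssh_key() {
    local key="${1:-$HOME/.ssh/id_ed25519}"   # assumed default path
    if [ -f "$key" ]; then
        echo "Key already exists: $key"
        return 0
    fi
    # Create the directory (private to you) if it is missing
    mkdir -p -m 700 "$(dirname "$key")"
    # -N "" means no passphrase (an assumption; omit it to be prompted for one)
    ssh-keygen -q -t ed25519 -N "" -f "$key"
}

# Usage, from your own computer:
#   ensure_ssh_key
#   ssh-copy-id -i ~/.ssh/id_ed25519.pub -p 24 <USERID>@cluster.cs.sfu.ca
```

The ssh-copy-id step will ask for your password one last time; after that, logins should use the key.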
Create (or add to) the ~/.ssh/config file on your computer. With this config, you can simply ssh cluster.cs.sfu.ca to connect. (Bonus: tab-completion.)
Host cluster.cs.sfu.ca
User <USERID>
Port 24
LocalForward 8088 controller.local:8088
LocalForward 9870 controller.local:9870
LocalForward 18080 controller.local:18080
Then you should be able to connect with just:
ssh cluster.cs.sfu.ca
With this configuration, port forwards will let you connect (in a limited unauthenticated way) to the web interfaces:
- HDFS namenode: http://localhost:9870/
- YARN resource manager: http://localhost:8088/
- Spark job history: http://localhost:18080/
Copying Files
You will also frequently need to copy files to the cluster:
[yourcomputer]$ scp code.py cluster.cs.sfu.ca:
Or whatever your preferred SCP/SFTP method is.
Spark Applications
Once you have the code there, you can start jobs as usual with spark-submit, and they will be sent to the cluster:
spark-submit code.py ...
Cleaning Up
If you have unnecessary files sitting around (especially large files created as part of an assignment), please clean them up with a command like this:
hdfs dfs -rm -r output*
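Before deleting, it can help to see how much space the matching paths actually use. The sketch below wraps that check and the removal in one function; the name hdfs_clean and the confirmation prompt are hypothetical, added only for illustration. The pattern is passed in quotes so that HDFS expands the wildcard, rather than the local shell matching it against files in your current directory.

```shell
# Hypothetical helper: show the size of matching HDFS paths, then delete
# them only after an explicit "y".
hdfs_clean() {
    local pattern="$1" answer
    # Summarized (-s), human-readable (-h) disk usage of what would be removed
    hdfs dfs -du -s -h "$pattern" || return 1
    printf 'Delete these from HDFS? [y/N] '
    read -r answer
    if [ "$answer" = "y" ]; then
        hdfs dfs -rm -r "$pattern"
    else
        echo "Nothing deleted."
    fi
}

# Usage, on the cluster gateway:
#   hdfs_clean 'output*'
```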
It is possible to have jobs running and consuming resources without realizing it: maybe you created an infinite loop, or otherwise have a job burning memory or CPU. You can list jobs running on the cluster like this:
yarn application -list
And kill a specific job:
yarn application -kill <APPLICATION_ID>
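The full list includes everyone's jobs, so picking out your own can be tedious. The filter below is a minimal sketch that reads the listing on standard input and prints only the application IDs belonging to one user. The function name my_yarn_apps is hypothetical, and the column layout (tab-separated, application ID first, user fourth) is an assumption about the output format that is worth checking against what the cluster actually prints.

```shell
# Hypothetical helper: print IDs of YARN applications owned by a user
# (default: $USER). Reads `yarn application -list` output on stdin.
# Assumes tab-separated columns: ID is column 1, user is column 4.
my_yarn_apps() {
    awk -F'\t' -v user="${1:-$USER}" '
        # Strip the padding spaces around a field
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Skip the summary and header lines; keep rows for this user only
        trim($1) ~ /^application_/ && trim($4) == user { print trim($1) }'
}

# Usage, on the cluster gateway:
#   yarn application -list 2>/dev/null | my_yarn_apps
# Then kill each ID you recognize with: yarn application -kill <APPLICATION_ID>
```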