Compute Cluster
We have a relatively modest Hadoop cluster for this course: 4 nodes, 60 cores, 128GB memory, 16TB storage.
Connecting Remotely
The goal here is to connect to cluster.cs.sfu.ca by SSH.
From On-Campus
[This will work if you are within the campus network (or have the SFU VPN).]
You will be connecting to the cluster a lot, so you may want to set things up to make your life easier later. But this should at least work.
You generally just need to SSH to cluster.cs.sfu.ca (substituting whatever SSH method you use on your computer):
[yourcomputer]$ ssh -p24 <USERID>@cluster.cs.sfu.ca
[gateway]$
Once you're connected to the cluster gateway, you can start running spark-submit (and hdfs) commands.
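If you haven't used HDFS before, a few common hdfs dfs subcommands look like this (a sketch run on the gateway; the file and directory names are placeholders):

```shell
hdfs dfs -ls                        # list your HDFS home directory
hdfs dfs -put localfile.txt         # copy a local file into HDFS
hdfs dfs -get output/part-00000 .   # copy a file out of HDFS to the current directory
hdfs dfs -cat output/part-00000     # print a file's contents to the terminal
```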
If you need access to the web frontends in the cluster, you can do the initial SSH with a longer command including some port forwards:
ssh -p24 -L 9870:controller.local:9870 -L 8088:controller.local:8088 <USERID>@cluster.cs.sfu.ca
From On-Campus With SSH Keys
Once you have confirmed that you can connect, get things set up properly…
Create an SSH key (if you don't have one already) so you can log in without a password. Then copy your public key into .ssh/authorized_keys on the server (with ssh-copy-id or by appending to ~/.ssh/authorized_keys).
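Concretely, that one-time setup might look like this (a sketch; substitute your own userid, and note that ssh-copy-id needs the -p option because of the nonstandard port):

```shell
ssh-keygen -t ed25519                          # create a key pair; accept the defaults or set a passphrase
ssh-copy-id -p 24 <USERID>@cluster.cs.sfu.ca   # append your public key to ~/.ssh/authorized_keys on the server
```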
Create (or add to) the ~/.ssh/config file on your computer. With this config, you can simply ssh cluster.cs.sfu.ca to connect. (Bonus: tab-completion.)
Host cluster.cs.sfu.ca
User <USERID>
Port 24
LocalForward 8088 controller.local:8088
LocalForward 9870 controller.local:9870
# Use this via `ssh clustergw` if you're connecting from off-campus without VPN
Host clustergw
HostName cluster.cs.sfu.ca
User <USERID>
Port 24
LocalForward 8088 controller.local:8088
LocalForward 9870 controller.local:9870
ProxyCommand ssh -p 24 gateway.csil.sfu.ca exec nc %h %p
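If your OpenSSH client is reasonably recent (7.3 or later), the ProxyCommand line can be replaced with the simpler ProxyJump directive. This is an equivalent sketch, not tested on every client:

```
ProxyJump <USERID>@gateway.csil.sfu.ca:24
```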
With this configuration, port forwards will let you connect (in a limited unauthenticated way) to the web interfaces:
- HDFS namenode: http://localhost:9870/
- YARN application master: http://localhost:8088/
Copying Files
You will also frequently need to copy files to the cluster:
[yourcomputer]$ scp code.py <USERID>@cluster.cs.sfu.ca:
Or whatever your preferred SCP/SFTP method is.
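If you prefer rsync (assuming it is installed at both ends), an equivalent copy looks like this; the -e option carries the nonstandard port:

```shell
rsync -e 'ssh -p 24' code.py <USERID>@cluster.cs.sfu.ca:
```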
From Off-Campus
From off-campus networks, you need an extra hop to get to the cluster. The most reliable option is probably to SSH to gateway.csil.sfu.ca (port 24). We need to do a two-step port forward. On a Linux-like system, you can add this to your ~/.ssh/config to forward from your computer to the gateway:
Host gateway.csil.sfu.ca
User <USERID>
Port 24
LocalForward 8088 localhost:8088
LocalForward 9870 localhost:9870
ServerAliveInterval 120
Then on gateway.csil, you can set up an SSH key and config so connecting is easy in the future:
ssh-keygen -t ed25519
cat >> ~/.ssh/config <<EOF
Host cluster.cs.sfu.ca
    Port 24
    ServerAliveInterval 120
    LocalForward 8088 controller.local:8088
    LocalForward 9870 controller.local:9870
EOF
ssh-copy-id cluster.cs.sfu.ca
ssh cluster.cs.sfu.ca
Then to connect:
ssh cluster.cs.sfu.ca
Copying Files
Copying files will also be a two-step process: to gateway.csil, and then from there:
scp mycode.py cluster.cs.sfu.ca:
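With a recent OpenSSH client (8.0 or later), you may be able to skip the intermediate copy and jump through the gateway in one step. A sketch, assuming scp's -J option is available on your client:

```shell
scp -P 24 -J <USERID>@gateway.csil.sfu.ca:24 mycode.py <USERID>@cluster.cs.sfu.ca:
```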
Spark Applications
Once you have the code there, you can start jobs as usual with spark-submit, and they will be sent to the cluster:
spark-submit code.py ...
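If a job needs more (or fewer) resources than the defaults, spark-submit takes standard options to control that. For example (the values here are purely illustrative, not cluster recommendations):

```shell
spark-submit --num-executors 4 --executor-memory 2g code.py input output
```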
Cleaning Up
If you have unnecessary files sitting around (especially large files created as part of an assignment), please clean them up with a command like this:
hdfs dfs -rm -r output*
It is possible to have jobs running and consuming resources without knowing it: maybe you created an infinite loop or otherwise have a job burning memory or CPU. You can list jobs running on the cluster like this:
yarn application -list
And kill a specific job:
yarn application -kill <APPLICATION_ID>