We have a relatively modest Hadoop cluster for this course: 4 nodes, 60 cores, 128GB memory, 16TB storage.
The goal here is to connect to cluster.cs.sfu.ca by SSH.
[This will work if you are within the campus network (or have the SFU VPN).]
You will be connecting to the cluster a lot: you may want to get things set up more nicely to make your life easier later. But, this should at least work.
You generally just need to SSH to cluster.cs.sfu.ca (substituting whatever SSH method you use on your computer):
[yourcomputer]$ ssh -p24 <USERID>@cluster.cs.sfu.ca
[gateway]$
Once you're connected to the cluster gateway, you can start running
If you need access to the web frontends in the cluster, you can do the initial SSH with a longer command including some port forwards:
ssh -p24 -L 9870:controller.local:9870 -L 8088:controller.local:8088 <USERID>@cluster.cs.sfu.ca
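If you find yourself retyping the port forwards, a small shell function can assemble the command for you. This is only a sketch: the function name `cluster_ssh_cmd` is made up for illustration, `USERID` is a placeholder for your own user ID, and it assumes every forwarded port maps to the same port on controller.local, as in the command above. It prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cluster setup): print the ssh command
# that forwards each listed local port to the same port on controller.local.
cluster_ssh_cmd() {
    user="$1"
    shift
    cmd="ssh -p24"
    for port in "$@"; do
        # One -L flag per web frontend port
        cmd="$cmd -L $port:controller.local:$port"
    done
    echo "$cmd $user@cluster.cs.sfu.ca"
}

# Replace USERID with your own user ID.
cluster_ssh_cmd USERID 9870 8088
```

Copy and paste the printed command, or wrap the call in `eval "$(...)"` once you trust it.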
From On-Campus With SSH Keys
Once you have confirmed that you can connect, get things set up properly…
Create an SSH key (if you don't have one already) so you can log in without a password. Then copy your public key into .ssh/authorized_keys on the server (with ssh-copy-id or by appending to that file).
Create (or add to) the ~/.ssh/config file on your computer. With this config, you can simply ssh cluster.cs.sfu.ca to connect. (bonus: tab-completion)
Host cluster.cs.sfu.ca
    User <USERID>
    Port 24
    LocalForward 8088 controller.local:8088
    LocalForward 9870 controller.local:9870
With this configuration, port forwards will let you connect (in a limited unauthenticated way) to the web interfaces:
You will also frequently need to copy files to the cluster:
[yourcomputer]$ scp code.py <USERID>@cluster.cs.sfu.ca:
Or whatever your preferred SCP/SFTP method is.
From off-campus networks, you need an extra hop to get to the cluster. The most reliable option is probably to SSH to gateway.csil.sfu.ca (port 24) and connect to the cluster from there. That means the port forwards also need two steps: from your computer to the gateway, and from the gateway to the cluster. On a Linux-like system, you can add this to your ~/.ssh/config to forward from your computer to the gateway:
Host gateway.csil.sfu.ca
    User <USERID>
    Port 24
    LocalForward 8088 localhost:8088
    LocalForward 9870 localhost:9870
    ServerAliveInterval 120
Then on gateway.csil, you can set up an SSH key and config so connecting is easy in the future:
ssh-keygen -t ed25519
echo -e "Host cluster.cs.sfu.ca\n Port 24\n ServerAliveInterval 120\n LocalForward 8088 controller.local:8088\n LocalForward 9870 controller.local:9870" >> ~/.ssh/config
ssh-copy-id cluster.cs.sfu.ca
ssh cluster.cs.sfu.ca
Then to connect:
Copying files will also be a two-step process: to gateway.csil, and then from there:
scp mycode.py cluster.cs.sfu.ca:
Once you have the code there, you can start jobs as usual with spark-submit, and they will be sent to the cluster:
spark-submit code.py ...
If you have unnecessary files sitting around (especially large files created as part of an assignment), please clean them up with a command like this:
hdfs dfs -rm -r output*
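One thing to watch with a pattern like `output*`: if your current directory on the gateway happens to contain files matching it, your local shell expands the pattern before hdfs ever sees it. Quoting the pattern passes it through unchanged so that hdfs does the matching. The sketch below demonstrates the difference using a temporary local directory (the file name `output-local.txt` is made up for the demo). The hdfs commands at the end are left commented out and are only a suggestion for checking sizes before deleting.

```shell
#!/bin/sh
# Demo: an unquoted glob is expanded by the local shell against local files.
demo_dir="$(mktemp -d)"
cd "$demo_dir" || exit 1
touch output-local.txt           # hypothetical local file that matches output*

unquoted="$(echo output*)"       # expanded locally
quoted="$(echo 'output*')"       # passed through literally

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

cd - >/dev/null || exit 1
rm -r "$demo_dir"

# On the gateway, a more cautious cleanup might be:
# hdfs dfs -du -s -h 'output*'   # summarize how much space the matches use
# hdfs dfs -rm -r 'output*'      # then remove them
```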
It is possible that you have jobs running and consuming resources without knowing: maybe you created an infinite loop or otherwise have a job burning memory or CPU. You can list jobs running on the cluster like this:
yarn application -list
And kill a specific job:
yarn application -kill <APPLICATION_ID>
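The list includes everybody's jobs, so it can help to pick out your own. The sketch below is one way to do that, not an official tool: the function name `my_app_ids` is made up, and it assumes the output of `yarn application -list` is tab-separated with the application ID in the first column and the user in the fourth. Check that against what you actually see before relying on it. It prints the kill commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: read `yarn application -list` output on stdin and print
# the application IDs owned by the given user.
# Assumption: tab-separated columns, ID in column 1, user in column 4.
my_app_ids() {
    awk -F '\t' -v user="$1" '
        {
            id = $1
            owner = $4
            # Strip the padding spaces around each field
            gsub(/^[ ]+|[ ]+$/, "", id)
            gsub(/^[ ]+|[ ]+$/, "", owner)
            # Header and summary lines do not start with "application_"
            if (id ~ /^application_/ && owner == user) print id
        }'
}

# Only meaningful where the yarn command exists (i.e. on the cluster gateway).
if command -v yarn >/dev/null 2>&1; then
    yarn application -list 2>/dev/null | my_app_ids "${USER:-}" |
        while read -r id; do
            echo "yarn application -kill $id"
        done
fi
```

Review the printed commands and run the ones you actually want.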