We have a small Hadoop cluster for this course, based on Cloudera Express.
The goal here is to connect to gateway.sfucloud.ca by SSH. Since you can't connect directly from the outside world, it's not completely straightforward.
Option 1: the right way
If you don't already have one, create an SSH key so you can log in without a password. The command will be something like:
ssh-keygen -t rsa -b 4096 -N ""
Then copy your public key to the server:
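One common way to do this is ssh-copy-id, which ships with most OpenSSH installations. The sketch below assumes your key is at the default path that ssh-keygen -t rsa writes (~/.ssh/id_rsa.pub). The name copy_key is a hypothetical helper for illustration, not part of the course setup.

```shell
# Hypothetical helper: copy a public key to the gateway so later logins can use it.
# Usage: copy_key USERID [PATH_TO_PUBLIC_KEY]
copy_key() {
  userid="$1"
  # Assumed default: the path ssh-keygen -t rsa uses when no -f is given.
  pubkey="${2:-$HOME/.ssh/id_rsa.pub}"
  if [ ! -f "$pubkey" ]; then
    echo "no public key at $pubkey; run ssh-keygen first" >&2
    return 1
  fi
  # ssh-copy-id appends the key to ~/.ssh/authorized_keys on the server.
  ssh-copy-id -i "$pubkey" "$userid@gateway.sfucloud.ca"
}
```

Run it once with your own user ID and enter your password when prompted; after that, logins should use the key instead.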
Create ~/.ssh/config (or add to it if it already exists) on your local computer, not the cluster gateway, with this configuration. It will let you connect to the cluster by SSH: then you can simply run
ssh gateway.sfucloud.ca to connect.
Host gateway.sfucloud.ca
    User <USERID>
    LocalForward 8088 master.sfucloud.ca:8088
    LocalForward 19888 master.sfucloud.ca:19888
    LocalForward 50070 master.sfucloud.ca:50070
With this configuration, port forwards will let you connect (in a limited unauthenticated way) to the web interfaces:
- HDFS namenode: http://localhost:50070/
- YARN resource manager: http://localhost:8088/
- MapReduce job history server: http://localhost:19888/
Once it's set up, you should be able to copy files and connect remotely quickly:
scp wordcount.jar gateway.sfucloud.ca:
ssh gateway.sfucloud.ca
Option 2: just get it working
You will be connecting to the cluster a lot, so you will want to set things up more nicely (as in Option 1) to make your life easier later. But this should at least work.
You generally just need to SSH to gateway.sfucloud.ca (substituting whatever SSH method you use on your computer):
[yourcomputer]$ ssh <USERID>@gateway.sfucloud.ca
[gateway]$
Once you're connected to the Hadoop gateway, you can start running
You will also frequently need to copy files to the cluster:
[yourcomputer]$ scp assignment.jar <USERID>@gateway.sfucloud.ca:
If you need access to the web frontends in the cluster, you can do the initial SSH with a much longer command including a bunch of port forwards:
ssh -L 50070:master.sfucloud.ca:50070 -L 8088:master.sfucloud.ca:8088 <USERID>@gateway.sfucloud.ca
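The -L options in that command all follow the same pattern (local port, then host and port on the cluster side), so they can be generated rather than typed out. This is only a sketch: forward_args is a hypothetical helper name, and it assumes a POSIX-style shell such as bash or sh where unquoted command substitution is split into words.

```shell
# Hypothetical helper: print one -L option per port, all pointing at the same host.
# Usage: forward_args HOST PORT [PORT...]
forward_args() {
  host="$1"
  shift
  for port in "$@"; do
    # The format string is reused for each argument, giving "-L port:host:port ".
    printf '%s ' "-L" "$port:$host:$port"
  done
}

# Example (replace USERID with your own user ID):
#   ssh $(forward_args master.sfucloud.ca 50070 8088 19888) USERID@gateway.sfucloud.ca
```

Adding 19888 to the port list, as in the example comment, also forwards the MapReduce job history server.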
Then at the command line, use the application ID from that list to get the logs like this:
yarn logs -applicationId application_1234567890123_0001 | less
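Logs can be long, so it may be easier to save them to a file than to page through them. A minimal sketch follows, where save_logs is a hypothetical helper and naming the file after the application ID is an arbitrary choice:

```shell
# Hypothetical helper: write the logs for one application to <application ID>.log
# in the current directory.
save_logs() {
  app_id="$1"
  # Rough sanity check on the ID shape before calling yarn.
  case "$app_id" in
    application_*_*) ;;
    *) echo "not an application ID: $app_id" >&2; return 1 ;;
  esac
  yarn logs -applicationId "$app_id" > "$app_id.log"
}

# Example, run on the gateway:
#   save_logs application_1234567890123_0001
#   less application_1234567890123_0001.log
```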
If you have unnecessary files sitting around (especially large files created as part of an assignment), please clean them up with a command like this:
hdfs dfs -rm -r /user/<USERID>/output*
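Before deleting, it can be worth checking what the pattern matches and how much space it uses. The sketch below assumes your files live under /user/ followed by your user ID, as in the command above. The name show_outputs is a hypothetical helper.

```shell
# Hypothetical helper: show the summarised size of each path the cleanup pattern matches.
# Usage: show_outputs USERID
show_outputs() {
  userid="$1"
  # The pattern is quoted so the local shell leaves the * for HDFS to expand.
  hdfs dfs -du -s -h "/user/$userid/output*"
}
```

If the listing contains only things you no longer need, the rm command with the same pattern should remove exactly those paths.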
It is possible that you have jobs running and consuming resources without realizing it: maybe you created an infinite loop or otherwise have a job burning memory or CPU. You can list jobs running on the cluster like this:
yarn application -list
And kill a specific job:
yarn application -kill <APPLICATION_ID>
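If you have several stray jobs, the two commands can be combined. The sketch below rests on an assumption about the listing format: that yarn application -list prints tab-separated columns with the application ID first and the owning user fourth. Check that against the output on the cluster before relying on it. The name my_app_ids is a hypothetical helper.

```shell
# Hypothetical helper: read `yarn application -list` output on standard input and
# print the IDs of applications owned by the given user.
# Assumes tab-separated columns: ID in column 1, user in column 4.
my_app_ids() {
  awk -F '\t' -v user="$1" '
    $1 ~ /^ *application_/ {
      id = $1; owner = $4
      gsub(/^ +| +$/, "", id)      # strip the padding around each column
      gsub(/^ +| +$/, "", owner)
      if (owner == user) print id
    }'
}

# Example, run on the gateway (assumes $USER is your cluster user ID):
#   yarn application -list | my_app_ids "$USER" |
#     while read -r id; do yarn application -kill "$id"; done
```

Running the pipeline without the final loop first shows which applications would be killed.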