
Cluster

We have a small Hadoop cluster for this course, based on Cloudera Express.

Connecting Remotely

The goal here is to connect to gateway.sfucloud.ca by SSH. Since you can't connect directly from the outside world, it's not completely straightforward.

Option 1: the right way

If you don't already have one, create an SSH key so you can log in without a password. The command will be something like:

ssh-keygen -t rsa -b 4096 -N ""
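
(If you accept the defaults, the key pair ends up in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub; the -N "" option gives the key an empty passphrase, so you won't be prompted for one when connecting.)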

Then copy your public key to the server:

ssh-copy-id <USERID>@gateway.sfucloud.ca
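
If your computer doesn't have ssh-copy-id (some Windows SSH setups don't), you can append the public key by hand; this sketch assumes the key was created at the default path ~/.ssh/id_rsa.pub:

cat ~/.ssh/id_rsa.pub | ssh <USERID>@gateway.sfucloud.ca 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'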

On your local computer (not the cluster gateway), create or add to ~/.ssh/config the configuration below, which will let you connect to the cluster by SSH. Then you can simply ssh gateway.sfucloud.ca to connect.

Host gateway.sfucloud.ca
  User <USERID>
  LocalForward 8088 master.sfucloud.ca:8088
  LocalForward 19888 master.sfucloud.ca:19888
  LocalForward 50070 master.sfucloud.ca:50070

With this configuration, the port forwards will let you connect (in a limited, unauthenticated way) to the cluster's web interfaces from your own browser: the YARN ResourceManager at http://localhost:8088/, the MapReduce job history server at http://localhost:19888/, and the HDFS NameNode at http://localhost:50070/.

Once it's set up, you should be able to copy files and connect remotely quickly:

scp wordcount.jar gateway.sfucloud.ca:
ssh gateway.sfucloud.ca

Option 2: just get it working

You will be connecting to the cluster a lot, so you will probably want to set things up more nicely (as in Option 1) to make your life easier later. But this should at least work.

You generally just need to SSH to gateway.sfucloud.ca (using whatever SSH client you have on your computer):

[yourcomputer]$ ssh <USERID>@gateway.sfucloud.ca 
[gateway] $

Once you're connected to the Hadoop gateway, you can start running hdfs and yarn commands.
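
For example, a few common commands look like this (somefile.txt, SomeMainClass, and the input/output directory names are only placeholders for whatever your assignment actually uses):

hdfs dfs -mkdir -p input              # make a directory under your HDFS home
hdfs dfs -put somefile.txt input/     # copy a local file into it
hdfs dfs -ls input                    # check that it arrived
yarn jar assignment.jar SomeMainClass input output   # run a job from a JAR on the cluster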

You will also frequently need to copy files to the cluster:

[yourcomputer]$ scp assignment.jar <USERID>@gateway.sfucloud.ca:

If you need access to the web frontends in the cluster, you can do the initial SSH with a much longer command including a bunch of port forwards:

ssh -L 50070:master.sfucloud.ca:50070 -L 8088:master.sfucloud.ca:8088 <USERID>@gateway.sfucloud.ca

Job Logs

If you have set up your SSH config file as described above, you can see the list of jobs that have run on the cluster at http://localhost:8088/.

Then at the command line, use the application ID from that list to get the logs like this:

yarn logs -applicationId application_1234567890123_0001 | less
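
The log output can be long: besides paging through it with less, you can redirect it to a file (the filename here is just an example) and search it later:

yarn logs -applicationId application_1234567890123_0001 > job-logs.txt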

Cleaning Up

If you have unnecessary files sitting around (especially large files created as part of an assignment), please clean them up with a command like this:

hdfs dfs -rm -r /user/<USERID>/output*
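
If you aren't sure what you have stored, you can list your files and their sizes in HDFS with:

hdfs dfs -du -h /user/<USERID>/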

It is possible that you have jobs running and consuming resources without knowing it: maybe you created an infinite loop or otherwise have a job burning memory or CPU. You can list the jobs running on the cluster like this:

yarn application -list

And kill a specific job:

yarn application -kill <APPLICATION_ID>