Project Cluster Notes
Our cluster is the obvious place to run code for the project, but since it's a shared resource we should all be a little careful. Please don't run jobs that monopolize the cluster's resources for long periods of time. If you have big jobs, please run them at night (or maybe early morning).
If you would like to explore Amazon Web Services, they can certainly provide you with (and charge you for) as much computing power as you need. The GitHub Student Developer Pack contains many useful things, including $50 of AWS credit.
See also the project technologies page.
If you are looking for more computation power than the cluster provides, ask Greg.
Storage & HDFS
Our cluster has around 12TB of storage, which should be enough to handle most of what is happening in the projects.
If you have large intermediate data (e.g. pre-ETL files, or partial calculation output), please clean it up when you no longer need it. You can also keep some files around but decrease their replication factor (to 1 in the example) to reduce the actual disk usage:
hdfs dfs -setrep -R 1 raw_downloads
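To see what a directory costs before and after changing its replication, something like the following may help. It is only a sketch: the function name shrink_replication is made up for illustration, and it assumes the hdfs command is available on your path.

```shell
# Hypothetical helper (not part of Hadoop): report a path's size, lower its
# replication factor, then report the size again.
# Usage: shrink_replication PATH [FACTOR]   (FACTOR defaults to 1)
shrink_replication() {
    target="$1"
    factor="${2:-1}"
    # Summarized, human-readable usage. On recent Hadoop versions this also
    # reports the space consumed across all replicas.
    hdfs dfs -du -s -h "$target"
    # Apply the new replication factor to everything under the path.
    hdfs dfs -setrep -R "$factor" "$target"
    hdfs dfs -du -s -h "$target"
}
```

For the example directory above, the call would be shrink_replication raw_downloads 1. Blocks are removed in the background, so the second size report may not reflect the change right away.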
If you need to transfer significant amounts of data to the cluster, you can upload to your home directory on the cluster. You can do some processing on the data there, or copy it into HDFS with hdfs dfs -copyFromLocal.
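As a rough illustration of the copy step, the sketch below wraps it in a small function. The name upload_to_hdfs and the paths in the usage note are placeholders, not anything defined by the cluster.

```shell
# Hypothetical helper: copy a local file or directory into HDFS, then list
# the destination to confirm that the files arrived.
# Usage: upload_to_hdfs LOCAL_PATH HDFS_PATH
upload_to_hdfs() {
    src="$1"
    dest="$2"
    # Only list the destination if the copy succeeded.
    hdfs dfs -copyFromLocal "$src" "$dest" && hdfs dfs -ls "$dest"
}
```

For example, upload_to_hdfs "$HOME/raw_downloads" raw_downloads would copy a raw_downloads directory from your cluster home directory into your HDFS home directory. A relative HDFS path is resolved against /user/<USERID>.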
If you're working in a group, you will probably want to use the same collection of data in HDFS. One of you can upload the data and make it world-readable like this:
hdfs dfs -chmod 0771 /user/<USERID>
hdfs dfs -chmod -R 0775 /user/<USERID>/dataset
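Once the permissions are set, the owner or a groupmate could confirm the result with something like the sketch below. The function name check_shared_dataset is invented for illustration, and it assumes the shared directory is called dataset as in the example above.

```shell
# Hypothetical helper: show the permission bits on the owner's home
# directory and shared dataset, then list the dataset's contents.
# Usage: check_shared_dataset USERID
check_shared_dataset() {
    user_dir="/user/$1"
    # -d lists the directories themselves rather than their contents, so
    # the permission strings for both are visible.
    hdfs dfs -ls -d "$user_dir" "$user_dir/dataset"
    # A groupmate who can run this listing can read the shared data.
    hdfs dfs -ls "$user_dir/dataset"
}
```

Mode 0771 on the home directory lets other users pass through it without listing its contents, while 0775 on the dataset lets them list and read it.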