The technologies used for the project are up to you.
Computing resources available:
- The cluster we have been using all semester. Please be reasonable when running large jobs on the shared cluster.
- The workstations in the lab: each has a 7th gen i7 processor, 32 GB memory, and a GTX 1050 GPU. Disk quota may be an issue, but your ~/VirtualBox VMs directory does not have quota restrictions. The workstations can be accessed remotely and should have CUDA/TensorFlow installed: see the documentation.
Technology that is available:
- MapReduce, Spark (as we have been using them).
- HDFS (as we have been using, and see notes on the ProjectCluster page).
- Cassandra database cluster (used in the assignments). See Cassandra instructions.
- Kafka is installed on the cluster (and used in the assignments). See Kafka instructions.
- HBase and Phoenix could be installed on the cluster.
- RabbitMQ can be installed on the gateway (so it is only really suitable for small-scale data: if you need high-volume RabbitMQ, ask).
The cluster is based on the Cloudera distribution, so packaged tools can be installed easily on the main cluster. That includes: Flume, Hive, Impala. If you'd like to use any of these, ask.
Additional virtual machines can be provisioned for a group if you want to experiment with other technologies, or have a web server for a frontend, or have other needs that we don't support on the main cluster.
If you would like to have a web frontend that displays results from Spark jobs, the slowness of creating a Spark session will be an issue. The solution is to have a long-running Spark instance and send in queries as needed.
One way to do that is to combine Spark with the Celery task queue. You should be able to set up Celery tasks that do Spark work, and then call them from your web frontend. If you want to use this solution, you need a RabbitMQ account set up: ask Greg.
Web Scraping and API Calls
If you need to scrape web pages or make API calls as part of your project, be very cautious about how often you make requests: it is very easy to get your IP address banned from these services.
Your first goal should be to make as few requests as possible. You should cache responses in some persistent way, and only fetch/request if you have never done so before. The logic will be something like:
    def get_remote_info(url):
        if url in cache:
            return cache[url]
        else:
            data = fetch(url)
            cache[url] = data
            return data
Depending on your situation, the cache could be a file or a Cassandra table or something else appropriate.
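As one possible way to make that cache persistent, here is a sketch that stores responses in a local SQLite file and also waits between real requests. The class name CachedFetcher, the helper http_get, the min_interval parameter, and the file name cache.sqlite3 are illustrative assumptions. A Cassandra table or plain files would follow the same pattern.

```python
import sqlite3
import time
import urllib.request


def http_get(url):
    """Default fetcher: a plain HTTP GET returning the body as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


class CachedFetcher:
    """Fetch each URL at most once, keeping responses in an SQLite file."""

    def __init__(self, db_path="cache.sqlite3", fetch=http_get, min_interval=1.0):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, body TEXT)"
        )
        self.db.commit()
        self.fetch = fetch
        self.min_interval = min_interval  # minimum seconds between real requests
        self._last_request = None

    def get_remote_info(self, url):
        row = self.db.execute(
            "SELECT body FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row is not None:
            return row[0]  # already fetched, possibly in an earlier run

        # Space out real requests so the remote service is not hammered.
        if self._last_request is not None:
            wait = self.min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)

        data = self.fetch(url)
        self._last_request = time.monotonic()
        # Commit immediately so an interrupted scraper keeps what it has fetched.
        self.db.execute(
            "INSERT OR REPLACE INTO cache (url, body) VALUES (?, ?)", (url, data)
        )
        self.db.commit()
        return data
```

Because the cache lives on disk, restarting the scraper does not repeat requests that already succeeded. Passing a different fetch function (for example, one that adds API authentication) leaves the caching logic unchanged.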
If you need to have a scraper doing this for a long time, you can run it on ts.sfucloud.ca in a tmux or GNU Screen session: that way, you can log off and reconnect later to check your progress.