Project

The last few weeks of the course will be devoted to a final project. You can work on the project in small groups (2–4 members). See below for due dates (or the main CourSys page for the course).

The project topic is your choice: we will be looking for you to demonstrate a good understanding of the big data tools and how they are used to analyze data. Some possible places where difficulty may occur, which we will take into account when marking:

  • The creation of a data set (by downloading/aggregating/cleaning info, ETL work, etc), if relevant.
  • Data analysis and difficulty of computation.
  • Decent presentation, summarization, visualization of your results.
  • Technical difficulty: new technologies/techniques that you had to deal with to complete the project.

The data analysis is definitely required. The others might be part of your project or not (but at least some should be there). Having some decent presentation of your results is recommended.

See the collected DataSets links for some possible inspiration (and if you have any data sets to add, email Greg). See also ProjectIdeas for some potential ideas.

Proposal

Before Monday October 15 2018, submit a brief proposal (1 page). Please indicate the proposed topic/analysis/product, as well as the technologies you intend to use. Greg will provide some feedback on your proposed project.

Create your group in CourSys (the same group for all project-related activities) and submit to the CourSys activity Project Proposal.

On Scale

I will point out two things on the subject of how big the project should be: (1) it is worth 25% of the final grade and each assignment is worth 7.5%, so the project should be around three assignments' worth of work; (2) it is the final project for a 6-credit graduate-level course.

I will provide some feedback on scale along with your proposal.

Implementation

Implementation should use big data tools. This may include MapReduce or Spark, but is not limited to those. You are welcome to explore new technologies, and this will be considered to add to the difficulty of your project. See The Big-Data Ecosystem Table for many amusing big-data technologies (and some not so big and others not even so much about data).

There is no requirement that your project process big data: many interesting data sets aren't that big. But your implementation should be scalable: if the data set grew to be big, it should be able to process it. Your project will be more interesting (and a better portfolio piece) if you have a good-sized data set to work on.
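
As a rough illustration of what "scalable" can mean here, the following is a minimal sketch in plain Python, not Spark or MapReduce API code. It assumes the data arrives as independent partitions and that the analysis can be written as per-partition partial results merged with an associative operation, which is the general pattern those tools distribute across a cluster. The function names and the sample input are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Iterable


def partial_counts(lines: Iterable[str]) -> Counter:
    """Count words in one partition, reading one line at a time."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(a: Counter, b: Counter) -> Counter:
    """Combine two partial results.

    Addition of counts is associative and commutative, so partitions
    can be merged in any order (and, in a real framework, in parallel).
    """
    return a + b


def word_count(partitions: Iterable[Iterable[str]]) -> Counter:
    """Aggregate over all partitions without holding raw data in memory."""
    return reduce(merge_counts, map(partial_counts, partitions), Counter())


# Hypothetical input: two tiny partitions standing in for files on a cluster.
partitions = [
    ["big data tools", "data analysis"],
    ["spark and mapreduce", "big data"],
]
totals = word_count(partitions)
print(totals.most_common(2))
```

Nothing in this sketch depends on the whole data set fitting in memory at once: only the partial results are kept, so the same structure would still work if the number or size of partitions grew.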

At the same time, please don't plan to do computations that will monopolize our (shared) cluster's resources unreasonably.

Source Control

You must use a Git repository for your project. The department's GitLab server is a good way to get one (instructions at that link). Group members must commit their own contributions to the repo. Please give the instructor and TAs (ggbaker, sannamal, vshukla, rpsingh) developer access to your repository.

You are encouraged to publicize and open-source your work on GitHub or similar.

Computing Power

See ProjectTech and ProjectCluster.

Demo Day

We will have a time to demonstrate your project on Thursday December 06 2018 from 1:00 to 4:00 in the Big Data atrium (10000-level of the ASB). During this time, the instructors and TAs will get a first look at your project, so we have some idea of what we'll be looking at when marking the final implementation.

We will be asking you to set up computers around the space to show your project to whoever stops by (us, other students, other faculty members who wander by). There is no formal presentation: just be ready to talk about and show off your work.

If your group doesn't have a laptop to bring, ask and we'll arrange something.

Final Implementation

The final implementation is due Friday December 07 2018. You will submit a tag from your repository (git tag final; git push --tags) to the CourSys activity Project.

In your repository, please include a file RUNNING.txt (or RUNNING.md if you prefer) indicating how we can actually test your project: commands on the cluster, input files, etc.

In your repository's README.md, you may include other notes about things we should look for, or be aware of when marking. If you created some kind of web frontend, please include a URL in the README.md as well.

Report

You will submit a report of at most 5 pages giving an overview of your project. (A little more is okay for groups of 3; a little less is probably appropriate for groups of 1 or 2.)

  • Problem definition: What is the problem that you are trying to solve? What are the challenges of this problem?
  • Methodology: Briefly explain which tool(s)/technique(s) were used for which task and why you chose to implement it that way.
  • Problems: What problems did you encounter while attacking the problem? How did you solve them?
  • Results: What are the outcomes of the project? What did you learn from the data analysis? What did you learn from the implementation?
  • ProjectSummary: A summary of what you did to guide our marking.

The report is also due Friday December 07 2018 and should be submitted to Project as a PDF or a URL to HTML (only one of those is necessary).

[Thanks to Jean-Pierre Lozi and Arash Vahdat who taught this course before me, and whose work I have based this and many assignment questions on.]

Updated Fri Nov. 30 2018, 13:40 by ggbaker.