The last few weeks of the course will be devoted to a final project. You can work on the project in small groups (2–4 members). See below for due dates (or the main CourSys page for the course).
The project topic is your choice: we will be looking for you to demonstrate a good understanding of the big data tools and how they are used to analyze data. Some areas where a project can show difficulty, which we will take into account when marking:
- The creation of a data set (by downloading/aggregating/cleaning info, ETL work, etc), if relevant.
- Data analysis and difficulty of computation.
- Decent presentation, summarization, visualization of your results.
- Technical difficulty: new technologies/techniques that you had to deal with to complete the project.
The “data analysis” is definitely required. The others might be part of your project or not (but at least some should be there). Having some decent presentation of your results is recommended.
Before Monday October 26 2020, submit a brief proposal (≤1 page). Please indicate the proposed topic/analysis/product, as well as the technologies you intend to use. We will provide some feedback on your proposed project.
I will point out two things on the subject of how big the project should be: (1) it is worth 25% of the final grade and each assignment is worth 7.5%, so the project should be around three assignments' worth of work; (2) it is the final project for a 6-credit graduate-level course.
We will provide some feedback on scale along with your proposal.
Implementation should use big data tools. This may include MapReduce or Spark, but is not limited to those. You are welcome to explore new technologies, and this will be considered to add to the “difficulty” of your project. See the Big-Data Ecosystem Table for many amusing big-data technologies (and some not so “big” and others not even so much about “data”).
There is no requirement that your project process “big” data: many interesting data sets aren't that big. But your implementation should be scalable: if the data set grew to be “big”, your implementation should still be able to process it. Your project will be more interesting (and a better portfolio piece) if you have a good-sized data set to work on.
At the same time, please don't plan to do computations that will monopolize our (shared) cluster's resources unreasonably.
You must use a Git repository for your project. The department's GitLab server is a good way to get one (instructions at that link). Group members must commit their own contributions to the repo. Please give the instructor and TAs (ggbaker, sbergner, nangrish) developer access to your repository.
You are encouraged to publicize and open-source your work on GitHub or similar.
We will have a time to demonstrate your project on [No activity "Demo"] from 10:00 to 1:00 in the Big Data atrium (10000-level of the ASB). During this time, the instructors and TAs will get a first look at your project, so we have some idea of what we'll be looking at when marking the final implementation.
We will be asking you to set up computers around the space to show your project to whoever stops by (us, other students, other faculty members who wander by). There is no formal presentation: just be ready to talk about and show off your work.
If your group doesn't have a laptop to bring, ask and we'll arrange something.
The final implementation is due Friday December 18 2020. You will submit a tag from your repository (git tag final; git push --tags) to the CourSys activity Project.
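If you have not pushed a tag before, the tag-and-push commands above can be rehearsed safely first. The following is only a sketch under stated assumptions: it builds a throwaway repository and a local bare repository standing in for the GitLab remote, both in a temporary directory. The remote name origin, the placeholder identity, and the file contents are illustrative and not part of the course requirements.

```shell
#!/bin/sh
# Rehearsal of the tag-and-push step in a throwaway repository.
# Everything lives in a temporary directory; your real project is not touched.
set -eu

work=$(mktemp -d)
git init --quiet --bare "$work/remote.git"   # stand-in for the real remote (assumption)
git init --quiet "$work/project"
cd "$work/project"
git config user.name "Example Student"       # placeholder identity (assumption)
git config user.email "student@example.com"
git remote add origin "$work/remote.git"

# Commit something so there is a commit to tag.
echo "how to run the project" > RUNNING.md
git add RUNNING.md
git commit --quiet -m "Add running instructions"

# The two commands from the instructions:
git tag final            # lightweight tag named "final" on the current commit
git push --quiet --tags  # send tags to the remote

# Check which tags the remote now has.
pushed=$(git ls-remote --tags origin | awk '{print $2}')
echo "$pushed"           # prints: refs/tags/final
```

In your real repository only the two tag commands are needed, run after all of your work has been committed. Note that git push --tags sends tags; push your branch as usual as well so the repository on the server is up to date.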
In your repository, please include a file (named RUNNING.md if you prefer) indicating how we can actually test your project: commands on the cluster, input files, etc.
In your repository's README.md, you may include other notes about things we should look for, or be aware of when marking. If you created some kind of web frontend, please include a URL in the README.md as well.
You will submit a report of at most 5 pages giving an overview of your project. (A little more is okay for bigger groups.) The report should cover:
- Problem definition: What is the problem that you are trying to solve? What are the challenges of this problem?
- Methodology: Briefly explain which tool(s)/technique(s) were used for which task and why you chose to implement it that way.
- Problems: What problems did you encounter while attacking the problem? How did you solve them?
- Results: What are the outcomes of the project? What did you learn from the data analysis? What did you learn from the implementation?
- Project Summary: A summary of what you did to guide our marking.
The report is also due Friday December 18 2020 and is submitted to Project as a PDF or a URL to HTML (only one of those is necessary).
[Thanks to Jean-Pierre Lozi and Arash Vahdat who taught this course before me, and whose work I have based this and many assignment questions on.]