The last few weeks of the course will be devoted to a final project. You can work on the project in small groups (2–4 members). See below for due dates (or the main CourSys page for the course).
The project topic is your choice: we will be looking for you to demonstrate a good understanding of the big data tools and how they are used to analyze data. Some places where difficulty may arise, which we will take into account when marking:
- The creation of a data set (by downloading/aggregating/cleaning info, ETL work, etc), if relevant.
- Data analysis and difficulty of computation.
- Decent presentation, summarization, visualization of your results.
- Technical difficulty: new technologies/techniques that you had to deal with to complete the project.
The “data analysis” is definitely required. The others might be part of your project or not (but at least some should be there). Having some decent presentation of your results is recommended.
Before Monday October 26 2020, submit a brief proposal (≤1 page). Please indicate the proposed topic/analysis/product, as well as the technologies you intend to use. We will provide some feedback on your proposed project.
I will point out two things on the subject of how big the project should be: (1) it is worth 25% of the final grade and each assignment is worth 7%, so the project should be around three assignments' worth of work per person; (2) it is the final project for a 6-credit graduate-level course.
We will provide some feedback on scale along with your proposal.
Implementation should use big data tools. This may include MapReduce or Spark, but is not limited to those. You are welcome to explore new technologies, and this will be considered to add to the “difficulty” of your project. See the Big-Data Ecosystem Table for many amusing big-data technologies (and some not so “big” and others not even so much about “data”).
There is no requirement that your project process “big” data: many interesting data sets aren't that big. But your implementation should be scalable: if the data set grew to be “big”, your implementation should still be able to process it. Your project will be more interesting (and a better portfolio piece) if you have a good-sized data set to work on.
At the same time, please don't plan to do computations that will monopolize our (shared) cluster's resources unreasonably.
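As one illustration of what “scalable” can mean here, the following is a minimal pure-Python sketch, not Spark or MapReduce code and not a required approach. It computes a mean by summarizing each partition of the data independently and then merging the small partial results, which is the general shape of computation that distributed tools can spread across many machines. The function names (`partition_partial`, `merge_partials`, `scalable_mean`) and the sample data are hypothetical, chosen only for this example.

```python
from functools import reduce
from typing import Iterable, Tuple

# A partial result for one partition: (sum of values, number of values).
Partial = Tuple[float, int]


def partition_partial(values: Iterable[float]) -> Partial:
    """Summarize one partition in a single pass, keeping only two numbers in memory."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total, count


def merge_partials(a: Partial, b: Partial) -> Partial:
    """Combine two partial results. Addition is associative and commutative,
    so the partials can be merged in any order."""
    return a[0] + b[0], a[1] + b[1]


def scalable_mean(partitions: Iterable[Iterable[float]]) -> float:
    """Mean over all partitions without ever materializing the full data set."""
    partials = map(partition_partial, partitions)
    total, count = reduce(merge_partials, partials, (0.0, 0))
    if count == 0:
        raise ValueError("mean of an empty data set is undefined")
    return total / count


if __name__ == "__main__":
    # Hypothetical sample data: four partitions, one of them empty.
    parts = [[1.0, 2.0, 3.0], [4.0], [], [5.0, 6.0]]
    print(scalable_mean(parts))
```

The point of the design is that no step needs the whole data set at once: each partition is reduced to a constant-size summary, and only those summaries are combined. An approach that first collects every value into one in-memory list would give the same answer on small data but would stop working once the data outgrew a single machine.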
You must use a Git repository for your project. The department's GitLab server is a good way to get one (instructions at that link). Group members must commit their own contributions to the repo. Please give the instructor and TAs (ggbaker, sbergner, nangrish) developer access to your repository.
You are encouraged to publicize and open-source your work on GitHub or similar.
You will be demonstrating your project in a short video, due with the rest of the project.
- A 5–10 minute video presentation.
- Submit a URL to the video: on your YouTube channel if you like, or as a file in Google Drive or similar, and we will post it on the PMP channel.
- No student numbers or other personal info in the video. Include your names if you want, but it's not required.
- It is not required for you to have a webcam or show your face in the video.
- It is not required to have all group members appear in the video: if your group is happy with the distribution of work, then it's fine to have one or two group members take care of the presentation.
- Make sure your audio is understandable and your on-screen content is visible. In particular, make sure your font size (in your slides, web browser, editor, terminal, etc) is large enough to be readable on screen: typically about twice the size of usual editor text. When you are recording, make sure you speak clearly and generally a little louder than you usually would sitting in front of your computer.
The final implementation is due Friday December 11 2020. You will submit a tag from your repository (git tag final; git push --tags) to the CourSys activity Project.
In your repository, please include a file (RUNNING.md if you prefer) indicating how we can actually test your project: commands on the cluster, input files, etc.
In your repository's README.md, you may include other notes about things we should look for, or be aware of when marking. If you created some kind of web frontend, please include a URL in the README.md as well.
You will submit a report of at most 5 pages giving an overview of your project. (A little more is okay for bigger groups.)
- Problem definition: What is the problem that you are trying to solve? What are the challenges of this problem?
- Methodology: Briefly explain which tool(s)/technique(s) were used for which task and why you chose to implement it that way.
- Problems: What problems did you encounter while attacking the problem? How did you solve them?
- Results: What are the outcomes of the project? What did you learn from the data analysis? What did you learn from the implementation?
- Project Summary: A summary of what you did to guide our marking.
This is also due Friday December 11 2020 and submitted to Project as a PDF or a URL to HTML (only one of those is necessary).
[Thanks to Jean-Pierre Lozi and Arash Vahdat who taught this course before me, and whose work I have based this and many assignment questions on.]