Project

The last few weeks of the course will be devoted to a final project. You can work on the project in small groups (3–4 members). See below for due dates (or the main CourSys page for the course).

The project topic is your choice: we will be looking for you to demonstrate a good understanding of the big data tools and how they are used to analyze data. Some places where difficulty may arise, which we will take into account when marking:

• Data analysis and difficulty of computation.
• Decent presentation, summarization, visualization of your results.
• Technical difficulty: new technologies/techniques that you had to deal with to complete the project.

The data analysis is definitely required. The others might be part of your project or not (but at least some should be there). Having some decent presentation of your results is recommended.

See the collected DataSets links for some possible inspiration (and if you have any data sets to add, post in the discussion forum). See also ProjectIdeas for some potential ideas.

Proposal

Before Monday October 25 2021, submit a brief proposal (1 page). Please indicate the proposed topic/analysis/product, as well as the technologies you intend to use. We will provide some feedback on your proposed project.

Create your group in CourSys (same group for all project-related stuff, obviously) and submit to the CourSys activity Project Proposal.

On Scale

I will point out two things on the subject of how big the project should be: (1) it is worth 25% of the final grade and each assignment is worth 7%, so the project should be around three assignments worth of work per person; (2) it is the final project for a 6-credit graduate-level course.

We will provide some feedback on scale along with your proposal.

Implementation

Implementation should use big data tools. This may include MapReduce or Spark, but is not limited to those. You are welcome to explore new technologies, and doing so will be considered to add to the difficulty of your project. See The Big-Data Ecosystem Table for many amusing big-data technologies (and some not so big and others not even so much about data).

There is no requirement that your project process big data: many interesting data sets aren't that big. But your implementation should be scalable: if the data set grew to be big, it should be able to process it. Your project will be more interesting (and a better portfolio piece) if you have a good-sized data set to work on.
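
As a rough illustration of what "scalable" can mean here, the following is a minimal sketch in plain Python (not Spark or MapReduce themselves): each partition of the input is summarized independently, and the partial results are then merged with an associative operation. This is the general shape that lets a computation spread across many machines as the data grows. The function names and the sample data are hypothetical and exist only for this example.

```python
from collections import Counter
from functools import reduce
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """Summarize one partition, reading its lines one at a time."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(a: Counter, b: Counter) -> Counter:
    """Combine two partial results; the order of merging does not matter."""
    return a + b


def scalable_word_count(partitions: Iterable[Iterable[str]]) -> Counter:
    """Count per partition first, then merge the small partial results."""
    partials = (count_words(p) for p in partitions)
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample data: two "partitions" standing in for input splits.
partitions = [
    ["big data tools", "spark and mapreduce"],
    ["data sets grow", "big data"],
]
totals = scalable_word_count(partitions)
print(totals.most_common(2))
```

Because each partition is processed on its own and only the small summaries are combined, adding more data means adding more partitions rather than needing more memory in one place. An approach that first loads the whole data set into a single in-memory structure would not have this property.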

At the same time, please don't plan to do computations that will monopolize our (shared) cluster's resources unreasonably.

Source Control

You must use a Git repository for your project. The department's GitLab server is a good way to get one (instructions at that link). Group members must commit their own contributions to the repo. Please give the instructor and TAs (ggbaker, kaiyeec, bha59, vpa12, hjs10) developer access to your repository.

You are encouraged to publicize and open-source your work on GitHub or similar.

Computing Power

See ProjectTech and ProjectCluster.

Demos

You will be demonstrating your project in a short video, due with the rest of the project.

• A 5–10 minute video presentation.
• Submit a URL to the video: on your YouTube channel if you like, or a file in Google Drive or similar and we will post it on the PMP channel.
• No student numbers or other personal info in the video. Include your names if you want, but it's not required.
• It is not required for you to have a webcam or show your face in the video.
• It is not required to have all group members appear in the video: if your group is happy with the distribution of work, then it's fine to have one or two group members take care of the presentation.
• Make sure your audio is understandable and your on-screen content is visible. In particular, make sure your font size (in your slides, web browser, editor, terminal, etc) is large enough to be readable on screen: typically about twice the size of usual editor text. When you are recording, make sure you speak clearly and generally a little louder than you usually would sitting in front of your computer.

Final Implementation

The final implementation is due Friday December 10 2021. You will submit a tag from your repository (git tag final; git push --tags) to the CourSys activity Project.

In your repository, please include a file RUNNING.txt (or RUNNING.md if you prefer) indicating how we can actually test your project: commands on the cluster, input files, etc.

In your repository's README.md, you may include other notes about things we should look for, or be aware of when marking. If you created some kind of web frontend, please include a URL in the README.md as well.

Report

You will submit a report of at most 5 pages giving an overview of your project. (A little more is okay for bigger groups.)

• Problem definition: What is the problem that you are trying to solve? What are the challenges of this problem?
• Methodology: Briefly explain which tool(s)/technique(s) were used for which task and why you chose to implement it that way.
• Problems: What problems did you encounter while attacking the problem? How did you solve them?
• Results: What are the outcomes of the project? What did you learn from the data analysis? What did you learn from the implementation?
• ProjectSummary: A summary of what you did to guide our marking.

This is also due Friday December 10 2021 and submitted to Project as a PDF or a URL to HTML (only one of those is necessary).

[Thanks to Jean-Pierre Lozi and Arash Vahdat who taught this course before me, and whose work I have based this and many assignment questions on.]

Updated Wed Sept. 08 2021, 14:52 by ggbaker.