Project Ideas
Some Ideas
None of these ideas are carefully thought out. That's the point: they are intended as a starting point that a group can work from, not a problem specification.
- An exploration of data compression. We have been using data that's either uncompressed or gzip compressed. Other options include Snappy and LZ4. How do these compare for read+decompress and write+compress speed? Does storage type (spinning disk, SSD, NVMe) affect this significantly? Which should we choose, and in what circumstances? (A possible starting point is sketched just after this list.)
- There's a lot more to be discovered in the GHCN data that we used for a few assignments. The full data set is available on the cluster for further examination.
- Obviously people travel for more than the weather, but what can you discover about the weather in popular tourist destinations?
- The full Reddit Submissions and Comments corpus is available on the cluster. There are countless questions that could be asked about it.
- What makes a submission good/bad? Does time of day, sentiment, or readability score affect how a post is perceived?
- Do regional subreddits (e.g. for countries/provinces/cities) have activity that is proportional to their population? If not, why not? What does affect the size of their community?
- Your phone is a powerful collection of sensors. Apps like Physics Toolbox Sensor Suite let you get at them (on Android; no doubt similar apps exist on iOS).
- A GoPro has an even more impressive collection of sensors. Any of the above, but with a GoPro.
- The benchmarking of sorting algorithms that we did in an exercise was fairly minimal. Maybe there's a larger benchmarking opportunity you'd be interested in.
- Can you find independent implementations of common algorithms and compare them? Possibly "identical" functions in glibc and musl are more different than they seem if examined closely.
- How does the best implementation you can create compare to standard library code in C or Rust or Python?
- Or you could compare Pandas with various other DataFrame libraries like Polars or Spark.
- WikiData has an amazing variety of machine-readable data you can download or query and work with. This sublist could be effectively endless, but…
- Any Wikipedia "List of X" article probably has corresponding WikiData data you could download.
- There are a lot of lakes in BC (not all of which have a Wikipedia page: there is likely a more complete database of geographic features, or maybe you want the "notable" ones that do have a Wikipedia page). Can you combine a list of them with some kind of biodiversity data and find trends?
- OpenStreetMap data can be exported, and the full planet.osm is available on the compute cluster. Questions about geographic features are endless.
- Overpass Turbo may help you extract relevant data as well.
- cities something something
- James Hoffmann's Great American Coffee Taste Test results (full data link in the video description).
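For the compression idea above, a starting point might look like the sketch below. It is only a minimal sketch: it assumes the third-party lz4 and python-snappy packages are installed, and it only times in-memory compression and decompression of a single file. Measuring the effect of storage type would mean actually reading and writing files on the drives in question, and a real comparison should repeat the timings.

```python
import gzip
import time

import lz4.frame   # assumption: the third-party "lz4" package is installed
import snappy      # assumption: the third-party "python-snappy" package is installed

CODECS = {
    'gzip': (gzip.compress, gzip.decompress),
    'lz4': (lz4.frame.compress, lz4.frame.decompress),
    'snappy': (snappy.compress, snappy.decompress),
}


def benchmark(path):
    # Read the whole file once so we only time the codecs, not the disk.
    data = open(path, 'rb').read()
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        compressed = compress(data)
        compress_time = time.perf_counter() - start

        start = time.perf_counter()
        decompress(compressed)
        decompress_time = time.perf_counter() - start

        print(f'{name}: ratio={len(compressed) / len(data):.3f} '
              f'compress={compress_time:.2f}s decompress={decompress_time:.2f}s')


benchmark('some-input-file.json')  # hypothetical input file
```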
Big Data Sets
You are welcome to use Spark and the cluster for your project, but there is no requirement to do big data. These data sets are available on the compute cluster (reformatted so they can be processed in Spark fairly easily) and may inspire interesting project ideas:
- The Reddit data: all Reddit submissions and comments up to March 2023, when the data collection stopped. The code at that link can be used to extract reasonably-sized subsets that you can work with off the cluster.
- The Global Historical Climatology Network data set: historical daily weather around the world.
- An OpenStreetMap data dump.
- A WikiData data dump.
- An English Wikipedia data dump.
If you're going to work with larger data sets, you probably want to extract smaller subsets of them, then move off the cluster and work with Pandas or similar. See the Reddit code linked above, or this code sketch from the lecture notes.
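To make that workflow concrete, here is a rough sketch (not the lecture-notes code; the input path, output directory, subreddit, and column names are assumptions, so check the real data's layout first): filter a large data set with Spark, write a small compressed subset, then continue with Pandas off the cluster.

```python
import glob

import pandas as pd
from pyspark.sql import SparkSession, functions

spark = SparkSession.builder.appName('extract reddit subset').getOrCreate()

# Hypothetical input path and column names: adapt to the data's actual layout.
comments = spark.read.json('/courses/datasets/reddit-comments/')
subset = comments.filter(functions.col('subreddit') == 'vancouver') \
    .select('author', 'created_utc', 'score', 'body')

# Write something small enough to copy off the cluster.
subset.write.json('reddit-vancouver-comments', mode='overwrite', compression='gzip')

# Later, off the cluster: read the extracted files into a single Pandas DataFrame.
local = pd.concat(
    pd.read_json(f, lines=True)
    for f in glob.glob('reddit-vancouver-comments/*.json.gz')
)
print(local['score'].describe())
```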
It's likely that Greg has code that does something with these datasets in Spark: ask on the discussion forum if you'd like guidance.