Project Ideas
Some Ideas
None of these ideas are carefully thought out. That's the point: they are intended as a starting point that a group can work from, not a problem specification.
- An exploration of data compression. We have been using data that's either uncompressed or gzip-compressed. Other options include snappy and lz4. How do these compare for read+uncompress and write+compress speed? Does storage type (spinning, SSD, NVMe) affect this significantly? Which should we choose, and in what circumstances?
- There's a lot more to be discovered in the GHCN data that we used for a few assignments. The full data set is available on the cluster for further examination: Weather data processing details
- Obviously people travel for more than the weather, but what can you discover about the weather in popular tourist destinations?
- The full Reddit Submissions and Comments corpus is available on the cluster. There are countless questions that could be asked about it: Reddit data processing details
- What makes a submission good/bad? Perhaps time of day or sentiment or readability score affect how a post is perceived?
- Do regional subreddits (e.g. for countries/provinces/cities) have activity that is proportional to their population? If not, why not? What does affect the size of their community?
- Your phone is a powerful collection of sensors. Apps like Physics Toolbox Sensor Suite let you get at them (on Android, and no doubt similar apps exist on iOS).
- A GoPro has an even more impressive collection of sensors. Any of the above, but with a GoPro.
- The benchmarking of sorting algorithms that we did in an exercise was fairly minimal. Maybe there's a larger benchmarking opportunity you'd be interested in.
- Can you find independent implementations of common algorithms and compare them? Possibly "identical" functions in glibc and musl are more different than they seem if examined closely.
- How does the best implementation you can create compare to standard library code in C or Rust or Python?
- Or you could compare Pandas with various other DataFrame libraries like Polars or Spark.
- WikiData has an amazing variety of machine-readable data you can download or query and work with. This sublist could be effectively endless, but…
- Any Wikipedia "List of X" article probably has corresponding WikiData data you could download.
- There are a lot of lakes in BC (not all of which have a Wikipedia page: there is likely a more complete database of geographic features, or maybe you want the "notable" ones that do have a Wikipedia page). Can you combine a list of them with some kind of biodiversity data and find trends?
- OpenStreetMap data can be exported, and the full planet.osm is available on the compute cluster. Questions about geographic features are endless.
- Greg has some code to extract from the (painful) planet.osm file: if somebody needs it and bugs him, he'd probably get around to documenting it.
- Overpass Turbo may help you extract relevant data as well.
- Cities something something
- James Hoffmann's Great American Coffee Taste Test results [full data link in video description]
- The /r/financialindependence survey results may reveal interesting patterns if looked at the right way.
- Something about chess moves/games, like the analysis from The rarest move in chess.
- Exploration of Canadian grocery price data.
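
For the data compression idea near the top of the list, a timing harness might start out something like the sketch below. It is only a rough starting point, not a recommended methodology. The function names (`benchmark`, `run_all`) and the synthetic sample data are made up for illustration. It uses only codecs from the Python standard library as stand-ins: snappy and lz4 need third-party packages, which would have to be added to the `CODECS` table. It also compresses in memory, so it says nothing about storage type; a real comparison would read and write files on the devices in question.

```python
import bz2
import gzip
import lzma
import time
import zlib


def benchmark(name, compress, decompress, data, repeats=3):
    """Time one codec on `data`, keeping the best of `repeats` runs."""
    best_compress = best_decompress = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        packed = compress(data)
        best_compress = min(best_compress, time.perf_counter() - start)

        start = time.perf_counter()
        unpacked = decompress(packed)
        best_decompress = min(best_decompress, time.perf_counter() - start)

        # The codec must round-trip the data exactly.
        assert unpacked == data
    return {
        "codec": name,
        "ratio": len(data) / len(packed),
        "compress_s": best_compress,
        "decompress_s": best_decompress,
    }


# Standard-library stand-ins only; snappy and lz4 are NOT included here.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def run_all(data):
    """Benchmark every codec in CODECS on the same bytes."""
    return [benchmark(name, c, d, data) for name, (c, d) in CODECS.items()]


if __name__ == "__main__":
    # Synthetic, highly repetitive sample: real data will behave differently.
    sample = b"station,date,observation,value\n" * 50_000
    for row in run_all(sample):
        print(
            f"{row['codec']:>5}  ratio={row['ratio']:.1f}  "
            f"compress={row['compress_s']:.4f}s  "
            f"decompress={row['decompress_s']:.4f}s"
        )
```

Highly repetitive synthetic input like this flatters every codec, so the interesting part of a project would be running it on the actual course data sets and on different storage.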
Updated Mon Sept. 09 2024, 10:13 by ggbaker.