Project Ideas

Some Ideas

None of these ideas are carefully thought out. That's the point: they are intended as a starting point that a group can work from, not a problem specification.

An exploration of data compression. We have been using data that's either uncompressed or gzip compressed. Other options include snappy and lz4. How do these compare for read+uncompress and write+compress speed? Does storage type (spinning, SSD, NVMe) affect this significantly? Which should we choose, and in what circumstances?
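As a possible starting point for the compression comparison, here is a minimal sketch that times compress and decompress steps in memory. It uses only codecs from the Python standard library (gzip, bz2, lzma) as stand-ins. Snappy and lz4 need third-party packages, which this sketch assumes you would add to the `CODECS` dictionary yourself. The names `benchmark_codec`, `run_benchmarks`, and `CODECS`, and the synthetic sample data, are made up for illustration.

```python
import bz2
import gzip
import lzma
import time


def benchmark_codec(name, compress, decompress, data, repeats=3):
    """Return best-of-N compress/decompress times and the compression ratio."""
    best_compress = best_decompress = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        packed = compress(data)
        best_compress = min(best_compress, time.perf_counter() - start)

        start = time.perf_counter()
        unpacked = decompress(packed)
        best_decompress = min(best_decompress, time.perf_counter() - start)

    # Sanity check: the round trip must give back the original bytes.
    assert unpacked == data
    return {
        "codec": name,
        "ratio": len(data) / len(packed),
        "compress_s": best_compress,
        "decompress_s": best_decompress,
    }


# Standard-library codecs only; other codecs (assumed installed separately)
# could be added here as (compress, decompress) pairs.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def run_benchmarks(data, codecs=CODECS):
    """Benchmark every codec in the mapping on the same bytes."""
    return [benchmark_codec(name, c, d, data) for name, (c, d) in codecs.items()]


if __name__ == "__main__":
    # Synthetic, highly repetitive data: real data will compress differently.
    sample = b"station,date,observation,value\n" * 50_000
    for row in run_benchmarks(sample):
        print(f"{row['codec']:5s} ratio={row['ratio']:.1f} "
              f"compress={row['compress_s']:.4f}s "
              f"decompress={row['decompress_s']:.4f}s")
```

This only measures codec speed in memory. Answering the storage-type question would mean timing actual file reads and writes on each kind of device, and probably on real data rather than a synthetic sample.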
There's a lot more to be discovered in the GHCN data that we used for a few assignments. The full data set is available on the cluster for further examination: Weather data processing details
- Obviously people travel for more than the weather, but what can you discover about the weather in popular tourist destinations?
The full Reddit Submissions and Comments corpus is available on the cluster. There are countless questions that could be asked about it: Reddit data processing details
- What makes a submission good/bad? Perhaps time of day or sentiment or readability score affect how a post is perceived?
- Do regional subreddits (e.g. for countries/provinces/cities) have activity that is proportional to their population? If not, why not? What does affect the size of their community?
Your phone is a powerful collection of sensors. Apps like Physics Toolbox Sensor Suite let you get at them (on Android, and no doubt similar apps exist on iOS).
- A GoPro has an even more impressive collection of sensors. Any of the above, but with a GoPro.
The benchmarking of sorting algorithms that we did in an exercise was fairly minimal. Maybe there's a larger benchmarking opportunity you'd be interested in.
- Can you find independent implementations of common algorithms and compare them? Possibly "identical" functions in glibc and musl are more different than they seem if examined closely.
- How does the best implementation you can create compare to standard library code in C or Rust or Python?
- Or you could compare Pandas with various other DataFrame libraries like Polars or Spark.
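For the "your implementation vs. the standard library" question, a rough sketch of a timing harness might look like the following. The `merge_sort` here is just one simple hand-written candidate, and `time_sort` is a hypothetical helper name. Neither is meant as the best possible implementation or the way the exercise did it.

```python
import random
import timeit


def merge_sort(values):
    """Plain top-down merge sort; returns a new sorted list."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left = merge_sort(values[:mid])
    right = merge_sort(values[mid:])

    # Merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def time_sort(sort_func, data, repeats=5):
    """Best-of-N wall-clock time to sort a fresh copy of data."""
    # Copy inside the timed call so every run sees the same unsorted input.
    return min(timeit.repeat(lambda: sort_func(list(data)), number=1, repeat=repeats))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are comparable
    for n in (1_000, 10_000, 100_000):
        data = [rng.random() for _ in range(n)]
        ours = time_sort(merge_sort, data)
        builtin = time_sort(sorted, data)
        print(f"n={n:>7} merge_sort={ours:.4f}s sorted={builtin:.4f}s")
```

Taking the minimum over several repeats reduces noise from other processes. A fuller benchmark would also vary the input (already sorted, reversed, many duplicates) and the element type.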
WikiData has an amazing variety of machine-readable data you can download or query and work with. This sublist could be effectively endless, but…
- Any Wikipedia "List of X" article probably has corresponding WikiData data you could download.
- There are a lot of lakes in BC (not all of which have a Wikipedia page: there is likely a more complete database of geographic features, or maybe you want the "notable" ones that do have a Wikipedia page). Can you combine a list of them with some kind of biodiversity data and find trends?
OpenStreetMap data can be exported, and the full planet.osm is available on the compute cluster. Questions about geographic features are endless.
- Greg has some code to extract from the (painful) planet.osm file: if somebody needs it and bugs him, he'd probably get around to documenting it.
- Overpass Turbo may help you extract relevant data as well.
Cities something something
James Hoffmann's Great American Coffee Taste Test results [full data link in video description]
The /r/financialindependence survey results may reveal something interesting if looked at the right way.
Something about chess moves/games, like the analysis from The rarest move in chess.

Updated Thu June 27 2024, 22:28 by ggbaker.

