
Project Ideas

Some Ideas

None of these ideas are carefully thought out. That's the point: they are intended as a starting point that a group can work from, not a problem specification.

  • An exploration of data compression. We have been using data that's either uncompressed or gzip compressed. Other options include snappy and lz4. How do these compare for read+uncompress and write+compress speed? Does storage type (spinning, SSD, NVMe) affect this significantly? Which should we choose, and in what circumstances?
  • There's a lot more to be discovered in the GHCN data that we used for a few assignments. The full data set is available on the cluster for further examination: Weather data processing details
  • The full Reddit Submissions and Comments corpus is available on the cluster. There are countless questions that could be asked about it. Reddit data processing details
    • What makes a submission good/bad? Perhaps time of day or sentiment or readability score affect how a post is perceived?
    • Do regional subreddits (e.g. for countries/provinces/cities) have activity that is proportional to their population? If not, why not? What does affect the size of their community?
  • Your phone is a powerful collection of sensors. Apps like Physics Toolbox Sensor Suite let you get at them (on Android, and no doubt similar apps exist on iOS).
  • The benchmarking of sorting algorithms that we did in an exercise was fairly minimal. Maybe there's a larger benchmarking opportunity you'd be interested in.
    • Can you find independent implementations of common algorithms and compare them? Possibly "identical" functions in glibc and musl are more different than they seem if examined closely.
    • How does the best implementation you can create compare to standard library code in C or Rust or Python?
    • Or you could compare Pandas with various other DataFrame libraries like Polars or Spark.
  • WikiData has an amazing variety of machine-readable data you can download or query and work with. This sublist could be effectively endless, but…
    • Any Wikipedia "List of X" article probably has corresponding WikiData data you could download.
    • There are a lot of lakes in BC (not all of which have a Wikipedia page: there is likely a more complete database of geographic features, or maybe you want the "notable" ones that do have a Wikipedia page). Can you combine a list of them with some kind of biodiversity data and find trends?
  • OpenStreetMap data can be exported, and the full planet.osm is available on the compute cluster. Questions about geographic features are endless.
    • Greg has some code to extract from the (painful) planet.osm file: if somebody needs it and bugs him, he'd probably get around to documenting it.
    • Overpass Turbo may help you extract relevant data as well.
  • Cities something something
  • James Hoffmann's Great American Coffee Taste Test results [full data link in video description]
  • The /r/financialindependence survey results may reveal interesting patterns if looked at the right way.
  • Something about chess moves/games, like the analysis from The rarest move in chess.
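
As a possible starting point for the data compression idea above, here is a minimal timing sketch. It uses only Python's standard library, so the codecs compared (gzip at two levels, and zlib) are stand-ins: snappy and lz4 need third-party packages, and how you would plug them in is left as an assumption. The names `time_codec`, `compare`, and `CODECS` are hypothetical, not part of any course code.

```python
import gzip
import time
import zlib


def time_codec(compress, decompress, data, repeats=3):
    """Return (best compress seconds, best decompress seconds, compressed size)."""
    best_c = best_d = float("inf")
    compressed = b""
    for _ in range(repeats):
        start = time.perf_counter()
        compressed = compress(data)
        best_c = min(best_c, time.perf_counter() - start)

        start = time.perf_counter()
        restored = decompress(compressed)
        best_d = min(best_d, time.perf_counter() - start)

        # Guard against a codec pair that does not round-trip.
        if restored != data:
            raise ValueError("round trip did not reproduce the input")
    return best_c, best_d, len(compressed)


# Hypothetical codec table: name -> (compress, decompress).
# Third-party codecs (e.g. lz4 or snappy bindings) could be added here
# as extra entries, assuming they expose bytes -> bytes functions.
CODECS = {
    "gzip-1": (lambda b: gzip.compress(b, compresslevel=1), gzip.decompress),
    "gzip-9": (lambda b: gzip.compress(b, compresslevel=9), gzip.decompress),
    "zlib-6": (lambda b: zlib.compress(b, 6), zlib.decompress),
}


def compare(data):
    """Time every codec in CODECS on the same in-memory bytes."""
    return {
        name: time_codec(compress, decompress, data)
        for name, (compress, decompress) in CODECS.items()
    }


if __name__ == "__main__":
    # Made-up repetitive sample; real data will compress very differently.
    sample = b"station,date,observation,value\n" * 200_000
    for name, (c_s, d_s, size) in compare(sample).items():
        ratio = len(sample) / size
        print(f"{name}: compress {c_s:.4f}s, decompress {d_s:.4f}s, ratio {ratio:.1f}x")
```

This measures in-memory speed only. Answering the storage-type question would also mean timing reads and writes of real files on each kind of device, with care taken around the operating system's file cache.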
Updated Thu June 27 2024, 22:28 by ggbaker.