Simon Fraser University

CourSys

Data Sets

This is a collection of publicly available data sets that we have used or will use (in whole or in part) for assignments in this course:

Project Gutenberg CD/DVD images [400MB; 4GB; 8GB]
Reddit Comments Corpus [150GB compressed] *
Two months of NASA web logs [36MB compressed]
Page view statistics for Wikimedia projects [∼1GB/day compressed]
Global Historical Climatology Network [∼200MB/year compressed] *
Wikipedia page-to-page link database [1GB compressed]
Movie Tweetings [14MB]

* Greg has a copy of this data set: if you want to avoid a big download, ask for it.

Other Data Sets

Other data sets that we didn't use for assignments but that might be interesting for the project:

Wikidata: structured data from Wikipedia [36GB compressed] *
Reddit Comment and Submission Corpus *
Canada Federal Election 2021 results
Iowa Liquor Sales dataset [800MB]
GeoNames data dump [1.3GB uncompressed]
Wikipedia database downloads [12GB compressed] *
OpenStreetMap database dump [80GB compressed] *
Stack Exchange Data Dump [25GB]
Yelp Academic Data Set
DBLP article dataset
Amazon product data
Inside AirBnB
Common Crawl Corpus
Statistics Canada Developer resources

Other Lists

Reddit /r/datasets
Public data sets on AWS S3
Archive.org data set collection
Academic Torrents: “making [many] TB of research data available”
Open Science Data Cloud: “Repository for public data sets of scientific interest”
Great Github list of public data sets
City of Vancouver Open Data catalogue
Canada Open Government
Opendatasoft
Machine learning data sources (not all big, but maybe they contain inspiration):

Updated Tue Aug. 26 2025, 14:15 by ggbaker.