Data Sets
This is a collection of publicly available data sets that we have used or will use (in whole or in part) for assignments in this course:
- Project Gutenberg CD/DVD images [400MB; 4GB; 8GB]
- Reddit Comments Corpus [150GB compressed] *
- Two months of NASA web logs [36MB compressed]
- Page view statistics for Wikimedia projects [∼1GB/day compressed]
- Global Historical Climatology Network [∼200MB/year compressed] *
- Wikipedia page-to-page link database [1GB compressed]
- Movie Tweetings [14MB]
* Greg has this data set: if you want to avoid a big download, ask for it.
Other Data Sets
Other data sets that we didn't use for assignments but that might be interesting for the project:
- Wikidata: structured data from Wikipedia [36GB compressed] *
- Reddit Comment and Submission Corpus *
- Canada Federal Election 2021 results
- Iowa Liquor Sales dataset [800MB]
- GeoNames data dump [1.3GB uncompressed]
- Wikipedia database downloads [12GB compressed] *
- OpenStreetMap database dump [80GB compressed] *
- Stack Exchange Data Dump [25GB]
- Yelp Academic Data Set
- DBLP article dataset
- Amazon product data
- Inside Airbnb
- Common Crawl Corpus
- Statistics Canada Developer resources
Other Lists
- Reddit /r/datasets
- Public data sets on AWS S3
- Archive.org data set collection
- Academic Torrents: “making [many] TB of research data available”
- Open Science Data Cloud: “Repository for public data sets of scientific interest”
- Great GitHub list of public data sets
- City of Vancouver Open Data catalogue
- Canada Open Government
- Opendatasoft
- Machine learning data sources (not all big, but maybe they contain inspiration):
Updated Wed Nov. 20 2024, 12:37 by ggbaker.