Project: Scraping Data
If you're getting data from any kind of online source, download it in a way that ensures you only have to do it once.
One easy way to do that is Python's shelve module, which gives you a dictionary-like data structure that is automatically persisted to disk (with the limitations that keys must be strings and values must be serializable with pickle).
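As a quick illustration of that persistence (the filename here is just for the demo), values written in one session are still there when you reopen the shelf:

import shelve

with shelve.open('demo.shelf') as db:   # flag defaults to 'c': create if missing
    db['answer'] = {'value': 42}        # key is a string; value is pickled

with shelve.open('demo.shelf') as db:
    print(db['answer'])                 # {'value': 42} -- survived the reopen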
Code like this will fetch data for a collection of keys, requesting only the keys that haven't been fetched before.
import shelve

datafile = "project_data.shelf"
with shelve.open(datafile, 'c') as data:   # 'c': open, creating the shelf if needed
    for k in keys_to_fetch:
        key = str(k)                       # shelve keys must be strings
        if key not in data:                # skip anything already fetched
            print(f'Requesting {k}...')
            data[key] = my_api_call(k)
If you're scraping HTML, my_api_call may be something like urllib.request.urlopen(url).read(), which fetches the raw HTML; you can then pick it apart for the data you want in a subsequent step.
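For example, a minimal fetcher along those lines might look like the sketch below. It assumes each key is (or maps to) a URL; the User-Agent string and the UTF-8 decoding are assumptions you should adjust for the sites you're scraping.

import urllib.request

def my_api_call(url):
    # Hypothetical fetcher: returns the page body for a URL. Some sites
    # reject urllib's default User-Agent, so we set one explicitly.
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode('utf-8')   # assumes UTF-8; adjust if needed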
Then code like this can get the data into a more usable format (and it's fast to re-run if you discover you missed something).
import pandas as pd
import shelve

datafile = "project_data.shelf"
thing1 = []
thing2 = []
with shelve.open(datafile, 'r') as data:   # 'r': read-only, since we're only extracting
    for v in data.values():                # the keys aren't needed here
        thing1.append(v['thing1'])
        thing2.append(v['thing2'])
df = pd.DataFrame({'thing1': thing1, 'thing2': thing2})
df.to_json('data.ndjson', orient='records', lines=True)   # newline-delimited JSON
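If you want to sanity-check the output, newline-delimited JSON reads straight back into pandas (a quick round trip, assuming the file written above):

import pandas as pd

df2 = pd.read_json('data.ndjson', lines=True)   # one JSON record per line
print(df2.head())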