
Project Notes

Scraping Data

If you're fetching data from any kind of online source, download it in a way that ensures you only have to do it once.

One easy way to do that is Python's shelve module, which gives you a dictionary-like data structure that is automatically persisted to disk (with the limitations that keys must be strings and values must be serializable with pickle).
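A minimal sketch of how shelve behaves (the filename `demo.shelf` and the stored record are made up for illustration): anything you store under a string key is still there when you reopen the file later, even in a separate run of the program.

```python
import shelve

# Keys must be strings; values can be anything picklable.
with shelve.open('demo.shelf', 'c') as data:
    data['42'] = {'name': 'example', 'values': [1, 2, 3]}

# Reopening later (or in another run) sees the same contents.
with shelve.open('demo.shelf', 'r') as data:
    record = data['42']

print(record['values'])
```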

Code like this will fetch data for a collection of keys, requesting only the keys that haven't been fetched before.

import shelve
datafile = "project_data.shelf"

with shelve.open(datafile, 'c') as data:  # 'c': open, creating the file if needed
    for k in keys_to_fetch:
        key = str(k)  # shelve keys must be strings
        if key not in data:
            print(f'Requesting {k}...')
            data[key] = my_api_call(k)
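If the API can fail partway through, it's worth wrapping the call in a try/except so one bad request doesn't stop the run: everything stored so far stays in the shelf, and a later run retries only the missing keys. A sketch of that idea (the function name `fetch_with_retry` and the injected `my_api_call` parameter are hypothetical, not part of the notes above):

```python
import shelve

datafile = "project_data.shelf"

def fetch_with_retry(keys_to_fetch, my_api_call):
    """Fetch any keys not already in the shelf; on failure, skip and move on."""
    with shelve.open(datafile, 'c') as data:
        for k in keys_to_fetch:
            key = str(k)
            if key in data:
                continue  # already fetched on an earlier run
            try:
                data[key] = my_api_call(k)
            except Exception as exc:
                # Leave the key missing so a later run retries it.
                print(f'Failed on {k}: {exc}')
```

Re-running the same function after a partial failure resumes where it stopped, since completed keys are already persisted.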

Then code like this can get the data into a more usable format (and is fast to re-run if you miss something).

import pandas as pd
import shelve
datafile = "project_data.shelf"

thing1 = []
thing2 = []
with shelve.open(datafile, 'r') as data:  # 'r': read-only; the file must exist
    for v in data.values():
        thing1.append(v['thing1'])
        thing2.append(v['thing2'])

df = pd.DataFrame({'thing1': thing1, 'thing2': thing2})
df.to_json('data.ndjson', orient='records', lines=True)
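The resulting file has one JSON record per line, so it can be read back into a DataFrame with `pd.read_json` using the same `lines=True` option. A quick round-trip sketch with toy data (the column contents here are made up):

```python
import pandas as pd

# Toy frame standing in for the real scraped data.
df = pd.DataFrame({'thing1': [1, 2], 'thing2': ['a', 'b']})
df.to_json('data.ndjson', orient='records', lines=True)

# Each line of data.ndjson is one JSON object; read them back the same way.
df2 = pd.read_json('data.ndjson', orient='records', lines=True)
print(df2)
```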
Updated Wed July 03 2024, 18:02 by ggbaker.