External Libraries & Spark

If you need third-party (pip-installed) libraries on the cluster, it's possible, but there's a little work to be done. Your home directory is shared across all of the cluster nodes, but the Spark executors run as a different user than your own.

Make sure everything is installed in your home directory. You can do that by installing the modules you need like this:

pip3 install --user --force-reinstall --ignore-installed pygpx
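To confirm that a module really ended up under your home directory, something like the following sketch may help. It compares where Python finds the module against the per-user site-packages directory. The helper name `installed_in_user_site` is made up for this example, and it assumes the package imports under the name `pygpx`.

```python
import importlib.util
import os
import site


def installed_in_user_site(module_name):
    """Return True if module_name is found under the per-user site-packages."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or not spec.origin:
        # Module not found, or it has no file on disk to compare against.
        return False
    user_site = os.path.realpath(site.getusersitepackages())
    origin = os.path.realpath(spec.origin)
    return origin.startswith(user_site + os.sep)


# Example use (module name is an assumption):
#     print(installed_in_user_site('pygpx'))
```

If this reports False for a module you installed, it was probably picked up from a system-wide location instead, which is what the --force-reinstall and --ignore-installed options in the command above are meant to avoid.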

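Then make sure those directories are accessible to the executor processes (which aren't running as your userid). Mode 0711 lets other users pass through a directory without listing it; 0755 also lets them read its contents: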

chmod 0711 ~ ~/.local ~/.local/lib
chmod 0755 ~/.local/lib/python3.10 ~/.local/lib/python3.10/site-packages
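If imports still fail on the executors, it can be worth checking that the permission bits are what you expect. The following is a minimal sketch using only the standard library. The function names are invented for this example, and it only looks at the "other users" bits, on the assumption that the executor user is neither you nor in your group.

```python
import os
import stat


def missing_access(path, need_read):
    """Return True if other users lack the needed bits on path.

    Execute (traverse) is always required; read is required only when
    need_read is True. A path that does not exist counts as a problem.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return True
    needed = stat.S_IXOTH | (stat.S_IROTH if need_read else 0)
    return (mode & needed) != needed


def check_dirs(traverse_only, readable):
    """Return the list of paths whose permissions look too restrictive."""
    problems = [p for p in traverse_only if missing_access(p, need_read=False)]
    problems += [p for p in readable if missing_access(p, need_read=True)]
    return problems


# Example use, mirroring the chmod commands above:
#     home = os.path.expanduser('~')
#     lib = os.path.join(home, '.local', 'lib')
#     py = os.path.join(lib, 'python3.10')
#     print(check_dirs([home, os.path.join(home, '.local'), lib],
#                      [py, os.path.join(py, 'site-packages')]))
```

An empty list suggests the directories are open enough; any path it returns is a candidate for another chmod.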

NLTK Data Files

If you're using NLTK, the above should be enough to get the module loaded, but the data files it needs are more work. First, download the ones you need, which can be done with a command like:

python3 -m nltk.downloader -d /home/youruserid/nltk_data large_grammars
chmod 0755 /home/youruserid/nltk_data

Then in your code, make sure NLTK searches within that directory. This must be done in your UDF so it happens on the executor.

nltk.data.path.append('/home/youruserid/nltk_data')
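As a rough illustration of doing this inside a UDF, here is a sketch that scores text with NLTK's VADER sentiment analyzer. It assumes the vader_lexicon data has been downloaded into the same nltk_data directory, and that this code lives in your main script so Spark can ship the functions to the executors. The names `add_data_dir`, `sentiment`, and `make_sentiment_udf` are invented for the example.

```python
NLTK_DATA_DIR = '/home/youruserid/nltk_data'  # placeholder: use your own userid


def add_data_dir(search_path, data_dir=NLTK_DATA_DIR):
    """Append data_dir to an NLTK search path list, but only once."""
    if data_dir not in search_path:
        search_path.append(data_dir)
    return search_path


def sentiment(text):
    """Runs on the executor, so the import and path setup happen there."""
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    add_data_dir(nltk.data.path)
    # Building the analyzer for every row is simple but slow; caching it
    # is a reasonable optimization once this works.
    sid = SentimentIntensityAnalyzer()
    return float(sid.polarity_scores(text)['compound'])


def make_sentiment_udf():
    """Wrap sentiment() as a Spark UDF (imports pyspark only when called)."""
    from pyspark.sql import functions, types
    return functions.udf(sentiment, returnType=types.FloatType())


# Example use, with a hypothetical DataFrame column named 'text':
#     scored = df.withColumn('score', make_sentiment_udf()(df['text']))
```

The point of the structure is that nothing touching NLTK runs at module level on the driver: the path is appended in the same process that loads the data.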

If you have other files that it needs (or ZIP files that it can't uncompress on the fly), you can generally override the default location for the data while creating your objects, something like:

sid = SentimentIntensityAnalyzer(lexicon_file='vader_lexicon.txt')

Updated Thu Aug. 22 2024, 11:06 by ggbaker.
