Libraries on the Cluster
If you need third-party (pip-installed) libraries on the cluster, it's possible, but there's a little work to be done. Your home directory is shared across all of the cluster nodes, but the Spark executors run as a different user than your own.
Make sure everything is installed in your home directory. You can do that by installing the modules you need like this:
pip3 install --user --force-reinstall --ignore-installed pygpx
Make sure the install directory (~/.local, where pip's --user option puts things) is readable by the executor processes, which aren't running as your userid:
chmod 0711 ~ ~/.local ~/.local/lib
chmod 0755 ~/.local/lib/python3.8 ~/.local/lib/python3.8/site-packages
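If a job still can't import your library, it may help to confirm that the permission bits actually took effect. The following is only a minimal sketch: the function names are made up for illustration, and it checks only the "other" permission bits on directories you list, not anything about how the cluster itself is configured.

```python
import os
import stat


def others_can_traverse(mode):
    """True if the 'other' execute bit is set (what chmod 0711 grants)."""
    return bool(mode & stat.S_IXOTH)


def others_can_list(mode):
    """True if 'other' has both read and execute bits (what chmod 0755 grants)."""
    return bool(mode & stat.S_IROTH) and bool(mode & stat.S_IXOTH)


def blocked_paths(paths):
    """Return the existing directories in paths that other users cannot traverse."""
    blocked = []
    for p in paths:
        full = os.path.expanduser(p)
        if os.path.isdir(full) and not others_can_traverse(os.stat(full).st_mode):
            blocked.append(p)
    return blocked


# An empty list means every existing directory listed is traversable by others.
print(blocked_paths(['~', '~/.local', '~/.local/lib']))
```

The split between the two helpers mirrors the two chmod commands: the upper directories only need to be passed through, while the site-packages directories also need to be listed.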
The module load 353 command that you're running sets some environment variables that make sure your Spark jobs find these libraries. Just for the record, what it specifically does is:
NLTK Data Files
If you're using NLTK, the above should be enough to get the module loaded, but the data files it needs are more work. First, download the ones you need, which can be done with a command like:
python3 -m nltk.downloader -d /home/youruserid/nltk_data large_grammars
chmod 0755 /home/youruserid/nltk_data
Then in your code, make sure NLTK searches within that directory. This must be done in your UDF so it happens on the executor.
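One way this could look is sketched below. The names ensure_nltk_path, atis_production_count, and make_udf are hypothetical, and the resource name grammars/large_grammars/atis.cfg is an assumption about what the large_grammars package contains; substitute whatever resource your code actually needs. The nltk import sits inside the function so that both the import and the path change happen on the executor, not on the driver.

```python
# Same directory that was passed to nltk.downloader with -d.
NLTK_DATA_DIR = '/home/youruserid/nltk_data'


def ensure_nltk_path(search_path, data_dir=NLTK_DATA_DIR):
    """Append data_dir to an NLTK search-path list if it isn't already there."""
    if data_dir not in search_path:
        search_path.append(data_dir)
    return search_path


def atis_production_count(_text):
    """Hypothetical UDF body: load a grammar on the executor and use it."""
    import nltk  # imported here so it runs in the executor process
    ensure_nltk_path(nltk.data.path)
    # Assumed resource name; replace with the data your code needs.
    grammar = nltk.data.load('grammars/large_grammars/atis.cfg')
    return len(grammar.productions())


def make_udf():
    """Wrap the function as a Spark UDF (pyspark imported lazily)."""
    from pyspark.sql import functions, types
    return functions.udf(atis_production_count, returnType=types.IntegerType())
```

The path is checked before appending because the same Python worker process may run the UDF many times, and there's no reason to grow the search path on every call.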
If you have other files that it needs (or ZIP files that it can't uncompress on the fly), you can generally override the default location for the data while creating your objects, something like:
sid = SentimentIntensityAnalyzer(lexicon_file='vader_lexicon.txt')