Running Spark Jobs Locally
It's generally much easier to test your code locally (on a smaller data set, one assumes) before uploading it to the cluster. Fortunately, Spark makes that easy.
Local Spark Jobs: your computer (Linux, OSX)
This assumes a Linux-like environment. I believe these instructions work more-or-less the same on OSX.
From the Spark download page, get Spark version 3.4.0 (which is the version we'll be using on the cluster), “Pre-built for Apache Hadoop 3.3 and later”, and click the “Download Spark” link. Unpack that somewhere you like. Then set a couple of environment variables so things start correctly. (This must be done each time you log in or create a new terminal.)
export PYSPARK_PYTHON=python3
export PATH=${PATH}:/home/you/spark-3.4.0-bin-hadoop3/bin
Then you can start the pyspark shell or a standalone job:
pyspark
spark-submit sparkcode.py
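As a concrete example, here is a minimal sketch of what a standalone job like sparkcode.py might contain. The word-count logic, app name, and command-line arguments are placeholders for illustration, not part of any assignment:

import sys
from pyspark.sql import SparkSession, functions

def main(input_path, output_path):
    # Read a directory of text files: one row per line, in a column named 'value'.
    lines = spark.read.text(input_path)
    # Split each line into words and count how many times each word occurs.
    words = lines.select(functions.explode(functions.split(lines['value'], ' ')).alias('word'))
    counts = words.groupBy('word').count()
    counts.write.csv(output_path, mode='overwrite')

if __name__ == '__main__':
    spark = SparkSession.builder.appName('word count example').getOrCreate()
    main(sys.argv[1], sys.argv[2])

You would run this locally with something like spark-submit sparkcode.py some-input-directory output-directory (again, the argument names are only for the example).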
While the job is running, you can access the web frontend at http://localhost:4040/.
If you're using the pyspark shell and want the IPython REPL instead of the plain Python REPL, you can set this environment variable:
export PYSPARK_DRIVER_PYTHON=ipython3
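For reference, the pyspark shell (plain or IPython) starts with a SparkSession already created as spark and a SparkContext as sc, so you can experiment interactively. A tiny example of the kind of thing you might type (the specific calls are just an illustration):

# Inside the pyspark shell; `spark` is already defined for you.
df = spark.range(10)                      # DataFrame with a single 'id' column, values 0..9
df.filter(df['id'] % 2 == 0).show()       # keep the even values and print them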
Local Spark Jobs: OSX
See How to Install PySpark on Mac, which seems to be a good set of instructions.
Local Spark Jobs: your computer with pip
In theory, Spark can be pip-installed:
pip3 install --user pyspark
… and then use the pyspark and spark-submit commands as described above.
I haven't had good luck with pip + pyspark in the past, but they may have updated their installer on the Spark side. Feedback appreciated.
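If you want a quick sanity check of a pip-installed PySpark, one option is to start a local-only session from plain Python. This is my own suggested test, not part of the official install instructions, and the file name is arbitrary:

# pip_check.py: start a local-only SparkSession and print its version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('pip check').getOrCreate()
print(spark.version)
spark.stop()

If python3 pip_check.py prints a version number, the pip-installed PySpark found a working Java runtime and can start a local session.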
Local Spark Jobs: CSIL Linux
Spark is installed on the CSIL Linux workstations (to run in local-only mode). You need to add the Spark installation to your PATH and specify that you're using Python 3 (and, if the machine's default Java isn't a compatible version, point JAVA_HOME at a suitable Java runtime); after that, the standard commands should work:
export PATH=/usr/shared/CMPT/big-data/spark-3.3.0-bin-hadoop3/bin/:${PATH}
export PYSPARK_PYTHON=python3
pyspark
spark-submit sparkcode.py
While the job is running, you can access the web frontend at http://localhost:4040/.
If you're using the pyspark shell and want the IPython REPL instead of the plain Python REPL, you can set this environment variable:
export PYSPARK_DRIVER_PYTHON=ipython3