Running Spark
Spark Local: your computer with pip
In theory, you should just be able to:
pip install pyspark
You may have to log out and back in, so that the directory where pip installs its programs ends up on your PATH.
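If the install worked, a quick smoke test like this should print the Spark version (the file name check_spark.py and the app name are just examples, not anything required):
# check_spark.py: minimal check that a pip-installed PySpark works
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('check').getOrCreate()
print(spark.version)
spark.stop()
Run it with python3 check_spark.py; the pip install also puts the pyspark and spark-submit commands on your path.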
Spark Local: your computer downloaded distro
From the Spark download page, get Spark version 3.3.0 (the version we'll be using on the cluster), “Pre-built for Hadoop 3.2 and later”, and click the “download Spark” link. Unpack it somewhere you like. Set some environment variables so you can find it easily later:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME=/home/you/spark-3.3.0-bin-hadoop3/
export PYSPARK_PYTHON=python3
If you have IPython installed, you can also
export PYSPARK_DRIVER_PYTHON=ipython
to use it in the pyspark shell.
Then you can start the pyspark shell or a standalone job like this:
${SPARK_HOME}/bin/pyspark
${SPARK_HOME}/bin/spark-submit sparkcode.py
While the job is running, you can access the web frontend at http://localhost:4040/.
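The sparkcode.py in the command above is just an ordinary Python file that creates a SparkSession; as a minimal sketch (the computation here is purely illustrative):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('example').getOrCreate()
# count the multiples of 7 below one million
df = spark.range(1000000)
print(df.filter(df['id'] % 7 == 0).count())
spark.stop()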
Spark Local: lab computer
Spark is already available on the machines. Set some environment variables so you can get to it easily later:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME=/usr/shared/CMPT/big-data/spark-3.3.0-bin-hadoop3
export PYSPARK_PYTHON=python3
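If you don't want to re-type these every time you log in, one option (assuming your shell is Bash) is to append the same lines to your ~/.bashrc:
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
echo 'export SPARK_HOME=/usr/shared/CMPT/big-data/spark-3.3.0-bin-hadoop3' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc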
Then you can start the pyspark shell or a standalone job like this:
${SPARK_HOME}/bin/pyspark
${SPARK_HOME}/bin/spark-submit sparkcode.py
While the job is running, you can access the web frontend at http://localhost:4040/.
Cluster
Spark will be in your path and you can get started:
pyspark
spark-submit sparkcode.py
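If your job needs more than the default resources, spark-submit takes the usual resource options when running on YARN; for example (the numbers are only illustrative, not recommended settings):
spark-submit --num-executors 4 --executor-memory 4g --executor-cores 2 sparkcode.py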
Monitoring Jobs
In the YARN web front end (http://localhost:8088 if you have your ports forwarded as in the Cluster instructions), you can click your app while it's running, then the “ApplicationMaster” link.
If you're on campus, the link will work. If not, you can replace “controller.local” with “localhost” in the URL and it should load. (Or if you really want, in your OS's /etc/hosts file, add
127.0.0.1 controller.local
and the links will work.)
After the job has finished, you can also use the yarn logs command to get the stdout and stderr from your jobs, as described in the Cluster instructions.
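For example, using the application ID that YARN assigned when the job was submitted (the ID below is a made-up placeholder):
yarn logs -applicationId application_1234567890123_0001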
Spark and PyPy
PyPy is a Python implementation with a just-in-time compiler that can be astonishingly fast. It can be used with Spark to speed up the Python side of your jobs. (In PySpark, the work is split between the Scala/JVM implementation of the core Spark logic, and the Python process that runs your logic and parts of the PySpark API.)
In general for Spark, you need to set the PYSPARK_PYTHON variable to the command that starts the pypy executable.
On the cluster, you can do this:
module load spark-pypy
This sets the PYSPARK_PYTHON variable to point to PyPy, and sets SPARK_YARN_USER_ENV so that PYSPARK_PYTHON is also set on the executors.
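If you're setting this up by hand rather than with the module, the equivalent would look roughly like this (the PyPy path is a placeholder for wherever pypy3 is installed):
export PYSPARK_PYTHON=/path/to/pypy3
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/path/to/pypy3"
spark-submit sparkcode.py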