# Running Spark

From the Spark download page, get Spark version 3.3.0 (the version we'll be using on the cluster), “Pre-built for Hadoop 3.2 and later”, and click the “download Spark” link. Unpack it wherever you like. Set SPARK_HOME to the directory where you unpacked Spark so you can find it easily later, along with these environment variables so Spark can find Java and Python 3:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PYSPARK_PYTHON=python3


If you have IPython installed, you can also export PYSPARK_DRIVER_PYTHON=ipython to use it in the pyspark shell.
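If a command later fails with a confusing error, it can help to confirm the variables are actually set in your current shell. The following is a small sketch of such a check, not part of Spark itself: the function name `missing_settings` and the list of variables it looks for are assumptions based on the variables mentioned on this page.

```python
import os

# Variables this page relies on. The list is an assumption for illustration.
REQUIRED = ('SPARK_HOME', 'JAVA_HOME', 'PYSPARK_PYTHON')


def missing_settings(env=None):
    """Return the names from REQUIRED that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == '__main__':
    missing = missing_settings()
    if missing:
        print('not set: ' + ', '.join(missing))
    else:
        print('all set')
```

Passing a dictionary instead of relying on os.environ makes the check easy to try out without changing your real environment.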

Then you can start the pyspark shell or a standalone job like this:

${SPARK_HOME}/bin/pyspark
${SPARK_HOME}/bin/spark-submit sparkcode.py


While the job is running, you can access the web frontend at http://localhost:4040/.
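If you don't have a sparkcode.py yet and want something to test your setup with, here is a minimal word count sketch. It is only an illustration: the function names `words_once` and `main`, the application name, and the two command line arguments (an input path and an output path) are all assumptions, and your own sparkcode.py can look quite different.

```python
import sys


def words_once(line):
    """Split one line of text into (word, 1) pairs, lowercasing each word."""
    return [(w.lower(), 1) for w in line.split()]


def main(inputs, output):
    # pyspark is imported here so words_once can be tried without Spark.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName('word count sketch')
    sc = SparkContext(conf=conf)

    # Read lines, emit (word, 1) pairs, and sum the counts for each word.
    counts = sc.textFile(inputs).flatMap(words_once).reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(output)


if __name__ == '__main__':
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
    else:
        print('usage: spark-submit sparkcode.py <input> <output>', file=sys.stderr)
```

With that file, the job could be started as ${SPARK_HOME}/bin/spark-submit sparkcode.py input-dir output-dir, where input-dir and output-dir are placeholders for your own paths. The output directory must not already exist.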

## Spark Local: lab computer

Spark is already available on the lab machines. Set these environment variables so you can get to it easily later (the commands below also assume SPARK_HOME points to the Spark installation):

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PYSPARK_PYTHON=python3


Then you can start the pyspark shell or a standalone job like this:

${SPARK_HOME}/bin/pyspark
${SPARK_HOME}/bin/spark-submit sparkcode.py


While the job is running, you can access the web frontend at http://localhost:4040/.

## Cluster

Spark will be in your path and you can get started:

pyspark
spark-submit sparkcode.py


### Monitoring Jobs

In the YARN web frontend (http://localhost:8088 if you have your ports forwarded as in the Cluster instructions), you can click your application while it's running, then follow the “ApplicationMaster” link.

If you're on campus, the link will work as-is. If not, you can replace “controller.local” with “localhost” in the URL and it should load. (Or if you really want, add the line 127.0.0.1 controller.local to your OS's /etc/hosts file and the links will work unchanged.)
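If you find yourself editing these URLs often, the replacement can be scripted. This is a rough sketch using only the standard library; the function name `use_localhost` is made up here, and the URL in the usage note is an invented example of what such a link might look like.

```python
from urllib.parse import urlsplit, urlunsplit


def use_localhost(url, cluster_host='controller.local'):
    """Swap the cluster hostname for localhost, keeping port, path and query."""
    parts = urlsplit(url)
    if parts.hostname != cluster_host:
        # Not a link to the cluster controller: leave it untouched.
        return url
    # Rebuild the host part. Any user:password@ prefix would be dropped.
    netloc = 'localhost' if parts.port is None else f'localhost:{parts.port}'
    return urlunsplit(parts._replace(netloc=netloc))
```

For example, use_localhost('http://controller.local:8088/proxy/application_1_0001/') gives 'http://localhost:8088/proxy/application_1_0001/'. This only helps if the port in question is actually forwarded to your machine.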

After the job has finished, you can also use the yarn logs command to get the stdout and stderr from your jobs, as described in the Cluster instructions.
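As one possible way to collect that output from a script, the sketch below wraps the yarn logs command with Python's subprocess module. The helper names `yarn_logs_command` and `fetch_logs` are hypothetical, the application ID is whatever YARN assigned to your job, and the Cluster instructions remain the reference for how the command is meant to be used.

```python
import subprocess


def yarn_logs_command(application_id):
    """Build the argument list for fetching the logs of one application."""
    return ['yarn', 'logs', '-applicationId', application_id]


def fetch_logs(application_id):
    """Run yarn logs on the cluster and return everything it printed."""
    result = subprocess.run(
        yarn_logs_command(application_id),
        capture_output=True,  # collect the output instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError if yarn fails
    )
    return result.stdout
```

Calling fetch_logs only makes sense where the yarn command is available, which in this setup means on the cluster.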

## Spark and PyPy

PyPy is a Python implementation with a just-in-time compiler, and it can be astonishingly fast. It can be used with Spark to speed up the Python code execution. (In PySpark, the work is split between the Scala/JVM implementation of Spark's core logic and Python, which runs your own logic and parts of the PySpark API.)

In general, to use PyPy with Spark you need to set the PYSPARK_PYTHON variable to the command that starts the PyPy executable.
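As an illustration of that idea, the sketch below builds an environment with PYSPARK_PYTHON pointing at PyPy, and includes a small function for checking which interpreter is actually running. The names `with_pypy` and `interpreter_name` are made up for this example, and `pypy3` is an assumed command name; use whatever starts PyPy on your system.

```python
import os
import platform


def with_pypy(env=None, pypy_command='pypy3'):
    """Return a copy of env with PYSPARK_PYTHON set to the PyPy command."""
    new_env = dict(os.environ if env is None else env)
    new_env['PYSPARK_PYTHON'] = pypy_command  # 'pypy3' is an assumption
    return new_env


def interpreter_name():
    """Name of the running Python implementation, e.g. 'CPython' or 'PyPy'."""
    return platform.python_implementation()
```

The dictionary from with_pypy() could be passed as the env argument of subprocess.run when launching spark-submit, and printing interpreter_name() from inside a job is a quick way to confirm that the Python side really is running on PyPy.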

On the cluster, you can do this:

module load spark-pypy


This sets PYSPARK_PYTHON to point to PyPy, and sets SPARK_YARN_USER_ENV so that PYSPARK_PYTHON is set on the executors as well.

Updated Wed Sept. 28 2022, 10:59 by ggbaker.