Running Spark
Spark Local: your computer with pip
In theory, you should just be able to:
pip install pyspark
You may have to log out and back in, so that the directory where pip installs its programs ends up on your PATH.
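If the install worked, a quick smoke test like this should print the Spark version (the file name check_spark.py and the app name are just examples, not anything required):
# check_spark.py: minimal check that a pip-installed PySpark works
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('check').getOrCreate()
print(spark.version)
spark.stop()
Run it with python3 check_spark.py; the pip install also puts the pyspark and spark-submit commands on your path.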
Spark Local: your computer downloaded distro
From the Spark download page, get Spark version 3.3.0 (the version we'll be using on the cluster), “Pre-built for Hadoop 3.2 and later”, and click the “download Spark” link. Unpack it somewhere you like. Set some environment variables so you can find it easily later:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME=/home/you/spark-3.3.0-bin-hadoop3/
export PYSPARK_PYTHON=python3
If you have IPython installed, you can also
export PYSPARK_DRIVER_PYTHON=ipython
to use it in the pyspark shell.
Then you can start the pyspark shell or a standalone job like this:
${SPARK_HOME}/bin/pyspark
${SPARK_HOME}/bin/spark-submit sparkcode.py
While the job is running, you can access the web frontend at http://localhost:4040/.
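The sparkcode.py in the command above is just an ordinary Python file that creates a SparkSession; as a minimal sketch (the computation here is purely illustrative):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('example').getOrCreate()
# count the multiples of 7 below one million
df = spark.range(1000000)
print(df.filter(df['id'] % 7 == 0).count())
spark.stop()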
Spark Local: lab computer
Spark is already available on the machines. Set some environment variables so you can get to it easily later:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME=/usr/shared/CMPT/big-data/spark-3.3.0-bin-hadoop3
export PYSPARK_PYTHON=python3
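If you don't want to re-type these every time you log in, one option (assuming your shell is Bash) is to append the same lines to your ~/.bashrc:
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
echo 'export SPARK_HOME=/usr/shared/CMPT/big-data/spark-3.3.0-bin-hadoop3' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc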
Then you can start the pyspark shell or a standalone job like this:
${SPARK_HOME}/bin/pyspark
${SPARK_HOME}/bin/spark-submit sparkcode.py
While the job is running, you can access the web frontend at http://localhost:4040/.
Cluster
Spark will be in your path and you can get started:
pyspark
spark-submit sparkcode.py
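If your job needs more than the default resources, spark-submit takes the usual resource options when running on YARN; for example (the numbers are only illustrative, not recommended settings):
spark-submit --num-executors 4 --executor-memory 4g --executor-cores 2 sparkcode.py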
Monitoring Jobs
In the YARN web front end (http://localhost:8088 if you have your ports forwarded as in the Cluster instructions), you can click your app while it's running, then the “ApplicationMaster” link.
If you're on campus, the link will work. If not, you can replace “controller.local” with “localhost” in the URL and it should load. (Or if you really want, in your OS's /etc/hosts file, add
127.0.0.1 controller.local
and the links will work.)
After the job has finished, you can also use the yarn logs command to get the stdout and stderr from your jobs, as described in the Cluster instructions.
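For example, using the application ID that YARN assigned when the job was submitted (the ID below is a made-up placeholder):
yarn logs -applicationId application_1234567890123_0001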
Spark and PyPy
PyPy is a Python implementation with a just-in-time compiler that can be astonishingly fast. It can be used with Spark to speed up the Python side of your jobs. (In PySpark, the work is split between the Scala/JVM implementation of the core Spark logic, and the Python process that runs your logic and parts of the PySpark API.)
In general for Spark, you need to set the PYSPARK_PYTHON variable to the command that starts the pypy executable.
On the cluster, you can do this:
module load spark-pypy
This sets the PYSPARK_PYTHON variable to point to PyPy, and sets SPARK_YARN_USER_ENV so that PYSPARK_PYTHON is also set on the executors.
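If you're setting this up by hand rather than with the module, the equivalent would look roughly like this (the PyPy path is a placeholder for wherever pypy3 is installed):
export PYSPARK_PYTHON=/path/to/pypy3
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/path/to/pypy3"
spark-submit sparkcode.py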