Spark Code Skeleton
This is my general template for a Spark job. It makes sure we have reasonable Python and Spark versions (i.e. that the module load has been done on the cluster), and defines a main function that we can easily short-circuit return from if we want to inspect what's going on mid-program.
import sys
assert sys.version_info >= (3, 5)  # make sure we have Python 3.5+

from pyspark.sql import SparkSession, functions, types
# add more functions as necessary


def main(inputs, output):
    # main logic starts here
    pass  # placeholder: a comment alone is not a valid function body


if __name__ == '__main__':
    inputs = sys.argv[1]
    output = sys.argv[2]
    spark = SparkSession.builder.appName('example code').getOrCreate()
    assert spark.version >= '3.1'  # make sure we have Spark 3.1+
    spark.sparkContext.setLogLevel('WARN')
    # sc = spark.sparkContext  # uncomment if the RDD API is needed
    main(inputs, output)
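
Since the script takes its input and output paths from sys.argv, it is run with spark-submit; the script name and paths below are placeholders, not fixed by the template:

spark-submit example_code.py /path/to/input /path/to/output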
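
As a sketch of how main might be filled in, here is a minimal example of the short-circuit inspection pattern mentioned above. The CSV input format, the 'category' column, and the aggregation are all made up for illustration; they are not part of the template:

def main(inputs, output):
    # Hypothetical logic: input format and column names are assumptions.
    # 'spark' is the global SparkSession created in the __main__ block.
    data = spark.read.csv(inputs, header=True, inferSchema=True)
    counts = data.groupBy('category').agg(functions.count('*').alias('n'))
    # To inspect what's going on mid-program, short-circuit here:
    #     counts.show()
    #     return
    counts.write.csv(output, mode='overwrite')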