Not logged in. Login

Spark Code Skeleton

This is my general template for a Spark job. It makes sure we have reasonable Python and Spark versions (i.e. have done the module load on the cluster), and creates a main function that we can easily short-circuit return from if we want to inspect what's going on mid-program.

import sys
assert sys.version_info >= (3, 8) # make sure we have Python 3.8+
from pyspark.sql import SparkSession, functions, types

# add more functions as necessary

def main(inputs, output):
    # main logic starts here

if __name__ == '__main__':
    inputs = sys.argv[1]
    output = sys.argv[2]
    spark = SparkSession.builder.appName('example code').getOrCreate()
    assert spark.version >= '3.4' # make sure we have Spark 3.4+
    spark.sparkContext.setLogLevel('WARN')
    #sc = spark.sparkContext

    main(inputs, output)
Updated Fri Oct. 20 2023, 09:35 by ggbaker.