Spark Code Skeleton
This is my general template for a Spark job. It makes sure we have reasonable Python and Spark versions (i.e. that the appropriate module load has been done on the cluster), and creates a main function that we can easily short-circuit with an early return if we want to inspect what's going on mid-program.
import sys
assert sys.version_info >= (3, 5)  # make sure we have Python 3.5+

from pyspark.sql import SparkSession, functions, types

# add more functions as necessary

def main(inputs, output):
    # main logic starts here
    pass

if __name__ == '__main__':
    inputs = sys.argv[1]
    output = sys.argv[2]
    spark = SparkSession.builder.appName('example code').getOrCreate()
    assert spark.version >= '3.1'  # make sure we have Spark 3.1+
    spark.sparkContext.setLogLevel('WARN')
    #sc = spark.sparkContext
    main(inputs, output)
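The short-circuit trick mentioned above can be sketched without Spark at all: an early return partway through main hands back an intermediate result and skips everything after it. The helper names load and clean here are hypothetical placeholders standing in for spark.read and DataFrame transformations, not part of the template itself.

```python
def load(inputs):
    # stands in for something like spark.read.csv(inputs, ...)
    return [1, 2, 3, 4, 5]

def clean(data):
    # stands in for a DataFrame filter/transformation
    return [x for x in data if x % 2 == 0]

def main(inputs, output):
    data = load(inputs)
    cleaned = clean(data)
    return cleaned  # short-circuit: inspect the intermediate result
    # Everything below is skipped while the early return is in place:
    result = sum(cleaned)
    print(result, output)
```

Once the intermediate result looks right, delete the early return and the rest of the job runs as usual.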