
Spark + S3

You can give your S3 credentials either in code, by setting the fs.s3a.access.key and fs.s3a.secret.key config values as in the example below, or by setting environment variables.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('fs.s3a.access.key', 'your-s3-access-key') \
    .config('fs.s3a.secret.key', 'your-s3-secret-key') \
    .config('fs.s3a.endpoint', 'http://s3-us-west-2.amazonaws.com') \
    .getOrCreate()  # plus whatever other builder options (e.g. .appName) you normally use

inputs = 's3a://your-bucket-name/input-data/'
output = 's3a://your-bucket-name/output/'

df = spark.read.csv(inputs, schema=the_schema)
df.write.csv(output)
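
The the_schema above stands in for whatever schema matches your input files. A minimal sketch, assuming a made-up two-column CSV (the field names and types here are placeholders; adjust them to your data):

from pyspark.sql import types

# hypothetical schema: replace the column names/types with your own
the_schema = types.StructType([
    types.StructField('host', types.StringType()),
    types.StructField('bytes', types.LongType()),
])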

If you would rather specify credentials in environment variables, you can leave out the access/secret key .config settings above and set:

export AWS_ACCESS_KEY_ID="your-s3-access-key"
export AWS_SECRET_ACCESS_KEY="your-s3-secret-key"
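
With the keys in the environment, the credential .config calls can be dropped: the S3A connector's default credential chain reads AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY, assuming the variables are visible to the process running your driver. A minimal sketch of the builder under that assumption (endpoint still set explicitly):

from pyspark.sql import SparkSession

# credentials come from the environment variables above; only the endpoint is configured here
spark = SparkSession.builder \
    .config('fs.s3a.endpoint', 'http://s3-us-west-2.amazonaws.com') \
    .getOrCreate()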

Then when running your Spark job, include the relevant jars from the Hadoop distribution. On our cluster, that means (version numbers may drift as we update):

spark-submit --jars /opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.901.jar,/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.3.1.jar the_code.py

If you'd like to run locally, you can find the relevant jars (and their versions) in the share/hadoop/tools/lib directory of the Hadoop distribution.
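
Alternatively, for a local run, spark-submit can fetch the S3A connector from Maven instead of pointing at jar files on disk. A sketch, assuming the hadoop-aws version matches the Hadoop version your Spark build uses (the matching AWS SDK bundle is pulled in as a transitive dependency):

spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 the_code.py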
