Spark + S3
You can either give your S3 credentials in code, by setting the fs.s3a.access.key and fs.s3a.secret.key config values as in this example, or set them in environment variables.
from pyspark.sql import SparkSession

# credentials and endpoint for the s3a:// filesystem
spark = SparkSession.builder.… \
    .config('fs.s3a.access.key', 'your-s3-access-key') \
    .config('fs.s3a.secret.key', 'your-s3-secret-key') \
    .config('fs.s3a.endpoint', 'http://s3-us-west-2.amazonaws.com') \
    .getOrCreate()

# read and write S3 paths like any other Spark input/output
inputs = 's3a://your-bucket-name/input-data/'
output = 's3a://your-bucket-name/output/'
df = spark.read.csv(inputs, schema=the_schema)
df.write.csv(output)
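The the_schema object is whatever StructType matches your input files. A minimal sketch (the column names and types here are only placeholders, not part of the original example):

from pyspark.sql import types

the_schema = types.StructType([
    types.StructField('id', types.IntegerType()),
    types.StructField('value', types.StringType()),
])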
If you would like to specify the credentials in environment variables instead, you can leave out the .config settings above and set:
export AWS_ACCESS_KEY_ID="your-s3-access-key"
export AWS_SECRET_KEY="your-s3-secret-key"
Then, when running your Spark job, include the relevant jars from the Hadoop distribution. On our cluster, that is (with version numbers possibly drifting as we update):
spark-submit --jars /opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.901.jar,/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.3.1.jar the_code.py
If you'd like to run locally, you can find the relevant jars (and their versions) in the share/hadoop/tools/lib directory of your Hadoop distribution.
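If you don't have a local Hadoop install to borrow the jars from, one alternative sketch (assuming your machine can reach Maven Central, and that the hadoop-aws version matches the Hadoop build your PySpark uses) is to let spark-submit fetch the S3A connector, which pulls in the AWS SDK bundle as a dependency:

spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 the_code.py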