Cassandra + Spark + Python

We will use the spark-cassandra-connector to bring Spark and Cassandra together. When you run a Spark job using this library, you need to include the corresponding Spark Package:

spark-submit --packages datastax:spark-cassandra-connector:2.4.0-s_2.11 …
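
For example, if your application code were in a file called cassandra_example.py (a hypothetical name, just for illustration), the full command might look something like this:

spark-submit --packages datastax:spark-cassandra-connector:2.4.0-s_2.11 cassandra_example.py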

You need to configure the SparkSession object to connect correctly to our cluster. Create your spark variable like this:

from pyspark.sql import SparkSession

cluster_seeds = ['199.60.17.32', '199.60.17.65']
spark = SparkSession.builder.appName('Spark Cassandra example') \
    .config('spark.cassandra.connection.host', ','.join(cluster_seeds)).getOrCreate()

With this done, you should be able to read Cassandra tables into DataFrames, or write DataFrames out to Cassandra tables:

df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table=table, keyspace=keyspace).load()
df.write.format("org.apache.spark.sql.cassandra") \
    .options(table=table, keyspace=keyspace).save()
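
To put the pieces together, here is a minimal end-to-end sketch. The keyspace and table names (demo_keyspace, observations, obs_counts) and the station column are hypothetical placeholders: it assumes the keyspace and both tables have already been created (e.g. in cqlsh), and that the output table's columns match the DataFrame being written.

from pyspark.sql import SparkSession

cluster_seeds = ['199.60.17.32', '199.60.17.65']
spark = SparkSession.builder.appName('Cassandra read/write example') \
    .config('spark.cassandra.connection.host', ','.join(cluster_seeds)).getOrCreate()

# Read an existing Cassandra table into a DataFrame.
observations = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table='observations', keyspace='demo_keyspace').load()

# From here it is an ordinary DataFrame: any transformations work.
counts = observations.groupBy('station').count()

# Write the result to another (pre-created) table in the same keyspace.
counts.write.format("org.apache.spark.sql.cassandra") \
    .options(table='obs_counts', keyspace='demo_keyspace').save()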