Cassandra + Spark + Python
We will use the spark-cassandra-connector to bring Spark and Cassandra together. When you run a Spark job using this library, you need to include the corresponding Spark Package:
spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.0 --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions …
You also need to configure the SparkSession object so it connects to the cluster correctly. Create your spark variable like this:
from pyspark.sql import SparkSession

cluster_seeds = ['node1.local', 'node2.local']
spark = SparkSession.builder.appName('Spark Cassandra example') \
    .config('spark.cassandra.connection.host', ','.join(cluster_seeds)).getOrCreate()
With this done, you should be able to read DataFrames from Cassandra tables, or write DataFrames to Cassandra tables:
# table and keyspace are strings: the names of an existing table and keyspace
df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table=table, keyspace=keyspace).load()

df.write.format("org.apache.spark.sql.cassandra") \
    .options(table=table, keyspace=keyspace).save()
Updated Thu Aug. 22 2024, 11:06 by ggbaker.