spark-structured-streaming

Cassandra Sink for PySpark Structured Streaming from Kafka topic

久未见 submitted on 2021-02-04 16:34:14

Question: I want to write Structured Streaming data into Cassandra using the PySpark Structured Streaming API. My data flow is as follows: REST API -> Kafka -> Spark Structured Streaming (PySpark) -> Cassandra.

Versions:

    Spark version: 2.4.3
    DataStax DSE: 6.7.6-1

Initialize Spark:

    spark = SparkSession.builder\
        .master("local[*]")\
        .appName("Analytics")\
        .config("kafka.bootstrap.servers", "localhost:9092")\
        .config("spark.cassandra.connection.host", "localhost:9042")\
        .getOrCreate()
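A common way to land a structured stream in Cassandra with the Spark Cassandra Connector is foreachBatch (available from Spark 2.4), writing each micro-batch through the connector's batch path. The sketch below is in Scala because most snippets on this page are; the PySpark calls mirror it. The topic, keyspace, table, and checkpoint path are placeholders.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Analytics")
      .config("spark.cassandra.connection.host", "localhost")
      .getOrCreate()

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")                    // hypothetical topic
      .load()

    kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
        // Each micro-batch is a plain DataFrame, so the batch Cassandra writer works here.
        batchDf.write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "analytics", "table" -> "events")) // placeholders
          .mode("append")
          .save()
      }
      .option("checkpointLocation", "/tmp/checkpoints/cassandra-sink")
      .start()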

How to stream data from Kafka topic to Delta table using Spark Structured Streaming

核能气质少年 submitted on 2021-01-29 15:11:34

Question: I'm trying to understand Databricks Delta and want to do a POC using Kafka. Basically, the plan is to consume data from Kafka and insert it into a Databricks Delta table. These are the steps I did:

1. Create a Delta table on Databricks.

    %sql
    CREATE TABLE hazriq_delta_trial2 (
      value STRING
    )
    USING delta
    LOCATION '/delta/hazriq_delta_trial2'

2. Consume data from Kafka.

    import org.apache.spark.sql.types._
    val kafkaBrokers = "broker1:port,broker2:port,broker3:port"
    val kafkaTopic = "kafkapoc"
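A minimal sketch of the missing write step, assuming the Delta table location from the question and a checkpoint path chosen only for illustration; `kafkaBrokers` and `kafkaTopic` are the placeholders defined above.

    // `spark`, `kafkaBrokers`, and `kafkaTopic` are defined in the notebook above.
    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaBrokers)
      .option("subscribe", kafkaTopic)
      .option("startingOffsets", "earliest")
      .load()

    // Kafka values are binary; cast to STRING to match the Delta table's single column.
    kafkaDf.selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/delta/hazriq_delta_trial2/_checkpoints") // assumed path
      .start("/delta/hazriq_delta_trial2")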

Spark structured streaming - ways to lookup high volume non-static dataset?

时间秒杀一切 submitted on 2021-01-29 10:32:34

Question: I wish to build a Spark Structured Streaming job that does something like the below (look up a huge non-static dataset):

1. Read from Kafka (JSON records).
2. For each JSON record, get {user_key}.
3. Read from a huge, non-static Phoenix table, filtered by {user_key}.
4. Apply further DataFrame transformations.
5. Write to another Phoenix table.

How can I look up a huge-volume, non-static dataset per Kafka message?

Source: https://stackoverflow.com/questions/62421785/spark-structured-streaming-ways-to-lookup-high-volume-non-static-dataset
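One way to approach this, sketched under assumptions: inside foreachBatch, re-read the Phoenix table each micro-batch (so it is never treated as static) and join on user_key instead of issuing one query per record. The "org.apache.phoenix.spark" format and zkUrl option follow the Phoenix-Spark connector's documented usage but may differ in your environment; the table name, ZooKeeper quorum, and column names are hypothetical.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.get_json_object
    import spark.implicits._

    // kafkaDf is the streaming DataFrame read from the Kafka source.
    kafkaDf.selectExpr("CAST(value AS STRING) AS json")
      .select(get_json_object($"json", "$.user_key").as("user_key"), $"json")
      .writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Collect the keys seen in this micro-batch (assumed to be a manageable number).
        val keys = batch.select("user_key").as[String].distinct().collect()

        // Re-read the lookup table every micro-batch so recent Phoenix changes are visible,
        // filtering to the keys of interest so the scan stays selective.
        val lookup = spark.read
          .format("org.apache.phoenix.spark")   // assumed connector format
          .option("table", "USER_PROFILE")      // hypothetical table
          .option("zkUrl", "zk-host:2181")      // hypothetical ZooKeeper quorum
          .load()
          .where($"user_key".isin(keys: _*))

        val enriched = batch.join(lookup, Seq("user_key"), "left")
        // ...further transformations, then write `enriched` to the target Phoenix table...
      }
      .start()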

spark streaming kafka : Unknown error fetching data for topic-partition

自作多情 submitted on 2021-01-29 10:31:12

Question: I'm trying to read a Kafka topic from a Spark cluster using the Structured Streaming API with Spark's Kafka integration.

    val sparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("some-app")
      .getOrCreate()

Kafka stream creation:

    import sparkSession.implicits._
    val dataFrame = sparkSession
      .readStream
      .format("kafka")
      .option("subscribepattern", "preprod-*")
      .option("kafka.bootstrap.servers", "<brokerUrl>:9094")
      .option("kafka.ssl.protocol", "TLS")
      .option("kafka.security.protocol",
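Commonly reported causes of "Unknown error fetching data for topic-partition" include a kafka-clients version that does not match the brokers (check what spark-sql-kafka pulls in) and incomplete SSL settings. The sketch below only spells out the usual `kafka.`-prefixed SSL consumer options, which Spark passes straight to the Kafka consumer; the store paths and passwords are placeholders and not every cluster needs all of them.

    val dataFrame = sparkSession
      .readStream
      .format("kafka")
      .option("subscribepattern", "preprod-*")
      .option("kafka.bootstrap.servers", "<brokerUrl>:9094")
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/path/to/client.truststore.jks") // placeholder
      .option("kafka.ssl.truststore.password", "********")                       // placeholder
      .option("kafka.ssl.keystore.location", "/path/to/client.keystore.jks")     // placeholder
      .option("kafka.ssl.keystore.password", "********")                         // placeholder
      .option("startingOffsets", "earliest")
      .load()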

How to refer a map column in a spark-sql query?

允我心安 submitted on 2021-01-28 19:11:42

Question:

    scala> val map1 = spark.sql("select map('p1', 's1', 'p2', 's2')")
    map1: org.apache.spark.sql.DataFrame = [map(p1, s1, p2, s2): map<string,string>]

    scala> map1.show()
    +--------------------+
    | map(p1, s1, p2, s2)|
    +--------------------+
    |[p1 -> s1, p2 -> s2]|
    +--------------------+

    scala> spark.sql("select element_at(map1, 'p1')")
    org.apache.spark.sql.AnalysisException: cannot resolve '`map1`' given input columns: []; line 1 pos 18;
    'Project [unresolvedalias('element_at('map1, p1), None)]

How
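A sketch of one way to make the map addressable by name: give it an alias and register the DataFrame as a temporary view, then query it with element_at (available since Spark 2.4). The view name is arbitrary.

    // Alias the map column so SQL can refer to it, then expose the DataFrame as a view.
    val map1 = spark.sql("select map('p1', 's1', 'p2', 's2') as m")
    map1.createOrReplaceTempView("map1_view")

    // Look up a key in the aliased map column; this prints s1.
    spark.sql("select element_at(m, 'p1') from map1_view").show()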

org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame

三世轮回 submitted on 2021-01-27 14:23:40

Question: I'm trying to write a Spark Structured Streaming (2.3) Dataset to ScyllaDB (Cassandra). My code to write the Dataset:

    def saveStreamSinkProvider(ds: Dataset[InvoiceItemKafka]) = {
      ds
        .writeStream
        .format("cassandra.ScyllaSinkProvider")
        .outputMode(OutputMode.Append)
        .queryName("KafkaToCassandraStreamSinkProvider")
        .options(
          Map(
            "keyspace" -> namespace,
            "table" -> StreamProviderTableSink,
            "checkpointLocation" -> "/tmp/checkpoints"
          )
        )
        .start()
    }

My ScyllaDB streaming sinks: class
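This exception is raised whenever `.write` is called on a Dataset whose `isStreaming` flag is true, which is easy to hit inside a custom sink. If upgrading to Spark 2.4+ is an option, `foreachBatch` sidesteps the custom sink provider entirely: each micro-batch arrives as an ordinary batch Dataset that the Cassandra/Scylla connector can write. A sketch reusing the names from the snippet above:

    import org.apache.spark.sql.{Dataset, SaveMode}

    def saveWithForeachBatch(ds: Dataset[InvoiceItemKafka]) =
      ds.writeStream
        .foreachBatch { (batch: Dataset[InvoiceItemKafka], batchId: Long) =>
          // Inside foreachBatch the data is a plain batch Dataset, so .write is allowed.
          batch.write
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", namespace)
            .option("table", StreamProviderTableSink)
            .mode(SaveMode.Append)
            .save()
        }
        .option("checkpointLocation", "/tmp/checkpoints")
        .start()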

Read file path from Kafka topic and then read file and write to DeltaLake in Structured Streaming

删除回忆录丶 submitted on 2021-01-24 18:56:54

Question: I have a use case where the file paths of JSON records stored in S3 arrive as messages on a Kafka topic. I have to process the data using Spark Structured Streaming. The design I have in mind is as follows:

1. In Spark Structured Streaming, read the Kafka message containing the data path.
2. Collect the message records on the driver (the messages are small in size).
3. Create the DataFrame from the data location.

    kafkaDf.select($"value".cast(StringType))
      .writeStream.foreachBatch((batchDf: DataFrame,
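A sketch of how that foreachBatch might continue, under assumptions: the Kafka value holds one S3 path per message, the paths fit comfortably on the driver, and the Delta target path and checkpoint location are placeholders.

    import org.apache.spark.sql.DataFrame
    import spark.implicits._

    kafkaDf.selectExpr("CAST(value AS STRING) AS path")
      .writeStream
      .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
        // Collect the (small) list of file paths on the driver, then read them as one DataFrame.
        val paths = batchDf.select("path").as[String].collect()
        if (paths.nonEmpty) {
          spark.read.json(paths: _*)
            .write
            .format("delta")
            .mode("append")
            .save("/delta/events")   // hypothetical Delta table path
        }
      }
      .option("checkpointLocation", "/tmp/checkpoints/file-paths")
      .start()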

Apache Spark SQL get_json_object java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String

别说谁变了你拦得住时间么 submitted on 2021-01-07 03:38:30

Question: I am trying to read a JSON stream from an MQTT broker in Apache Spark with Structured Streaming, read some properties of the incoming JSON, and output them to the console. My code looks like this:

    val spark = SparkSession
      .builder()
      .appName("BahirStructuredStreaming")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val topic = "temp"
    val brokerUrl = "tcp://localhost:1883"
    val lines = spark.readStream
      .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
      .option(
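Without the full stack trace it is hard to pin the ClassCastException down, but the pattern usually recommended for pulling fields out of a streaming JSON string column is from_json with an explicit schema rather than get_json_object. A sketch, assuming the Bahir MQTT source exposes the payload as a string "value" column; the JSON field names below are hypothetical.

    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._

    // Hypothetical schema for the incoming JSON payload.
    val schema = new StructType()
      .add("device", StringType)
      .add("temperature", DoubleType)

    val parsed = lines
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", schema).as("data"))
      .select("data.*")

    parsed.writeStream
      .format("console")
      .outputMode("append")
      .start()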

Apache Spark’s Structured Streaming with Google PubSub

为君一笑 submitted on 2021-01-01 04:14:32

Question: I'm using Spark DStreams to pull and process data from Google Pub/Sub. I'm looking for a way to move to Structured Streaming while still using Pub/Sub. I should also mention that my messages are Snappy-compressed in Pub/Sub. I found this issue, which claims that using Pub/Sub with Structured Streaming is not supported. Has anyone encountered this problem? Is it possible to implement a custom Receiver to read the data from Pub/Sub? Thanks

Answer 1: The feature request you referenced is still accurate:
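Note that the Receiver API belongs to the DStream world; a Structured Streaming equivalent would mean writing a custom Source, since there is no built-in Pub/Sub source. As a purely illustrative sketch, the snippet below assumes the Pub/Sub messages are first relayed into a source Spark does support (Kafka here, with a hypothetical bridge topic) and only shows how the Snappy-compressed payloads could be decompressed with a UDF once they arrive as binary values.

    import org.apache.spark.sql.functions.{col, udf}
    import org.xerial.snappy.Snappy

    // Decompress a Snappy-compressed payload into a UTF-8 string.
    val decompress = udf((bytes: Array[Byte]) =>
      if (bytes == null) null else new String(Snappy.uncompress(bytes), "UTF-8"))

    val messages = spark.readStream
      .format("kafka")                                     // assumed relay target, not Pub/Sub itself
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "pubsub-bridge")                // hypothetical bridge topic
      .load()
      .select(decompress(col("value")).as("payload"))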