spark-structured-streaming

Spark Structured Streaming Multiple WriteStreams to Same Sink

岁酱吖の submitted on 2020-12-29 13:55:40
Question: Two writeStream calls to the same database sink are not executing in sequence in Spark Structured Streaming 2.2.1. Please suggest how to make them execute in sequence.

    val deleteSink = ds1.writeStream
      .outputMode("update")
      .foreach(mydbsink)
      .start()

    val UpsertSink = ds2.writeStream
      .outputMode("update")
      .foreach(mydbsink)
      .start()

    deleteSink.awaitTermination()
    UpsertSink.awaitTermination()

Using the above code, deleteSink is executed after UpsertSink.

Answer 1: If you want to have two streams running …
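The answer is cut off above, but the usual fix is to let both queries run concurrently instead of blocking on the first one. A minimal sketch of that pattern, reusing the question's ds1, ds2, and mydbsink (none of which are defined here); this is not necessarily the original answer's code:

```scala
// Sketch: start both queries first, then wait on the shared
// StreamingQueryManager instead of on the first query alone.
val deleteSink = ds1.writeStream
  .outputMode("update")
  .foreach(mydbsink)
  .start()

val upsertSink = ds2.writeStream
  .outputMode("update")
  .foreach(mydbsink)
  .start()

// Blocks until any active query terminates or fails.
spark.streams.awaitAnyTermination()
```

Note that start() is non-blocking, so the two queries above run in parallel with no ordering guarantee between them. If the deletes must strictly precede the upserts for each micro-batch, the two datasets would instead have to be combined into a single query (e.g. with foreachBatch, available from Spark 2.4).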

Sending JSON events to Kafka in non-stringified format

最后都变了- submitted on 2020-12-15 05:45:28
Question: I have created a dataframe like the one below, where I used the to_json() method to create a JSON array value.

    +-----------------------------------------------------------------------------------------------------------+
    |json_data                                                                                                  |
    +-----------------------------------------------------------------------------------------------------------+
    |{"name":"sensor1","value-array":[{"time":"2020-11-27T01:01:00.000Z","sensorvalue":11.0,"tag1":"tagvalue"}]}|
    +-----------------------------------------------------------------------------------------------------------+

…
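The question is truncated, but the usual cause of "stringified" (escaped) JSON on the Kafka side is serializing an already-serialized string a second time. A hedged sketch, assuming a flat input df with name/time/sensorvalue/tag1 columns (the real schema is not shown) and placeholder Kafka settings: build the nested structure with struct/array and call to_json exactly once on the result.

```scala
import org.apache.spark.sql.functions.{array, col, struct, to_json}

// Hypothetical column names; adjust to the actual input schema.
val events = df.select(
  to_json(
    struct(
      col("name"),
      array(
        struct(col("time"), col("sensorvalue"), col("tag1"))
      ).as("value-array")
    )
  ).as("value") // the Kafka sink reads the payload from a "value" column
)

events.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // placeholder address
  .option("topic", "sensor-events")               // placeholder topic
  .option("checkpointLocation", "/tmp/ckpt")      // placeholder path
  .start()
```

Calling to_json once over the whole struct produces a single clean JSON object; wrapping an already-stringified column inside another to_json is what yields the escaped, double-encoded form.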

How to control output files size in Spark Structured Streaming

。_饼干妹妹 submitted on 2020-12-12 10:13:27
Question: We're considering using Spark Structured Streaming on a project. The input and output are Parquet files on an S3 bucket. Is it possible to control the size of the output files somehow? We're aiming at output files of 10-100 MB. As I understand it, in the traditional batch approach we could control output file sizes by adjusting the number of partitions according to the size of the input dataset. Is something similar possible in Structured Streaming?

Answer 1: In Spark 2.2 or later the optimal …
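The answer is truncated, but the Spark 2.2+ knob it most likely refers to is a per-file record cap. A minimal sketch, assuming records of a roughly known average size; spark.sql.files.maxRecordsPerFile is a real setting, while the paths and the cap value are placeholders:

```scala
// Cap each output file at roughly N records; derive N from the average
// serialized record size (e.g. target bytes / bytes per record).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)

val query = df.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/output")            // placeholder bucket
  .option("checkpointLocation", "s3a://my-bucket/ckpt") // placeholder path
  .start()
```

Since the cap is expressed in records rather than bytes, the 10-100 MB target has to be translated via the average record size, and compression will still make actual file sizes vary; treat the setting as an upper bound rather than an exact size.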

How to write two streaming df's into two different tables in MySQL in Spark structured streaming?

时光总嘲笑我的痴心妄想 submitted on 2020-12-04 08:59:47
Question: I am using Spark version 2.3.2. I have written code in Spark Structured Streaming to insert streaming dataframes' data into two different MySQL tables. Let's say there are two streaming DFs: DF1 and DF2. I have written two queries (query1, query2) using the foreachWriter API to write from the two streams into the MySQL tables respectively, i.e. DF1 into MySQL table A and DF2 into MySQL table B. When I run the Spark job, it first runs query1 and then query2, so it's writing to table A but not into table …
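The question cuts off, but the symptom (the second query never writing) typically comes from awaiting the first query before the second is started. A hedged sketch, with writerA and writerB standing in for the question's undeclared ForeachWriter implementations:

```scala
// Sketch: both queries are started before anything blocks the driver.
val query1 = DF1.writeStream
  .outputMode("append")
  .foreach(writerA) // ForeachWriter targeting MySQL table A
  .start()

val query2 = DF2.writeStream
  .outputMode("append")
  .foreach(writerB) // ForeachWriter targeting MySQL table B
  .start()

// Wait on the manager, not on query1 alone, so both queries stay alive.
spark.streams.awaitAnyTermination()
```

If query1.awaitTermination() sits between the two start() calls, the driver blocks on the first (unbounded) query and query2 never starts; starting both queries first and then waiting on spark.streams keeps both running.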

How to perform unit testing on Spark Structured Streaming?

南笙酒味 submitted on 2020-11-29 10:56:26
Question: I would like to know about the unit-testing side of Spark Structured Streaming. My scenario is: I am getting data from Kafka, consuming it with Spark Structured Streaming, and applying some transformations on top of the data. I am not sure how I can test this using Scala and Spark. Can someone tell me how to do unit testing in Structured Streaming using Scala? I am new to streaming.

Answer 1: tl;dr Use MemoryStream to add events and a memory sink for the output. The following code …
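The answer is truncated after "The following code"; below is a minimal self-contained sketch of the MemoryStream-plus-memory-sink approach it names, with a stand-in uppercase map in place of the question's (unshown) transformations:

```scala
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("structured-streaming-test")
  .getOrCreate()
import spark.implicits._
implicit val sqlCtx: SQLContext = spark.sqlContext

// Feed test events through an in-memory source instead of Kafka.
val input = MemoryStream[String]

// Stand-in for the transformation under test.
val transformed = input.toDS().map(_.toUpperCase)

val query = transformed.writeStream
  .format("memory")          // results land in an in-memory table
  .queryName("test_output")
  .outputMode("append")
  .start()

input.addData("a", "b")
query.processAllAvailable()  // deterministic: waits until the batch is done

val result = spark.table("test_output").as[String].collect()
assert(result.sorted.sameElements(Array("A", "B")))

query.stop()
```

The pattern keeps the test deterministic: addData enqueues a batch, processAllAvailable() blocks until it has been fully processed, and the memory sink exposes the output as a regular table that can be asserted against. The Kafka-specific plumbing is deliberately excluded; only the transformation logic is under test.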