delta-lake

How to stream data from Kafka topic to Delta table using Spark Structured Streaming

纵饮孤独 submitted on 2021-02-04 18:09:05
Question: I'm trying to understand Databricks Delta and am planning a POC using Kafka. Basically, the plan is to consume data from Kafka and insert it into a Databricks Delta table. These are the steps that I did:

Create a Delta table on Databricks:

    %sql
    CREATE TABLE hazriq_delta_trial2 (
      value STRING
    )
    USING delta
    LOCATION '/delta/hazriq_delta_trial2'

Consume data from Kafka:

    import org.apache.spark.sql.types._
    val kafkaBrokers = "broker1:port,broker2:port,broker3:port"
    val kafkaTopic = "kafkapoc"
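A minimal sketch of how the remaining steps of this POC might be wired together: read the topic with Structured Streaming and append the raw value to the Delta table created above. The broker list, topic, and table location are the placeholders from the question; the checkpoint path is an assumption.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().getOrCreate()  // on Databricks, `spark` is already provided
    import spark.implicits._

    val kafkaBrokers = "broker1:port,broker2:port,broker3:port"
    val kafkaTopic = "kafkapoc"

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaBrokers)
      .option("subscribe", kafkaTopic)
      .option("startingOffsets", "earliest")
      .load()

    // Kafka delivers the payload as binary, so cast it to match the table's `value STRING` column.
    kafkaDf
      .select($"value".cast(StringType))
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/delta/hazriq_delta_trial2/_checkpoint")  // assumed path
      .start("/delta/hazriq_delta_trial2")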

Read file path from Kafka topic and then read file and write to DeltaLake in Structured Streaming

删除回忆录丶 submitted on 2021-01-24 18:56:54
Question: I have a use case where the file paths of JSON records stored in S3 arrive as Kafka messages. I have to process the data using Spark Structured Streaming. The design I had in mind is as follows: in Spark Structured Streaming, read the Kafka message containing the data path; collect the message records on the driver (the messages are small in size); create the DataFrame from the data location.

    kafkaDf.select($"value".cast(StringType))
      .writeStream
      .foreachBatch((batchDf: DataFrame,
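A minimal sketch of how this design could be completed, assuming each Kafka message value carries one S3 path. The broker address, topic name, target Delta location, and checkpoint path are illustrative placeholders, not values from the question.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.types.StringType

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // For each micro-batch: collect the (small) path messages on the driver,
    // read the JSON files they point to, and append them to the Delta table.
    def processPaths(batchDf: DataFrame, batchId: Long): Unit = {
      val paths = batchDf.select("path").as[String].collect()
      if (paths.nonEmpty) {
        spark.read.json(paths: _*)
          .write
          .format("delta")
          .mode("append")
          .save("/delta/json_records")                        // assumed target location
      }
    }

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")      // assumed broker
      .option("subscribe", "file-paths-topic")                 // assumed topic name
      .load()

    kafkaDf
      .select($"value".cast(StringType).as("path"))
      .writeStream
      .foreachBatch(processPaths _)
      .option("checkpointLocation", "/delta/json_records/_checkpoint")  // assumed path
      .start()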

How to get the last 1 hour of data, every 5 minutes, without grouping?

僤鯓⒐⒋嵵緔 submitted on 2020-12-30 03:13:27
Question: How can I trigger every 5 minutes and get the data for the last 1 hour? I came up with this, but it does not seem to give me all the rows from the last 1 hour. My reasoning is: read the stream, filter the data for the last 1 hour based on the timestamp column, and write/print it using foreachBatch, and watermark it so that it does not hold on to all the past data.

    spark.readStream.format("delta").table("xxx")
      .withWatermark("ts", "60 minutes")
      .filter($"ts" > current_timestamp - expr("INTERVAL 60 minutes"))
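A self-contained sketch of the query as described above, with the 5-minute trigger and foreachBatch pieces filled in. The table name "xxx" comes from the question; the checkpoint location and the batch handler are assumed placeholders. Note that a streaming read of a Delta table only receives rows added since the previous micro-batch, which is likely why this query does not return every row from the last hour.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Whatever should happen with the last-hour slice on every trigger.
    def handleBatch(batchDf: DataFrame, batchId: Long): Unit = {
      batchDf.show(truncate = false)
    }

    spark.readStream
      .format("delta")
      .table("xxx")
      .withWatermark("ts", "60 minutes")
      .filter($"ts" > current_timestamp() - expr("INTERVAL 60 minutes"))
      .writeStream
      .trigger(Trigger.ProcessingTime("5 minutes"))
      .foreachBatch(handleBatch _)
      .option("checkpointLocation", "/tmp/checkpoints/last-hour")  // assumed path
      .start()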
