delta-lake

How to stream data from Kafka topic to Delta table using Spark Structured Streaming

纵饮孤独 submitted on 2021-02-04 18:09:05
Question: I'm trying to understand Databricks Delta and am planning a POC using Kafka. Basically, the plan is to consume data from Kafka and insert it into a Databricks Delta table. These are the steps that I did:

Create a Delta table on Databricks:

    %sql
    CREATE TABLE hazriq_delta_trial2 (
      value STRING
    )
    USING delta
    LOCATION '/delta/hazriq_delta_trial2'

Consume data from Kafka:

    import org.apache.spark.sql.types._
    val kafkaBrokers = "broker1:port,broker2:port,broker3:port"
    val kafkaTopic = "kafkapoc"
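A minimal sketch of how the remaining steps of this POC might be wired together: read the topic with Structured Streaming and append the raw value to the Delta table created above. The broker list, topic, and table location are the placeholders from the question; the checkpoint path is an assumption.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().getOrCreate()  // on Databricks, `spark` is already provided
    import spark.implicits._

    val kafkaBrokers = "broker1:port,broker2:port,broker3:port"
    val kafkaTopic = "kafkapoc"

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaBrokers)
      .option("subscribe", kafkaTopic)
      .option("startingOffsets", "earliest")
      .load()

    // Kafka delivers the payload as binary, so cast it to match the table's `value STRING` column.
    kafkaDf
      .select($"value".cast(StringType))
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/delta/hazriq_delta_trial2/_checkpoint")  // assumed path
      .start("/delta/hazriq_delta_trial2")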

Read file path from Kafka topic and then read file and write to DeltaLake in Structured Streaming

删除回忆录丶 submitted on 2021-01-24 18:56:54
Question: I have a use case where the file paths of JSON records stored in S3 arrive as Kafka messages. I have to process the data using Spark Structured Streaming. The design I had in mind is as follows: in Spark Structured Streaming, read the Kafka message containing the data path; collect the message records on the driver (the messages are small in size); create the DataFrame from the data location.

    kafkaDf.select($"value".cast(StringType))
      .writeStream
      .foreachBatch((batchDf: DataFrame,
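A minimal sketch of how this design could be completed, assuming each Kafka message value carries one S3 path. The broker address, topic name, target Delta location, and checkpoint path are illustrative placeholders, not values from the question.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.types.StringType

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // For each micro-batch: collect the (small) path messages on the driver,
    // read the JSON files they point to, and append them to the Delta table.
    def processPaths(batchDf: DataFrame, batchId: Long): Unit = {
      val paths = batchDf.select("path").as[String].collect()
      if (paths.nonEmpty) {
        spark.read.json(paths: _*)
          .write
          .format("delta")
          .mode("append")
          .save("/delta/json_records")                        // assumed target location
      }
    }

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")      // assumed broker
      .option("subscribe", "file-paths-topic")                 // assumed topic name
      .load()

    kafkaDf
      .select($"value".cast(StringType).as("path"))
      .writeStream
      .foreachBatch(processPaths _)
      .option("checkpointLocation", "/delta/json_records/_checkpoint")  // assumed path
      .start()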

How to get the last 1 hour of data, every 5 minutes, without grouping?

僤鯓⒐⒋嵵緔 submitted on 2020-12-30 03:13:27
Question: How can I trigger every 5 minutes and get the data for the last 1 hour? I came up with this, but it does not seem to give me all the rows from the last 1 hour. My reasoning is: read the stream, filter the data for the last 1 hour based on the timestamp column, and write/print it using foreachBatch, and watermark it so that it does not hold on to all the past data.

    spark.readStream.format("delta").table("xxx")
      .withWatermark("ts", "60 minutes")
      .filter($"ts" > current_timestamp - expr("INTERVAL 60 minutes"))
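A self-contained sketch of the query as described above, with the 5-minute trigger and foreachBatch pieces filled in. The table name "xxx" comes from the question; the checkpoint location and the batch handler are assumed placeholders. Note that a streaming read of a Delta table only receives rows added since the previous micro-batch, which is likely why this query does not return every row from the last hour.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Whatever should happen with the last-hour slice on every trigger.
    def handleBatch(batchDf: DataFrame, batchId: Long): Unit = {
      batchDf.show(truncate = false)
    }

    spark.readStream
      .format("delta")
      .table("xxx")
      .withWatermark("ts", "60 minutes")
      .filter($"ts" > current_timestamp() - expr("INTERVAL 60 minutes"))
      .writeStream
      .trigger(Trigger.ProcessingTime("5 minutes"))
      .foreachBatch(handleBatch _)
      .option("checkpointLocation", "/tmp/checkpoints/last-hour")  // assumed path
      .start()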
