spark-streaming

How to safely restart Airflow and kill a long-running task?

删除回忆录丶 submitted on 2021-01-07 06:18:58
Question: I have Airflow running in Kubernetes using the CeleryExecutor. Airflow submits and monitors Spark jobs using the DatabricksOperator. My streaming Spark jobs have a very long runtime (they run forever unless they fail or are cancelled). When pods for the Airflow worker are killed while a streaming job is running, the following happens: the associated task becomes a zombie (running state, but no process with a heartbeat); the task is marked as failed when Airflow reaps zombies; the Spark streaming job continues
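For illustration, a minimal Airflow DAG sketch of the setup described above, assuming Airflow 2.x with the Databricks provider installed; the question only says "DatabricksOperator", so DatabricksSubmitRunOperator is used here as a stand-in, and the dag id, connection id, cluster spec, and main class are placeholders, not taken from the original post. When the Celery worker pod running this task dies, the local task process (and its heartbeat) disappears while the submitted Databricks run keeps going, which is the zombie situation described above.

# Hypothetical DAG: Airflow submits a long-running Spark streaming job to
# Databricks; the run only ends if the job fails or is cancelled.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="streaming_job_example",          # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,                  # run on demand; the streaming job itself never ends
    catchup=False,
) as dag:
    submit_streaming_job = DatabricksSubmitRunOperator(
        task_id="submit_streaming_job",
        databricks_conn_id="databricks_default",   # assumed connection id
        json={
            "new_cluster": {
                "spark_version": "7.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            # Hypothetical entry point for the streaming application.
            "spark_jar_task": {"main_class_name": "com.example.StreamingJob"},
        },
    )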

A bad issue with kafka and Spark Streaming on Python

拟墨画扇 submitted on 2021-01-07 02:45:47
Question: N.B. This is NOT the same issue that I had in my first post on this site, but it is the same project. I'm ingesting some files into PostgreSQL from Kafka using Spark Streaming. These are my steps for the project: (1) creating a script for the Kafka producer (done, it works fine); (2) creating a Python script that reads files from the Kafka producer; (3) sending files to PostgreSQL. For the connection between Python and PostgreSQL I use psycopg2. I am also using Python 3 and Java jdk1.8.0_261 and
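A minimal PySpark sketch of steps 2 and 3 as described (reading records from Kafka with Structured Streaming and inserting them into PostgreSQL with psycopg2 inside foreachBatch); the broker address, topic, table, and credentials are placeholders, and the spark-sql-kafka package is assumed to be on the classpath:

# Sketch: read records from Kafka with Structured Streaming and insert them
# into PostgreSQL via psycopg2 inside foreachBatch. Topic, table, and
# credentials below are placeholders, not from the original post.
import psycopg2
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-postgres").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "files")            # assumed topic name
    .load()
    .selectExpr("CAST(value AS STRING) AS content")
)

def write_to_postgres(batch_df, batch_id):
    # Collect the micro-batch on the driver and insert row by row.
    # Fine for small batches; larger volumes would call for foreachPartition
    # or the JDBC writer instead.
    rows = batch_df.collect()
    conn = psycopg2.connect(
        host="localhost", dbname="mydb", user="me", password="secret"  # placeholders
    )
    try:
        with conn, conn.cursor() as cur:
            for row in rows:
                cur.execute(
                    "INSERT INTO ingested_files (content) VALUES (%s)",
                    (row["content"],),
                )
    finally:
        conn.close()

query = raw.writeStream.foreachBatch(write_to_postgres).start()
query.awaitTermination()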

How to get Last 1 hour data, every 5 minutes, without grouping?

僤鯓⒐⒋嵵緔 submitted on 2020-12-30 03:13:27
Question: How do I trigger every 5 minutes and get the data for the last 1 hour? I came up with this but it does not seem to give me all the rows in the last 1 hour. My reasoning is: read the stream, filter the data for the last 1 hour based on the timestamp column, and write/print using foreachBatch, and watermark it so that it does not hold on to all the past data. spark.readStream.format("delta").table("xxx") .withWatermark("ts", "60 minutes") .filter($"ts" > current_timestamp - expr("INTERVAL 60 minutes"))
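For reference, a PySpark sketch of the same pattern completed with a sink and a 5-minute trigger (the snippet above is Scala); it assumes Spark 3.1+ so that readStream ... table() is available, keeps the table and column names from the snippet, and fills in everything else as an assumption:

# Sketch of the pattern described above: read the Delta table as a stream,
# keep only rows with ts in the last hour, and process every 5 minutes.
# Note that a streaming micro-batch only carries rows that are new since the
# last trigger, which is one reason the filter alone does not return *all*
# rows from the last hour.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("last-hour-every-5-min").getOrCreate()

stream = (
    spark.readStream.format("delta").table("xxx")
    .withWatermark("ts", "60 minutes")
    .filter(expr("ts > current_timestamp() - INTERVAL 60 minutes"))
)

def show_batch(batch_df, batch_id):
    # Placeholder sink: just print the micro-batch.
    batch_df.show(truncate=False)

query = (
    stream.writeStream
    .foreachBatch(show_batch)
    .trigger(processingTime="5 minutes")
    .start()
)
query.awaitTermination()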

Convert Streaming XML into JSON in Spark

倖福魔咒の submitted on 2020-12-15 05:01:57
Question: I am new to Spark and working on a simple application to convert XML streams received from Kafka into JSON format. Using: Spark 2.4.5, Scala 2.11.12. In my use case the Kafka stream is in XML format. The following is the code that I tried. val spark: SparkSession = SparkSession.builder() .master("local") .appName("Spark Demo") .getOrCreate() spark.sparkContext.setLogLevel("ERROR") val inputStream = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", "localhost:9092") .option(
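As an illustration only (the question's code is Scala), a PySpark sketch of one possible way to turn XML Kafka messages into JSON: parse each value with Python's standard xml.etree module inside a UDF and emit a JSON string. The topic name and the flat record layout are assumptions, and this is a workaround sketch rather than the approach the original code went on to use:

# Sketch: consume XML strings from Kafka and re-emit them as JSON, using a
# plain-Python UDF (xml.etree + json) rather than a dedicated XML package.
# The topic name and the flat <record><field>...</field></record> layout are
# assumptions for illustration.
import json
import xml.etree.ElementTree as ET

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local").appName("xml-to-json").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

input_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "xml-topic")        # assumed topic name
    .load()
)

@udf(returnType=StringType())
def xml_to_json(xml_string):
    # Turn a flat XML record into a JSON object of tag -> text.
    if xml_string is None:
        return None
    root = ET.fromstring(xml_string)
    return json.dumps({child.tag: child.text for child in root})

json_stream = input_stream.select(
    xml_to_json(col("value").cast("string")).alias("json_value")
)

query = json_stream.writeStream.format("console").outputMode("append").start()
query.awaitTermination()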

Get the first elements (take function) of a DStream

别等时光非礼了梦想. submitted on 2020-12-13 09:35:50
Question: I'm looking for a way to retrieve the first elements of a DStream created as: val dstream = ssc.textFileStream(args(1)).map(x => x.split(",").map(_.toDouble)) Unfortunately, there is no take function on a DStream (as there is on an RDD): // dstream.take(2) !!! Does anyone have an idea of how to do it? Thanks. Answer 1: You can use the transform method on the DStream object, then take n elements of the input RDD and save them to a list, then filter the original RDD to keep only the elements contained in this list. This will return a new
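A short PySpark Streaming sketch of the approach the answer describes (the question itself uses Scala): inside transform, take the first n elements of each batch RDD on the driver, then filter the RDD down to just those elements. The input directory and batch interval are placeholders:

# Sketch of the transform/take approach described in the answer: for each
# batch, take the first n elements of the RDD and filter the RDD down to them.
# The path and the batch interval are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-take-sketch")
ssc = StreamingContext(sc, 10)  # 10-second batches (arbitrary)

dstream = ssc.textFileStream("/tmp/input").map(
    lambda line: [float(v) for v in line.split(",")]
)

def first_two(rdd):
    # Runs on the driver for every batch: take 2 elements, then keep only
    # the rows that match one of them.
    head = rdd.take(2)
    return rdd.filter(lambda row: row in head)

first_elements = dstream.transform(first_two)
first_elements.pprint()

ssc.start()
ssc.awaitTermination()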