spark-streaming

Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leaders for Set() - Spark Streaming - Kafka

非 Y 不嫁゛ submitted on 2020-01-25 05:33:25
Question: I am working on a data pipeline which takes tweets from Twitter4j -> publishes those tweets to a topic in Kafka -> Spark Streaming subscribes to those tweets for processing. But when I run the code I get the exception: Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leaders for Set([LiveTweets,0]). The code is: import java.util.HashMap import java.util.Properties import twitter4j._ import twitter4j.FilterQuery; import twitter4j …
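That "Couldn't find leaders" error typically means the topic does not exist yet, or the brokers listed in metadata.broker.list are unreachable or advertise a host name the driver cannot resolve. Below is a minimal PySpark sketch of the Kafka-consuming side, assuming a topic named LiveTweets and a broker at localhost:9092; the question's own code is Scala, so this only illustrates the shape of the direct stream and the checks worth doing first.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Assumptions: topic "LiveTweets", one broker at localhost:9092, 10s batches.
# Before starting the job, verify the topic actually exists and that the
# broker's advertised host/port is resolvable from the driver, e.g.:
#   kafka-topics.sh --create --zookeeper localhost:2181 \
#       --replication-factor 1 --partitions 1 --topic LiveTweets
sc = SparkContext(appName="TwitterKafkaSparkStreaming")
ssc = StreamingContext(sc, 10)

tweets = KafkaUtils.createDirectStream(
    ssc, ["LiveTweets"], {"metadata.broker.list": "localhost:9092"})

# Each element is a (key, value) pair; keep only the tweet payload.
tweets.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()
```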

What is the best approach to join data in a Spark Streaming application?

我怕爱的太早我们不能终老 submitted on 2020-01-23 17:19:37
Question: Essentially, rather than running a join against the C* table for each streaming record, is there any way to run a join per micro-batch of records in Spark Streaming? We have almost finalized on the spark-sql 2.4.x version and the datastax-spark-cassandra-connector for Cassandra 3.x. But we have one fundamental question regarding efficiency in the scenario below. For the streaming data records (i.e. streamingDataSet), I need to look up existing …
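One way to get a per-micro-batch join rather than a per-record lookup with spark-sql 2.4.x is Structured Streaming's foreachBatch: each micro-batch arrives as an ordinary DataFrame that can be joined once against the Cassandra table. A rough sketch, with the keyspace, table and column names standing in for the real schema:

```python
from pyspark.sql import SparkSession

# Requires the datastax spark-cassandra-connector on the classpath.
spark = (SparkSession.builder
         .appName("stream-join-per-microbatch")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

streaming_ds = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load()
                .selectExpr("CAST(key AS STRING) AS id",
                            "CAST(value AS STRING) AS payload"))

def join_with_cassandra(batch_df, batch_id):
    # One lookup per micro-batch: read the Cassandra table once and join it
    # against the whole batch instead of issuing a query per record.
    lookup = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="ks", table="existing_records")
              .load())
    enriched = batch_df.join(lookup, on="id", how="left")
    (enriched.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="ks", table="enriched_records")
     .mode("append")
     .save())

query = (streaming_ds.writeStream
         .foreachBatch(join_with_cassandra)
         .start())
query.awaitTermination()
```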

How to properly use pyspark to send data to a Kafka broker?

↘锁芯ラ submitted on 2020-01-22 06:45:15
Question: I'm trying to write a simple pyspark job which receives data from a Kafka broker topic, does some transformation on that data, and puts the transformed data on a different Kafka broker topic. I have the following code, which reads data from a Kafka topic, but running the sendkafka function has no effect: from pyspark import SparkConf, SparkContext from operator import add import sys from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils import json from …
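A common pattern for the producing side is to create the Kafka producer inside foreachPartition on the executors, one producer per partition, rather than on the driver. A sketch assuming the kafka-python package is installed on the executors; the topic names, broker address and the upper-casing "transformation" are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def send_partition(records):
    from kafka import KafkaProducer  # imported on the executor, not the driver
    producer = KafkaProducer(bootstrap_servers="broker:9092")
    for record in records:
        producer.send("output-topic", value=record.encode("utf-8"))
    producer.flush()
    producer.close()

sc = SparkContext(appName="kafka-to-kafka")
ssc = StreamingContext(sc, 10)

stream = KafkaUtils.createDirectStream(
    ssc, ["input-topic"], {"metadata.broker.list": "broker:9092"})

transformed = stream.map(lambda kv: kv[1].upper())  # placeholder transformation

# foreachRDD/foreachPartition runs on the executors, so the producer has to
# be created there; a producer built on the driver is not serializable.
transformed.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))

ssc.start()
ssc.awaitTermination()
```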

Spark Streaming app gets stuck while writing to and reading from Cassandra simultaneously

偶尔善良 submitted on 2020-01-17 08:11:28
Question: I was doing some benchmarking that consists of the following data flow: Kafka --> Spark Streaming --> Cassandra --> PrestoDB. Infrastructure: my Spark Streaming application runs on 4 executors (2 cores and 4g of memory each). Each executor runs on a datanode where Cassandra is installed. 4 PrestoDB workers are also co-located on the datanodes. My cluster has 5 nodes, each of them with an Intel Core i5, 32GB of DDR3 RAM, a 500GB SSD and a 1 gigabit network. Spark Streaming application: My Spark …
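For reference, a stripped-down sketch of the Kafka -> Spark -> Cassandra leg of such a pipeline, using Structured Streaming's foreachBatch and the DataFrame API of the spark-cassandra-connector; the keyspace, table, topic and host names are assumptions. Keeping the write path this simple can help isolate whether a stall comes from Cassandra write pressure or from the concurrent Presto reads:

```python
from pyspark.sql import SparkSession

# Assumes the spark-cassandra-connector and the Kafka SQL source are on the
# classpath; "benchmark"/"events" and the host names are placeholders.
spark = (SparkSession.builder
         .appName("kafka-to-cassandra-benchmark")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(key AS STRING) AS id",
                      "CAST(value AS STRING) AS payload",
                      "timestamp"))

def write_to_cassandra(batch_df, batch_id):
    # One append per micro-batch keeps the write path easy to reason about.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="benchmark", table="events")
     .mode("append")
     .save())

(events.writeStream
 .foreachBatch(write_to_cassandra)
 .option("checkpointLocation", "/tmp/checkpoints/kafka-to-cassandra")
 .start()
 .awaitTermination())
```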

How to integrate Ganglia for Spark 2.1 job metrics; Spark is ignoring the Ganglia metrics config

让人想犯罪 __ submitted on 2020-01-17 08:00:12
Question: I am trying to integrate Spark 2.1 job metrics with Ganglia. My spark-defaults.conf looks like:

*.sink.ganglia.class org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.name Name
*.sink.ganglia.host $MASTERIP
*.sink.ganglia.port $PORT
*.sink.ganglia.mode unicast
*.sink.ganglia.period 10
*.sink.ganglia.unit seconds

When I submit my job I can see the warnings:

Warning: Ignoring non-spark config property: *.sink.ganglia.host=host
Warning: Ignoring non-spark config property: *.sink.ganglia.name …
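The warning appears because the *.sink.ganglia.* keys are metrics properties, not spark.* properties, so spark-defaults.conf is the wrong place for them. One sketch of a fix (the file path, host and port are assumptions): keep them in a metrics.properties file, point Spark at it via spark.metrics.conf, and make sure the separate spark-ganglia-lgpl module, which contains GangliaSink, is on the classpath:

```python
from pyspark import SparkConf, SparkContext

# Contents of the assumed /path/to/metrics.properties:
#   *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
#   *.sink.ganglia.host=ganglia-host
#   *.sink.ganglia.port=8649
#   *.sink.ganglia.mode=unicast
#   *.sink.ganglia.period=10
#   *.sink.ganglia.unit=seconds
#
# spark.metrics.conf tells Spark where to find that file; the same setting
# can also be passed on the spark-submit command line with --conf.
conf = (SparkConf()
        .setAppName("ganglia-metrics")
        .set("spark.metrics.conf", "/path/to/metrics.properties"))
sc = SparkContext(conf=conf)
```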

Kafka stream to Spark stream in Python

青春壹個敷衍的年華 submitted on 2020-01-15 12:15:08
Question: We have a Kafka stream which uses Avro. I need to connect it to a Spark stream. I use the code below, as Lev G suggested: kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers}, valueDecoder=MessageSerializer.decode_message) I get the error below when I execute it through spark-submit: 2018-10-09 10:49:27 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Requesting driver to remove executor 12 for reason Container marked as failed: container_1537396420651_0008_01_000013 on …
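For completeness, a sketch of how that valueDecoder is usually wired up with the Confluent Avro serializer and a Schema Registry; the registry URL, topic and broker names are assumptions, and the confluent-kafka[avro] package has to be installed on the executors (missing executor-side dependencies are one common reason containers keep failing as in the warning above):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer

# Hypothetical Schema Registry URL; decode_message needs the registry to
# look up the writer schema embedded in each Confluent-framed message.
schema_registry = CachedSchemaRegistryClient("http://schema-registry:8081")
serializer = MessageSerializer(schema_registry)

def avro_decoder(value):
    # Kafka hands the raw bytes to the decoder; the serializer returns a dict.
    return serializer.decode_message(value) if value is not None else None

sc = SparkContext(appName="avro-kafka-stream")
ssc = StreamingContext(sc, 10)

kvs = KafkaUtils.createDirectStream(
    ssc, ["avro-topic"], {"metadata.broker.list": "broker:9092"},
    valueDecoder=avro_decoder)

kvs.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()
```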

Reading Avro messages from Kafka in Spark Streaming / Structured Streaming

Deadly submitted on 2020-01-15 10:07:09
Question: I am using pyspark for the first time. Spark version: 2.3.0. Kafka version: 2.2.0. I have a Kafka producer which sends nested data in Avro format, and I am trying to write code in Spark Streaming / Structured Streaming in pyspark that will deserialize the Avro coming from Kafka into a dataframe, do transformations, and write it in Parquet format to S3. I was able to find Avro converters in Spark/Scala, but support in pyspark has not yet been added. How do I do the same in pyspark? Thanks.
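Since Spark 2.3's pyspark has no built-in from_avro, one workaround is to decode the Avro bytes in a Python UDF (here with fastavro) and continue with ordinary DataFrame transformations before writing Parquet to S3. The schema, topic, broker and S3 paths below are placeholders:

```python
import io
import json

import fastavro
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical writer schema standing in for the real nested Avro schema.
value_schema = {
    "type": "record",
    "name": "Event",
    "fields": [{"name": "id", "type": "string"},
               {"name": "amount", "type": "double"}],
}

def decode_avro(raw_bytes):
    if raw_bytes is None:
        return None
    # If the producer uses the Confluent wire format, strip the 5-byte
    # magic-byte/schema-id prefix first: raw_bytes = raw_bytes[5:]
    record = fastavro.schemaless_reader(io.BytesIO(raw_bytes), value_schema)
    return json.dumps(record)

decode_avro_udf = udf(decode_avro, StringType())

spark = SparkSession.builder.appName("avro-kafka-to-parquet").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "avro-topic")
       .load())

decoded = raw.select(decode_avro_udf("value").alias("json_value"))

query = (decoded.writeStream
         .format("parquet")
         .option("path", "s3a://bucket/events/")
         .option("checkpointLocation", "s3a://bucket/checkpoints/events/")
         .start())
query.awaitTermination()
```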

Online (incremental) logistic regression in Spark [duplicate]

笑着哭i submitted on 2020-01-15 08:16:10
Question: This question already has answers here: Whether we can update existing model in spark-ml/spark-mllib? (2 answers). Closed 11 months ago. In Spark MLlib (the RDD-based API) there is StreamingLogisticRegressionWithSGD for incremental training of a logistic regression model. However, this class has been deprecated and offers little functionality (e.g. no access to model coefficients and output probabilities). In Spark ML (the DataFrame-based API) I only find the class LogisticRegression, having only …
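For reference, a minimal sketch of the RDD-based streaming estimator the question mentions; the queueStream with two hand-made LabeledPoints stands in for a real labelled feature stream, and latestModel().weights is the closest it gets to exposing coefficients:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="online-logreg")
ssc = StreamingContext(sc, 5)

num_features = 3
training_batches = [sc.parallelize([
    LabeledPoint(1.0, Vectors.dense([1.0, 0.5, -0.2])),
    LabeledPoint(0.0, Vectors.dense([-0.7, 0.1, 0.9])),
])]
training_stream = ssc.queueStream(training_batches)

model = StreamingLogisticRegressionWithSGD(stepSize=0.1, numIterations=20)
model.setInitialWeights(Vectors.dense([0.0] * num_features))

# trainOn updates the model with every micro-batch; latestModel() exposes
# the current weight vector after each update.
model.trainOn(training_stream)
training_stream.foreachRDD(
    lambda rdd: print(model.latestModel().weights))

ssc.start()
ssc.awaitTermination(timeout=30)
```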

Running in “deadlock” while doing streaming aggregations from Kafka

别来无恙 submitted on 2020-01-15 07:12:52
Question: I posted another question on a similar topic a few days ago: How to load history data when starting Spark Streaming process, and calculate running aggregations. I managed to get at least a "working" solution now, meaning that the process itself seems to work correctly. But, as I am a complete beginner concerning Spark, I seem to have missed some things about how to build this kind of application correctly (performance-/computation-wise)... What I want to do: Load history data from …
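One sketch of seeding running aggregations with history data: load the historical per-key totals once at startup and hand them to updateStateByKey as the initialRDD, so the stream only has to carry the deltas. The file path, topic, broker and the (key, amount) record layout are assumptions:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="history-plus-streaming-aggregation")
ssc = StreamingContext(sc, 10)
ssc.checkpoint("/tmp/checkpoints/history-agg")  # required for stateful ops

# Historical per-key totals, e.g. precomputed offline and stored as CSV.
history_rdd = (sc.textFile("/data/history.csv")
               .map(lambda line: line.split(","))
               .map(lambda parts: (parts[0], float(parts[1]))))

stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker:9092"})

live_pairs = (stream.map(lambda kv: kv[1].split(","))
              .map(lambda parts: (parts[0], float(parts[1]))))

def update_running_total(new_values, running_total):
    return sum(new_values) + (running_total or 0.0)

# initialRDD folds the history into the running state at startup, so every
# later micro-batch only adds its own contribution.
running_totals = live_pairs.updateStateByKey(
    update_running_total, initialRDD=history_rdd)

running_totals.pprint()

ssc.start()
ssc.awaitTermination()
```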
