spark-streaming

Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leaders for Set() - Spark Streaming - Kafka

非 Y 不嫁゛ submitted on 2020-01-25 05:33:25
Question: I am working on a data pipeline which takes tweets from Twitter4j -> publishes those tweets to a topic in Kafka -> Spark Streaming subscribes to those tweets for processing. But when I run the code I get the exception: Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leaders for Set([LiveTweets,0]). The code is: import java.util.HashMap import java.util.Properties import twitter4j._ import twitter4j.FilterQuery; import twitter4j …
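That "Couldn't find leaders" error typically means the topic does not exist yet, or the brokers listed in metadata.broker.list are unreachable or advertise a host name the driver cannot resolve. Below is a minimal PySpark sketch of the Kafka-consuming side, assuming a topic named LiveTweets and a broker at localhost:9092; the question's own code is Scala, so this only illustrates the shape of the direct stream and the checks worth doing first.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Assumptions: topic "LiveTweets", one broker at localhost:9092, 10s batches.
# Before starting the job, verify the topic actually exists and that the
# broker's advertised host/port is resolvable from the driver, e.g.:
#   kafka-topics.sh --create --zookeeper localhost:2181 \
#       --replication-factor 1 --partitions 1 --topic LiveTweets
sc = SparkContext(appName="TwitterKafkaSparkStreaming")
ssc = StreamingContext(sc, 10)

tweets = KafkaUtils.createDirectStream(
    ssc, ["LiveTweets"], {"metadata.broker.list": "localhost:9092"})

# Each element is a (key, value) pair; keep only the tweet payload.
tweets.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()
```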

What is the best approach to join data in a Spark Streaming application?

我怕爱的太早我们不能终老 submitted on 2020-01-23 17:19:37
Question: Essentially, rather than running a join against the C* table for each streaming record, is there any way to run a join per micro-batch of records in Spark Streaming? We have almost finalized on the spark-sql 2.4.x version and the datastax-spark-cassandra-connector for Cassandra 3.x. But we have one fundamental question regarding efficiency in the scenario below. For the streaming data records (i.e. streamingDataSet), I need to look up existing …
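One way to get a per-micro-batch join rather than a per-record lookup with spark-sql 2.4.x is Structured Streaming's foreachBatch: each micro-batch arrives as an ordinary DataFrame that can be joined once against the Cassandra table. A rough sketch, with the keyspace, table and column names standing in for the real schema:

```python
from pyspark.sql import SparkSession

# Requires the datastax spark-cassandra-connector on the classpath.
spark = (SparkSession.builder
         .appName("stream-join-per-microbatch")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

streaming_ds = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load()
                .selectExpr("CAST(key AS STRING) AS id",
                            "CAST(value AS STRING) AS payload"))

def join_with_cassandra(batch_df, batch_id):
    # One lookup per micro-batch: read the Cassandra table once and join it
    # against the whole batch instead of issuing a query per record.
    lookup = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="ks", table="existing_records")
              .load())
    enriched = batch_df.join(lookup, on="id", how="left")
    (enriched.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="ks", table="enriched_records")
     .mode("append")
     .save())

query = (streaming_ds.writeStream
         .foreachBatch(join_with_cassandra)
         .start())
query.awaitTermination()
```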

How to properly use pyspark to send data to a Kafka broker?

↘锁芯ラ submitted on 2020-01-22 06:45:15
Question: I'm trying to write a simple pyspark job which receives data from a Kafka broker topic, does some transformation on that data, and puts the transformed data on a different Kafka broker topic. I have the following code, which reads data from a Kafka topic, but running the sendkafka function has no effect: from pyspark import SparkConf, SparkContext from operator import add import sys from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils import json from …
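A common pattern for the producing side is to create the Kafka producer inside foreachPartition on the executors, one producer per partition, rather than on the driver. A sketch assuming the kafka-python package is installed on the executors; the topic names, broker address and the upper-casing "transformation" are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def send_partition(records):
    from kafka import KafkaProducer  # imported on the executor, not the driver
    producer = KafkaProducer(bootstrap_servers="broker:9092")
    for record in records:
        producer.send("output-topic", value=record.encode("utf-8"))
    producer.flush()
    producer.close()

sc = SparkContext(appName="kafka-to-kafka")
ssc = StreamingContext(sc, 10)

stream = KafkaUtils.createDirectStream(
    ssc, ["input-topic"], {"metadata.broker.list": "broker:9092"})

transformed = stream.map(lambda kv: kv[1].upper())  # placeholder transformation

# foreachRDD/foreachPartition runs on the executors, so the producer has to
# be created there; a producer built on the driver is not serializable.
transformed.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))

ssc.start()
ssc.awaitTermination()
```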

Spark Streaming app gets stuck while writing to and reading from Cassandra simultaneously

偶尔善良 submitted on 2020-01-17 08:11:28
Question: I was doing some benchmarking that consists of the following data flow: Kafka --> Spark Streaming --> Cassandra --> PrestoDB. Infrastructure: my Spark Streaming application runs on 4 executors (2 cores and 4g of memory each). Each executor runs on a datanode where Cassandra is installed. 4 PrestoDB workers are also co-located on the datanodes. My cluster has 5 nodes, each of them with an Intel Core i5, 32GB of DDR3 RAM, a 500GB SSD and a 1 gigabit network. Spark Streaming application: My Spark …
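For reference, a stripped-down sketch of the Kafka -> Spark -> Cassandra leg of such a pipeline, using Structured Streaming's foreachBatch and the DataFrame API of the spark-cassandra-connector; the keyspace, table, topic and host names are assumptions. Keeping the write path this simple can help isolate whether a stall comes from Cassandra write pressure or from the concurrent Presto reads:

```python
from pyspark.sql import SparkSession

# Assumes the spark-cassandra-connector and the Kafka SQL source are on the
# classpath; "benchmark"/"events" and the host names are placeholders.
spark = (SparkSession.builder
         .appName("kafka-to-cassandra-benchmark")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(key AS STRING) AS id",
                      "CAST(value AS STRING) AS payload",
                      "timestamp"))

def write_to_cassandra(batch_df, batch_id):
    # One append per micro-batch keeps the write path easy to reason about.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="benchmark", table="events")
     .mode("append")
     .save())

(events.writeStream
 .foreachBatch(write_to_cassandra)
 .option("checkpointLocation", "/tmp/checkpoints/kafka-to-cassandra")
 .start()
 .awaitTermination())
```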

How to integrate Ganglia for Spark 2.1 job metrics; Spark is ignoring the Ganglia metrics config

让人想犯罪 __ submitted on 2020-01-17 08:00:12
Question: I am trying to integrate Spark 2.1 job metrics with Ganglia. My spark-defaults.conf looks like:

*.sink.ganglia.class org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.name Name
*.sink.ganglia.host $MASTERIP
*.sink.ganglia.port $PORT
*.sink.ganglia.mode unicast
*.sink.ganglia.period 10
*.sink.ganglia.unit seconds

When I submit my job I can see the warnings:

Warning: Ignoring non-spark config property: *.sink.ganglia.host=host
Warning: Ignoring non-spark config property: *.sink.ganglia.name …
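The warning appears because the *.sink.ganglia.* keys are metrics properties, not spark.* properties, so spark-defaults.conf is the wrong place for them. One sketch of a fix (the file path, host and port are assumptions): keep them in a metrics.properties file, point Spark at it via spark.metrics.conf, and make sure the separate spark-ganglia-lgpl module, which contains GangliaSink, is on the classpath:

```python
from pyspark import SparkConf, SparkContext

# Contents of the assumed /path/to/metrics.properties:
#   *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
#   *.sink.ganglia.host=ganglia-host
#   *.sink.ganglia.port=8649
#   *.sink.ganglia.mode=unicast
#   *.sink.ganglia.period=10
#   *.sink.ganglia.unit=seconds
#
# spark.metrics.conf tells Spark where to find that file; the same setting
# can also be passed on the spark-submit command line with --conf.
conf = (SparkConf()
        .setAppName("ganglia-metrics")
        .set("spark.metrics.conf", "/path/to/metrics.properties"))
sc = SparkContext(conf=conf)
```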

Kafka stream to Spark stream in Python

青春壹個敷衍的年華 submitted on 2020-01-15 12:15:08
Question: We have a Kafka stream which uses Avro. I need to connect it to a Spark stream. I use the code below, as Lev G suggested: kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers}, valueDecoder=MessageSerializer.decode_message) I get the error below when I execute it through spark-submit: 2018-10-09 10:49:27 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Requesting driver to remove executor 12 for reason Container marked as failed: container_1537396420651_0008_01_000013 on …
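For completeness, a sketch of how that valueDecoder is usually wired up with the Confluent Avro serializer and a Schema Registry; the registry URL, topic and broker names are assumptions, and the confluent-kafka[avro] package has to be installed on the executors (missing executor-side dependencies are one common reason containers keep failing as in the warning above):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer

# Hypothetical Schema Registry URL; decode_message needs the registry to
# look up the writer schema embedded in each Confluent-framed message.
schema_registry = CachedSchemaRegistryClient("http://schema-registry:8081")
serializer = MessageSerializer(schema_registry)

def avro_decoder(value):
    # Kafka hands the raw bytes to the decoder; the serializer returns a dict.
    return serializer.decode_message(value) if value is not None else None

sc = SparkContext(appName="avro-kafka-stream")
ssc = StreamingContext(sc, 10)

kvs = KafkaUtils.createDirectStream(
    ssc, ["avro-topic"], {"metadata.broker.list": "broker:9092"},
    valueDecoder=avro_decoder)

kvs.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()
```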

Reading Avro messages from Kafka in Spark Streaming / Structured Streaming

Deadly submitted on 2020-01-15 10:07:09
Question: I am using pyspark for the first time. Spark version: 2.3.0. Kafka version: 2.2.0. I have a Kafka producer which sends nested data in Avro format, and I am trying to write code in Spark Streaming / Structured Streaming in pyspark that will deserialize the Avro coming from Kafka into a dataframe, do transformations, and write it in Parquet format to S3. I was able to find Avro converters in Spark/Scala, but support in pyspark has not yet been added. How do I do the same in pyspark? Thanks.
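Since Spark 2.3's pyspark has no built-in from_avro, one workaround is to decode the Avro bytes in a Python UDF (here with fastavro) and continue with ordinary DataFrame transformations before writing Parquet to S3. The schema, topic, broker and S3 paths below are placeholders:

```python
import io
import json

import fastavro
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical writer schema standing in for the real nested Avro schema.
value_schema = {
    "type": "record",
    "name": "Event",
    "fields": [{"name": "id", "type": "string"},
               {"name": "amount", "type": "double"}],
}

def decode_avro(raw_bytes):
    if raw_bytes is None:
        return None
    # If the producer uses the Confluent wire format, strip the 5-byte
    # magic-byte/schema-id prefix first: raw_bytes = raw_bytes[5:]
    record = fastavro.schemaless_reader(io.BytesIO(raw_bytes), value_schema)
    return json.dumps(record)

decode_avro_udf = udf(decode_avro, StringType())

spark = SparkSession.builder.appName("avro-kafka-to-parquet").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "avro-topic")
       .load())

decoded = raw.select(decode_avro_udf("value").alias("json_value"))

query = (decoded.writeStream
         .format("parquet")
         .option("path", "s3a://bucket/events/")
         .option("checkpointLocation", "s3a://bucket/checkpoints/events/")
         .start())
query.awaitTermination()
```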

Online (incremental) logistic regression in Spark [duplicate]

笑着哭i submitted on 2020-01-15 08:16:10
Question: This question already has answers here: Whether we can update existing model in spark-ml/spark-mllib? (2 answers). Closed 11 months ago. In Spark MLlib (the RDD-based API) there is StreamingLogisticRegressionWithSGD for incremental training of a logistic regression model. However, this class has been deprecated and offers little functionality (e.g. no access to model coefficients and output probabilities). In Spark ML (the DataFrame-based API) I only find the class LogisticRegression, having only …
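For reference, a minimal sketch of the RDD-based streaming estimator the question mentions; the queueStream with two hand-made LabeledPoints stands in for a real labelled feature stream, and latestModel().weights is the closest it gets to exposing coefficients:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="online-logreg")
ssc = StreamingContext(sc, 5)

num_features = 3
training_batches = [sc.parallelize([
    LabeledPoint(1.0, Vectors.dense([1.0, 0.5, -0.2])),
    LabeledPoint(0.0, Vectors.dense([-0.7, 0.1, 0.9])),
])]
training_stream = ssc.queueStream(training_batches)

model = StreamingLogisticRegressionWithSGD(stepSize=0.1, numIterations=20)
model.setInitialWeights(Vectors.dense([0.0] * num_features))

# trainOn updates the model with every micro-batch; latestModel() exposes
# the current weight vector after each update.
model.trainOn(training_stream)
training_stream.foreachRDD(
    lambda rdd: print(model.latestModel().weights))

ssc.start()
ssc.awaitTermination(timeout=30)
```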

Running in “deadlock” while doing streaming aggregations from Kafka

别来无恙 submitted on 2020-01-15 07:12:52
Question: I posted another question on a similar topic a few days ago: How to load history data when starting Spark Streaming process, and calculate running aggregations. I managed to get at least a "working" solution now, meaning that the process itself seems to work correctly. But, as I am a complete beginner concerning Spark, I seem to have missed some things about how to build this kind of application correctly (performance-/computation-wise)... What I want to do: Load history data from …
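One sketch of seeding running aggregations with history data: load the historical per-key totals once at startup and hand them to updateStateByKey as the initialRDD, so the stream only has to carry the deltas. The file path, topic, broker and the (key, amount) record layout are assumptions:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="history-plus-streaming-aggregation")
ssc = StreamingContext(sc, 10)
ssc.checkpoint("/tmp/checkpoints/history-agg")  # required for stateful ops

# Historical per-key totals, e.g. precomputed offline and stored as CSV.
history_rdd = (sc.textFile("/data/history.csv")
               .map(lambda line: line.split(","))
               .map(lambda parts: (parts[0], float(parts[1]))))

stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker:9092"})

live_pairs = (stream.map(lambda kv: kv[1].split(","))
              .map(lambda parts: (parts[0], float(parts[1]))))

def update_running_total(new_values, running_total):
    return sum(new_values) + (running_total or 0.0)

# initialRDD folds the history into the running state at startup, so every
# later micro-batch only adds its own contribution.
running_totals = live_pairs.updateStateByKey(
    update_running_total, initialRDD=history_rdd)

running_totals.pprint()

ssc.start()
ssc.awaitTermination()
```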
