spark-streaming

Access Spark broadcast variable in different classes

Submitted by 懵懂的女人 on 2020-02-27 08:23:06
Question: I am broadcasting a value in a Spark Streaming application, but I am not sure how to access that variable in a class other than the one where it was broadcast. My code looks as follows:

object AppMain {
  def main(args: Array[String]) {
    //...
    val broadcastA = sc.broadcast(a)
    //..
    lines.foreachRDD(rdd => {
      val obj = AppObject1
      rdd.filter(p => obj.apply(p))
      rdd.count
    })
  }
}

object AppObject1 {
  def apply(str: String): Boolean = {
    AnotherObject.process(str)
  }
}

object AnotherObject {
  // I want to use
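
A common way to make a broadcast value visible outside the class that created it is to pass the Broadcast handle around explicitly and call .value only inside the closure. Below is a minimal, self-contained sketch of that pattern; the object names, the lookup set, and the filtering logic are illustrative assumptions, not the asker's actual code.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object AnotherObjectSketch {
  // Receive the broadcast handle as a parameter instead of reaching for a global.
  def process(str: String, allowed: Broadcast[Set[String]]): Boolean =
    allowed.value.contains(str)
}

object AppMainSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BroadcastSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val broadcastA = sc.broadcast(Set("spark", "streaming")) // hypothetical lookup set

    val words = sc.parallelize(Seq("spark", "flink", "streaming"))
    // The Broadcast handle itself is small and serializable, so it can be captured
    // by the closure; .value is resolved on the executors.
    val kept = words.filter(w => AnotherObjectSketch.process(w, broadcastA))
    println(kept.count())
    sc.stop()
  }
}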

Spark Structured Streaming + Kafka Integration: MicroBatchExecution PartitionOffsets Error

Submitted by 萝らか妹 on 2020-02-21 06:10:26
Question: I am using Spark Structured Streaming to process incoming and outgoing data streams from and to Apache Kafka, using the Scala code below. I can successfully read the data stream from the Kafka source, but while trying to write the stream to the Kafka sink I get the following error:

ERROR MicroBatchExecution:91 - Query [id = 234750ca-d416-4182-b3cc-4e2c1f922724, runId = 4c4b0931-9876-456f-8d56-752623803332] terminated with error
java.lang.IllegalArgumentException: Expected e.g. {
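
The post's code is cut off above. For context, a typical Kafka source-to-sink pipeline in Structured Streaming looks roughly like the sketch below; the broker address, topic names, and checkpoint path are placeholders, and this is not the asker's actual configuration.

import org.apache.spark.sql.SparkSession

object KafkaRoundTripSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaRoundTripSketch").getOrCreate()

    // Read from the source topic; key and value arrive as binary columns.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input-topic")
      .load()

    // The Kafka sink expects a string (or binary) "value" column and needs a
    // checkpoint location to track offsets between micro-batches.
    val query = input.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "output-topic")
      .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
      .start()

    query.awaitTermination()
  }
}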

How to fix “org.apache.spark.shuffle.FetchFailedException: Failed to connect” in NetworkWordCount Spark Streaming application?

Submitted by 旧巷老猫 on 2020-02-03 09:35:12
Question: I try to submit the example Apache Spark Streaming application:

/opt/spark/bin/spark-submit --class org.apache.spark.examples.streaming.NetworkWordCount \
  --deploy-mode cluster --master yarn --driver-memory 2g --executor-memory 2g \
  /opt/spark/examples/jars/spark-examples_2.11-2.0.0.jar 172.29.74.68 9999

As parameters I pass the master IP and a local port (in another console nc -lk 9999 is running). And I always get the error:

WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 50, iws1): FetchFailed
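
For reference, the class being submitted is Spark's bundled NetworkWordCount example, which reads lines from a TCP socket and counts words per batch. The condensed Scala sketch below paraphrases what it does; it is not the exact bundled source.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // args(0) is the host running `nc -lk`, args(1) is its port.
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}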

Unable to deserialize ActorRef to send result to different Actor

Submitted by 自作多情 on 2020-02-02 04:30:10
Question: I am starting to use Spark Streaming to process a real-time data feed. In my scenario I have an Akka actor receiver using "with ActorHelper", then my Spark job does some mapping and transformation, and then I want to send the result to another actor. My issue is the last part: when trying to send to the other actor, Spark raises an exception:

15/02/20 16:43:16 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.IllegalStateException: Trying to
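
The exception text is cut off above, but this class of failure comes from an ActorRef being captured in a closure and shipped to executors, where it cannot be deserialized. One workaround, sketched below, is to bring the results back to the driver and send them from there so the ActorRef never leaves the JVM that owns it; the stream and actor names are hypothetical, and this is only one possible approach, not the asker's final solution.

import akka.actor.ActorRef
import org.apache.spark.streaming.dstream.DStream

// resultActor lives on the driver; sending from driver-side code inside foreachRDD
// avoids serializing the ActorRef into an executor task.
def forwardToActor(results: DStream[String], resultActor: ActorRef): Unit = {
  results.foreachRDD { rdd =>
    // collect() returns the batch to the driver, so the tell (!) below never crosses JVMs.
    // Suitable only when each batch's result is small enough to hold on the driver.
    rdd.collect().foreach(result => resultActor ! result)
  }
}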

How does unbound table work in spark structured streaming

Submitted by 落花浮王杯 on 2020-01-30 08:58:08
Question: Take word count as an example: the application starts up and runs for a long time, and when it receives the word "Spark" there is a row (Spark,1) in the result table. After the application has been running for a day or even a week, it receives "Spark" again, so the result table should then have a row (Spark,2). I am just using this scenario to raise the question: how does the unbounded table keep the state of the data it receives, since the state could be huge after the application runs
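
For concreteness, the running aggregation the question describes corresponds to a query like the sketch below. With outputMode("complete"), Spark keeps only the running counts (the "result table") in its internal state store across micro-batches; it does not literally materialize an ever-growing table of all input rows. The socket source, host, and port are placeholders.

import org.apache.spark.sql.SparkSession

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamingWordCountSketch").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // groupBy + count is the stateful aggregation; only the aggregated counts,
    // not the raw input rows, are carried between batches.
    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}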

Spark FileStreaming issue

Submitted by 社会主义新天地 on 2020-01-30 08:25:32
Question: I am trying a simple file streaming example using Spark Streaming (spark-streaming_2.10, version 1.5.1):

public class DStreamExample {
    public static void main(final String[] args) {
        final SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("SparkJob");
        sparkConf.setMaster("local[4]"); // for local
        final JavaSparkContext sc = new JavaSparkContext(sparkConf);
        final JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));
        final JavaDStream<String> lines = ssc.textFileStream

Batch lookup data for Spark streaming

Submitted by 时间秒杀一切 on 2020-01-30 08:13:11
Question: I need to look up some data in a Spark Streaming job from a file on HDFS. This data is fetched once a day by a batch job. Is there a "design pattern" for such a task? How can I reload the data in memory (a hashmap) immediately after the daily update? How can I serve the streaming job continuously while this lookup data is being fetched?

Answer 1: One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream: val
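
The answer's stateful-stream code is cut off above. A different, commonly used pattern for the same problem is to check on the driver, once per batch inside transform, whether the daily file should be re-read and rebroadcast. The sketch below illustrates that idea; the loadLookup helper, the comma-separated file format, and the 24-hour refresh interval are assumptions, not part of the original answer.

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

object LookupRefreshSketch {
  @volatile private var lookup: Broadcast[Map[String, String]] = _
  @volatile private var loadedAt: Long = 0L

  // Hypothetical loader: read the daily HDFS file into a plain Map on the driver.
  private def loadLookup(sc: SparkContext, path: String): Map[String, String] =
    sc.textFile(path).map { line =>
      val Array(k, v) = line.split(",", 2)
      k -> v
    }.collectAsMap().toMap

  def enrich(sc: SparkContext, events: DStream[String], path: String): DStream[(String, Option[String])] =
    events.transform { rdd =>
      // transform's body runs on the driver once per batch, so it is a safe place
      // to decide whether the daily file should be reloaded and rebroadcast.
      val now = System.currentTimeMillis()
      if (lookup == null || now - loadedAt > 24 * 60 * 60 * 1000L) {
        lookup = sc.broadcast(loadLookup(sc, path))
        loadedAt = now
      }
      val current = lookup
      rdd.map(key => (key, current.value.get(key)))
    }
}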

Spark Stream - 'utf8' codec can't decode bytes

Submitted by 这一生的挚爱 on 2020-01-25 09:07:05
Question: I'm fairly new to stream programming. We have a Kafka stream which uses Avro, and I want to connect it to a Spark stream. I used the code below:

kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
lines = kvs.map(lambda x: x[1])

I got the error below:

return s.decode('utf-8')
File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 57-58:
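
The error comes from trying to interpret Avro-encoded bytes as UTF-8 strings. The question uses PySpark, but the same idea in Scala (kept in Scala to match the other snippets on this page) is to ask the old direct-stream API for raw byte arrays and run your own Avro deserialization; the decodeAvro step below is left as a commented placeholder for whatever Avro reader you use.

import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def rawKafkaStream(ssc: StreamingContext, brokers: String, topic: String) = {
  val kafkaParams = Map("metadata.broker.list" -> brokers)
  // Ask for Array[Byte] values so Spark never tries to UTF-8 decode the Avro payload.
  val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
    ssc, kafkaParams, Set(topic))
  // decodeAvro is a placeholder for your Avro deserializer (e.g. a SpecificDatumReader).
  stream.map { case (_, valueBytes) => valueBytes /* decodeAvro(valueBytes) */ }
}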

How to control processing of spark-stream while there is no data in Kafka topic

Submitted by 此生再无相见时 on 2020-01-25 06:48:50
Question: I am using spark-sql 2.4.1, spark-cassandra-connector_2.11-2.4.1.jar and Java 8. I have a Cassandra table like this:

CREATE TABLE company(company_id int, start_date date, company_name text, PRIMARY KEY (company_id, start_date)) WITH CLUSTERING ORDER BY (start_date DESC);

The start_date field here is a derived field, which is calculated in the business logic. I have spark-sql streaming code in which I call the mapFunction below:

public static MapFunction<Company, CompanyTransformed> mapFunInsertCompany =