spark-streaming

Access Spark broadcast variable in different classes

Submitted by 懵懂的女人 on 2020-02-27 08:23:06
Question: I am broadcasting a value in a Spark Streaming application, but I am not sure how to access that variable in a class other than the one where it was broadcast. My code looks as follows:

object AppMain {
  def main(args: Array[String]) {
    //...
    val broadcastA = sc.broadcast(a)
    //..
    lines.foreachRDD(rdd => {
      val obj = AppObject1
      rdd.filter(p => obj.apply(p))
      rdd.count
    })
  }
}

object AppObject1 {
  def apply(str: String): Boolean = {
    AnotherObject.process(str)
  }
}

object AnotherObject {
  // I want to use
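
A common way to make a broadcast value visible outside the class that created it is to pass the Broadcast handle around explicitly and call .value only inside the closure. Below is a minimal, self-contained sketch of that pattern; the object names, the lookup set, and the filtering logic are illustrative assumptions, not the asker's actual code.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object AnotherObjectSketch {
  // Receive the broadcast handle as a parameter instead of reaching for a global.
  def process(str: String, allowed: Broadcast[Set[String]]): Boolean =
    allowed.value.contains(str)
}

object AppMainSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BroadcastSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val broadcastA = sc.broadcast(Set("spark", "streaming")) // hypothetical lookup set

    val words = sc.parallelize(Seq("spark", "flink", "streaming"))
    // The Broadcast handle itself is small and serializable, so it can be captured
    // by the closure; .value is resolved on the executors.
    val kept = words.filter(w => AnotherObjectSketch.process(w, broadcastA))
    println(kept.count())
    sc.stop()
  }
}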

Spark Structured Streaming + Kafka Integration: MicroBatchExecution PartitionOffsets Error

Submitted by 萝らか妹 on 2020-02-21 06:10:26
Question: I am using Spark Structured Streaming to process incoming and outgoing data streams from and to Apache Kafka, using the Scala code below. I can successfully read the data stream from the Kafka source, but while trying to write the stream to the Kafka sink I get the following error:

ERROR MicroBatchExecution:91 - Query [id = 234750ca-d416-4182-b3cc-4e2c1f922724, runId = 4c4b0931-9876-456f-8d56-752623803332] terminated with error
java.lang.IllegalArgumentException: Expected e.g. {
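
The post's code is cut off above. For context, a typical Kafka source-to-sink pipeline in Structured Streaming looks roughly like the sketch below; the broker address, topic names, and checkpoint path are placeholders, and this is not the asker's actual configuration.

import org.apache.spark.sql.SparkSession

object KafkaRoundTripSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaRoundTripSketch").getOrCreate()

    // Read from the source topic; key and value arrive as binary columns.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input-topic")
      .load()

    // The Kafka sink expects a string (or binary) "value" column and needs a
    // checkpoint location to track offsets between micro-batches.
    val query = input.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "output-topic")
      .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
      .start()

    query.awaitTermination()
  }
}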

How to fix “org.apache.spark.shuffle.FetchFailedException: Failed to connect” in NetworkWordCount Spark Streaming application?

Submitted by 旧巷老猫 on 2020-02-03 09:35:12
Question: I try to submit the example Apache Spark Streaming application:

/opt/spark/bin/spark-submit --class org.apache.spark.examples.streaming.NetworkWordCount \
  --deploy-mode cluster --master yarn --driver-memory 2g --executor-memory 2g \
  /opt/spark/examples/jars/spark-examples_2.11-2.0.0.jar 172.29.74.68 9999

As parameters I pass the master IP and a local port (in another console nc -lk 9999 is running). And I always get the error:

WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 50, iws1): FetchFailed
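
For reference, the class being submitted is Spark's bundled NetworkWordCount example, which reads lines from a TCP socket and counts words per batch. The condensed Scala sketch below paraphrases what it does; it is not the exact bundled source.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // args(0) is the host running `nc -lk`, args(1) is its port.
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}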

Unable to deserialize ActorRef to send result to different Actor

Submitted by 自作多情 on 2020-02-02 04:30:10
Question: I am starting to use Spark Streaming to process a real-time data feed. In my scenario I have an Akka actor receiver using "with ActorHelper", then my Spark job does some mapping and transformation, and then I want to send the result to another actor. My issue is the last part: when trying to send to the other actor, Spark raises an exception:

15/02/20 16:43:16 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.IllegalStateException: Trying to
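
The exception text is cut off above, but this class of failure comes from an ActorRef being captured in a closure and shipped to executors, where it cannot be deserialized. One workaround, sketched below, is to bring the results back to the driver and send them from there so the ActorRef never leaves the JVM that owns it; the stream and actor names are hypothetical, and this is only one possible approach, not the asker's final solution.

import akka.actor.ActorRef
import org.apache.spark.streaming.dstream.DStream

// resultActor lives on the driver; sending from driver-side code inside foreachRDD
// avoids serializing the ActorRef into an executor task.
def forwardToActor(results: DStream[String], resultActor: ActorRef): Unit = {
  results.foreachRDD { rdd =>
    // collect() returns the batch to the driver, so the tell (!) below never crosses JVMs.
    // Suitable only when each batch's result is small enough to hold on the driver.
    rdd.collect().foreach(result => resultActor ! result)
  }
}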

How does unbound table work in spark structured streaming

Submitted by 落花浮王杯 on 2020-01-30 08:58:08
Question: Take word count as an example: the application starts up and runs for a long time, and when it receives the word "Spark" there is a row (Spark,1) in the result table. After the application has been running for a day or even a week, it receives "Spark" again, so the result table should then have a row (Spark,2). I am just using this scenario to raise the question: how does the unbounded table keep the state of the data it receives, since the state could be huge after the application runs
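
For concreteness, the running aggregation the question describes corresponds to a query like the sketch below. With outputMode("complete"), Spark keeps only the running counts (the "result table") in its internal state store across micro-batches; it does not literally materialize an ever-growing table of all input rows. The socket source, host, and port are placeholders.

import org.apache.spark.sql.SparkSession

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamingWordCountSketch").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // groupBy + count is the stateful aggregation; only the aggregated counts,
    // not the raw input rows, are carried between batches.
    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}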

Spark FileStreaming issue

Submitted by 社会主义新天地 on 2020-01-30 08:25:32
Question: I am trying a simple file streaming example using Spark Streaming (spark-streaming_2.10, version 1.5.1):

public class DStreamExample {
    public static void main(final String[] args) {
        final SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("SparkJob");
        sparkConf.setMaster("local[4]"); // for local
        final JavaSparkContext sc = new JavaSparkContext(sparkConf);
        final JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));
        final JavaDStream<String> lines = ssc.textFileStream

Batch lookup data for Spark streaming

Submitted by 时间秒杀一切 on 2020-01-30 08:13:11
Question: I need to look up some data in a Spark Streaming job from a file on HDFS. This data is fetched once a day by a batch job. Is there a "design pattern" for such a task? How can I reload the data in memory (a hashmap) immediately after the daily update? How can I serve the streaming job continuously while this lookup data is being fetched?

Answer 1: One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream: val
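
The answer's stateful-stream code is cut off above. A different, commonly used pattern for the same problem is to check on the driver, once per batch inside transform, whether the daily file should be re-read and rebroadcast. The sketch below illustrates that idea; the loadLookup helper, the comma-separated file format, and the 24-hour refresh interval are assumptions, not part of the original answer.

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

object LookupRefreshSketch {
  @volatile private var lookup: Broadcast[Map[String, String]] = _
  @volatile private var loadedAt: Long = 0L

  // Hypothetical loader: read the daily HDFS file into a plain Map on the driver.
  private def loadLookup(sc: SparkContext, path: String): Map[String, String] =
    sc.textFile(path).map { line =>
      val Array(k, v) = line.split(",", 2)
      k -> v
    }.collectAsMap().toMap

  def enrich(sc: SparkContext, events: DStream[String], path: String): DStream[(String, Option[String])] =
    events.transform { rdd =>
      // transform's body runs on the driver once per batch, so it is a safe place
      // to decide whether the daily file should be reloaded and rebroadcast.
      val now = System.currentTimeMillis()
      if (lookup == null || now - loadedAt > 24 * 60 * 60 * 1000L) {
        lookup = sc.broadcast(loadLookup(sc, path))
        loadedAt = now
      }
      val current = lookup
      rdd.map(key => (key, current.value.get(key)))
    }
}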

Spark Stream - 'utf8' codec can't decode bytes

Submitted by 这一生的挚爱 on 2020-01-25 09:07:05
Question: I'm fairly new to stream programming. We have a Kafka stream which uses Avro, and I want to connect it to a Spark stream. I used the code below:

kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
lines = kvs.map(lambda x: x[1])

I got the error below:

return s.decode('utf-8')
File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 57-58:
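
The error comes from trying to interpret Avro-encoded bytes as UTF-8 strings. The question uses PySpark, but the same idea in Scala (kept in Scala to match the other snippets on this page) is to ask the old direct-stream API for raw byte arrays and run your own Avro deserialization; the decodeAvro step below is left as a commented placeholder for whatever Avro reader you use.

import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def rawKafkaStream(ssc: StreamingContext, brokers: String, topic: String) = {
  val kafkaParams = Map("metadata.broker.list" -> brokers)
  // Ask for Array[Byte] values so Spark never tries to UTF-8 decode the Avro payload.
  val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
    ssc, kafkaParams, Set(topic))
  // decodeAvro is a placeholder for your Avro deserializer (e.g. a SpecificDatumReader).
  stream.map { case (_, valueBytes) => valueBytes /* decodeAvro(valueBytes) */ }
}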

How to control processing of spark-stream while there is no data in Kafka topic

Submitted by 此生再无相见时 on 2020-01-25 06:48:50
Question: I am using spark-sql 2.4.1, spark-cassandra-connector_2.11-2.4.1.jar and Java 8. I have a Cassandra table like this:

CREATE TABLE company(company_id int, start_date date, company_name text, PRIMARY KEY (company_id, start_date)) WITH CLUSTERING ORDER BY (start_date DESC);

The start_date field here is a derived field, which is calculated in the business logic. I have spark-sql streaming code in which I call the mapFunction below:

public static MapFunction<Company, CompanyTransformed> mapFunInsertCompany =