spark-streaming

Using Spark SQL with Spark Streaming

Submitted by 妖精的绣舞 on 2019-12-11 05:38:40
Question: Trying to make sense of Spark SQL with respect to Spark Structured Streaming. A Spark session reads events from a Kafka topic, aggregates the data into counts grouped by different column names, and prints them to the console. The raw input data is structured like this:

+--------------+--------------------+----------+----------+-------+-------------------+--------------------+----------+
|. sourceTypes| Guid| platform|datacenter|pagesId| eventTimestamp| Id1234| Id567890|
+--------------+--------------------+-------
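
A minimal sketch of that shape, assuming placeholder broker and topic names and using two of the columns from the truncated sample above; the JSON parsing is an assumption about the payload format:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("kafka-counts").getOrCreate()
import spark.implicits._

// Read raw events from Kafka (broker and topic names are placeholders).
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Pull two of the columns shown in the sample out of the JSON payload.
val events = raw.selectExpr("CAST(value AS STRING) AS json")
  .select(
    get_json_object($"json", "$.platform").as("platform"),
    get_json_object($"json", "$.datacenter").as("datacenter"))

// Counts grouped by different column names, printed to the console.
events.groupBy($"platform", $"datacenter").count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()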

Transform DStream RDD using external data

Submitted by 冷眼眸甩不掉的悲伤 on 2019-12-11 05:31:37
Question: We are developing a Spark Streaming ETL application that will source data from Kafka, apply the necessary transformations, and load the data into MongoDB. The data received from Kafka is in JSON format. The transformations are applied to each element (a JSON string) of the RDD based on lookup data fetched from MongoDB. Since the lookup data changes, I need to fetch it for every batch interval. The lookup data is read from MongoDB using SqlContext.read. I was not able to use SqlContext.read
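
One common sketch of this pattern: do the lookup read inside transform(), which runs on the driver once per batch, then broadcast the result to the executors. loadLookup() and applyTransformation() below are hypothetical stand-ins for the MongoDB read and the per-record logic:

// Hypothetical stand-in for the SqlContext/MongoDB read; returns a small lookup map.
def loadLookup(): Map[String, String] = Map.empty

// Hypothetical per-record transformation using the lookup data.
def applyTransformation(json: String, lookup: Map[String, String]): String = json

val transformed = jsonDStream.transform { rdd =>
  // transform() is evaluated on the driver for every batch interval,
  // so the lookup data is refreshed each batch.
  val lookup = rdd.sparkContext.broadcast(loadLookup())
  rdd.map(json => applyTransformation(json, lookup.value))
}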

Metrics System not recognizing Custom Source/Sink in application jar

Submitted by 一个人想着一个人 on 2019-12-11 05:29:48
Question: Follow-up from here. I've added a custom Source and Sink in my application jar and found a way to get a static, fixed metrics.properties onto the standalone cluster nodes. When I launch my application, I give the static path - spark.metrics.conf="/fixed-path/to/metrics.properties". Despite my custom source/sink being in my code/fat jar, I get a ClassNotFoundException on CustomSink. My fat jar (with the custom Source/Sink code in it) is on HDFS with read access for all. So here's what all I've
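
For reference, a sketch of the wiring involved, with placeholder class names standing in for the classes in the fat jar:

# metrics.properties (sketch; class names are placeholders)
*.source.custom.class=com.example.metrics.CustomSource
*.sink.custom.class=com.example.metrics.CustomSink

The metrics system instantiates these classes by reflection when the driver and executors start, so the usual culprit for the ClassNotFoundException is that the fat jar is not yet on their classpaths at that point; the class may need to be made available via spark.driver.extraClassPath / spark.executor.extraClassPath rather than only through the application jar.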

spark throws java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition

Submitted by 做~自己de王妃 on 2019-12-11 05:19:44
Question: When I use the spark-submit command in a Cloudera YARN environment, I get this kind of exception:

java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition
  at java.lang.Class.getDeclaredMethods0(Native Method)
  at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
  at java.lang.Class.getDeclaredMethods(Class.java:1975)
  at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.com$fasterxml$jackson$module$scala$introspect$BeanIntrospector$$listMethods$1(BeanIntrospector.scala:93)
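
kafka.common.TopicAndPartition lives in the Kafka core jar rather than in the Spark artifacts themselves, so the usual fix is making sure that jar reaches the driver and executors, either by bundling it into the fat jar or shipping it with --jars. A sketch of the build dependencies (versions are assumptions and should match the cluster):

// build.sbt (sketch; versions are assumptions)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.1.0",
  "org.apache.kafka" %% "kafka" % "0.8.2.1"
)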

Spark Checkpoint doesn't remember state (Java HDFS)

Submitted by 女生的网名这么多〃 on 2019-12-11 05:15:23
Question: Already looked at "Spark streaming not remembering previous state" but it doesn't help. Also looked at http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing but can't find JavaStreamingContextFactory, although I am using Spark Streaming 2.11 v2.0.1. My code works fine, but when I restart it... it won't remember the last checkpoint...

Function0<JavaStreamingContext> scFunction = new Function0<JavaStreamingContext>() {
    @Override
    public JavaStreamingContext call() throws
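
JavaStreamingContextFactory was removed in Spark 2.x; getOrCreate now takes a Function0, which the code above already uses. The usual reason state is lost on restart is that the DStream graph is built outside the factory function. A minimal sketch of the pattern (Scala for brevity; paths and the pipeline helper are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")   // placeholder name
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // All DStream setup (sources, updateStateByKey/mapWithState, output ops)
  // must happen inside this function, or a restarted job loses its lineage.
  buildPipeline(ssc)                                          // hypothetical setup helper
  ssc
}

// Fresh start: calls createContext(). Restart: rebuilds the context from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()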

Why is the number of executors shown in the Spark Web UI and REST UI always constant when dynamic allocation is enabled?

Submitted by 倖福魔咒の on 2019-12-11 05:14:36
Question: I am running a Spark Streaming application with a batch duration of 1 minute, and I have dynamic allocation enabled with executorIdleRunTime equal to 5 seconds. I can see in the event timeline that about 4000 executors are allocated and removed, but under Executors I always see 102 executors. What can be the reason for that? The REST API gives the same data, but the executor count there is different. I have attached the respective snapshots for a better understanding of the question. Source: https:/
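
For reference, a sketch of the standard dynamic-allocation settings this setup usually maps onto (values are only examples; "executorIdleRunTime" in the question presumably corresponds to spark.dynamicAllocation.executorIdleTimeout):

spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.executorIdleTimeout=5s
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=100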

Saving the data from SparkStreaming Workers to Database

Submitted by 萝らか妹 on 2019-12-11 05:07:54
Question: In Spark Streaming, should we offload the saving part to another layer, given that the streaming context is not available when we use the Spark Cassandra Connector if our database is Cassandra? Moreover, even if we use some other database to save our data, we need to create a connection on the worker every time we process a batch of RDDs, the reason being that connection objects are not serializable. Is it recommended to create/close connections on the workers? It would make our system tightly coupled with the
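
The pattern usually recommended here is to create connections on the workers, one per partition (ideally drawn from a lazily initialized pool), rather than trying to serialize them from the driver. A sketch with a hypothetical ConnectionPool:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Runs on the worker: one connection per partition instead of one per record.
    val connection = ConnectionPool.getConnection()          // hypothetical, lazily created pool
    records.foreach(record => connection.insert(record))     // insert() stands in for the real write
    ConnectionPool.returnConnection(connection)              // return to the pool for reuse
  }
}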

Spark streaming 2.0.0 - freezes after several days under load

Submitted by 半世苍凉 on 2019-12-11 04:26:37
Question: We are running on AWS EMR 5.0.0 with Spark 2.0.0, consuming from a 125-shard Kinesis stream. We feed 19k events/s using 2 message producers, each message about 1 KB in size, and consume using a cluster of 20 machines. The code has flatMap(), groupByKey(), persist(StorageLevel.MEMORY_AND_DISK_SER_2()) and repartition(19), then stores to S3 using foreachRDD(). Using backpressure and Kryo: sparkConf.set("spark.streaming.backpressure.enabled", "true"); sparkConf.set("spark.serializer", "org.apache
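
A compressed sketch of that configuration; the serializer value is shown here with the standard Kryo class, and registering the event classes is the usual companion step when serialized storage levels are in play:

import org.apache.spark.SparkConf

case class MyEvent(id: String, payload: String)   // placeholder for the real event type

val sparkConf = new SparkConf()
  .setAppName("kinesis-streaming")   // placeholder name
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // With MEMORY_AND_DISK_SER_2 and groupByKey at ~19k events/s, registering the
  // event classes keeps Kryo from writing full class names for every record.
  .registerKryoClasses(Array(classOf[MyEvent]))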

Can SparkContext.textFile be used with a custom receiver?

Submitted by 扶醉桌前 on 2019-12-11 04:23:27
Question: I'm trying to implement a streaming job that uses a custom receiver to read messages from SQS. Each message contains a single reference to an S3 file, which I would then like to read, parse, and store as ORC. Here is the code I have so far:

val sc = new SparkContext(conf)
val streamContext = new StreamingContext(sc, Seconds(5))

val sqs = streamContext.receiverStream(new SQSReceiver("events-elb")
  .credentials("accessKey", "secretKey")
  .at(Regions.US_EAST_1)
  .withTimeout(5))

val s3File = sqs.map
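
One workable sketch of the rest of that job: SparkContext can only be used on the driver, so rather than calling sc.textFile inside sqs.map (which runs on the executors), collect the S3 paths per batch in foreachRDD and read them driver-side. parseSqsMessage, the SparkSession named spark, and the output bucket are placeholders:

sqs.foreachRDD { rdd =>
  // One S3 path per SQS message, so collecting to the driver stays small.
  val s3Paths = rdd.map(parseSqsMessage _).collect()
  s3Paths.foreach { path =>
    val lines = spark.read.textFile(path)                      // spark: a SparkSession built from sc
    // parse as needed, then persist as ORC
    lines.write.mode("append").orc("s3://output-bucket/events/")
  }
}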

Create a DataFrame in Spark Stream

Submitted by 痴心易碎 on 2019-12-11 04:22:49
Question: I've connected the Kafka stream to Spark. I've also trained an Apache Spark MLlib model to make predictions based on streamed text. My problem is that to get a prediction I need to pass a DataFrame.

//kafka stream
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

//load mlib model
val model = PipelineModel.load(modelPath)

stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    //to get a prediction need to pass DF
    val
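
A sketch of the usual workaround: build one DataFrame per micro-batch from the whole RDD inside foreachRDD (rather than per record inside rdd.foreach), then hand it to the pipeline model. The column name "text" is an assumption about what the trained pipeline expects:

import spark.implicits._   // spark: the SparkSession that loaded the model

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Convert the whole micro-batch at once; creating DataFrames inside
    // rdd.foreach would run on executors, where the SparkSession is unusable.
    val batchDF = rdd.map(_.value()).toDF("text")
    val predictions = model.transform(batchDF)
    predictions.show()
  }
}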