spark-streaming

Spark doesn't print outputs on the console within the map function

Submitted by 只谈情不闲聊 on 2019-12-31 01:56:10

Question: I have a simple Spark application running in cluster mode.

    val funcGSSNFilterHeader = (x: String) => {
      println(!x.contains("servedMSISDN"))
      !x.contains("servedMSISDN")
    }

    val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))
    val ggsnFileLines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "C:\\Users\\Mbazarganigilani\\Documents\\RA\\GGSN\\Files1", filterF, false)
    val ggsnArrays = ggsnFileLines
      .map(x => x._2.toString()).filter(x => funcGSSNFilterHeader(x))
    ggsnArrays …
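The short explanation is that the closure passed to map/filter runs on the executors, so its println output lands in the executors' stdout logs, not the driver console. A minimal Scala sketch (not the asker's exact code; the input path is a placeholder) of one way to see sampled values on the driver:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PrintOnDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PrintOnDriver")
    val ssc = new StreamingContext(conf, Seconds(4))
    // any text source would do; a watched directory keeps the sketch simple (hypothetical path)
    val lines = ssc.textFileStream("hdfs:///tmp/ggsn-input")
    val filtered = lines.filter(x => !x.contains("servedMSISDN"))
    filtered.foreachRDD { rdd =>
      // take() moves a small sample back to the driver, where println writes to the driver console
      rdd.take(10).foreach(println)
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```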

Spark stateful streaming job hangs at checkpointing to S3 after long uptime

Submitted by 坚强是说给别人听的谎言 on 2019-12-31 00:41:09

Question: I've been recently stress testing our Spark Streaming app. The stress test ingests about 20,000 messages/sec into Kafka, with message sizes varying between 200 bytes and 1 KB, and Spark Streaming reads batches every 4 seconds. Our Spark cluster runs version 1.6.1 with the standalone cluster manager, and we're using Scala 2.10.6 for our code. After about a 15-20 hour run, one of the executors that is initiating a checkpoint (done at a 40-second interval) gets stuck with the following stack …
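For context, a stateful job of this shape is typically wired roughly as in the sketch below. This is an assumed reconstruction, not the poster's code: the S3 bucket is a placeholder, a socket source stands in for Kafka, and S3A credentials are assumed to be configured elsewhere.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StatefulJobSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("StatefulJobSketch"), Seconds(4))
    ssc.checkpoint("s3a://some-bucket/checkpoints")   // hypothetical bucket; checkpoint data goes to S3

    // running count per key, kept in Spark's state store
    val updateFunc = (key: String, value: Option[Int], state: State[Int]) => {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (key, sum)
    }

    val words = ssc.socketTextStream("localhost", 9999)   // placeholder source standing in for Kafka
    val counts = words.map(w => (w, 1)).mapWithState(StateSpec.function(updateFunc))
    counts.checkpoint(Seconds(40))                        // 40-second checkpoint interval as described above
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```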

How can I write results of JavaPairDStream into output kafka topic on Spark Streaming?

Submitted by ε祈祈猫儿з on 2019-12-30 07:41:42

Question: I'm looking for a way to write a DStream into an output Kafka topic, only when the micro-batch RDDs actually spit out something. I'm using Spark Streaming and the spark-streaming-kafka connector in Java 8 (both latest versions), and I cannot figure it out. Thanks for the help.

Answer 1: If dStream contains data that you want to send to Kafka:

    dStream.foreachRDD(rdd -> {
      rdd.foreachPartition(iter -> {
        Producer producer = createKafkaProducer();
        while (iter.hasNext()) {
          sendToKafka(producer, iter.next());
        }
      });
    });

So, you …
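The answer above is Java; here is a sketch of the same idea in Scala (the dominant language on this page), with an added rdd.isEmpty guard so that empty micro-batches write nothing. The broker address and topic name are placeholders, not values from the question.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Write each non-empty micro-batch to Kafka, creating one producer per partition.
def writeToKafka(dStream: DStream[String]): Unit = {
  dStream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {                       // skip batches that produced nothing
      rdd.foreachPartition { iter =>
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        iter.foreach(msg => producer.send(new ProducerRecord[String, String]("output-topic", msg)))
        producer.close()
      }
    }
  }
}
```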

Process Spark Streaming rdd and store to single HDFS file

Submitted by 主宰稳场 on 2019-12-30 07:28:31

Question: I am using Kafka Spark Streaming to get streaming data.

    val lines = KafkaUtils.createDirectStream[Array[Byte], String, DefaultDecoder, StringDecoder](
      ssc, kafkaConf, Set(topic)).map(_._2)

I am using this DStream and processing RDDs:

    val output = lines.foreachRDD(rdd =>
      rdd.foreachPartition { partition =>
        partition.foreach { file => runConfigParser(file) }
      }
    )

runConfigParser is a Java method which parses a file and produces an output that I have to save in HDFS. So multiple nodes will process …
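One way to end up with a single HDFS file per batch is to run the parser inside a map (so its results stay in the RDD rather than being discarded by foreach) and collapse to one partition before saving. This is a sketch, not the asker's code: the output path is assumed and a trivial placeholder stands in for the Java runConfigParser.

```scala
import org.apache.spark.streaming.dstream.DStream

object SingleFilePerBatch {
  // placeholder standing in for the asker's Java runConfigParser
  def runConfigParser(file: String): String = file

  def saveParsedBatches(lines: DStream[String]): Unit = {
    lines.foreachRDD { (rdd, time) =>
      val parsed = rdd.map(runConfigParser)   // parsing stays distributed across the nodes
      if (!parsed.isEmpty()) {
        parsed.coalesce(1)                    // one partition => a single part- file for this batch
              .saveAsTextFile(s"hdfs:///output/parsed-${time.milliseconds}")   // assumed output path
      }
    }
  }
}
```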

how to set and get static variables from spark?

Submitted by 元气小坏坏 on 2019-12-29 06:22:53

Question: I have a class like this:

    public class Test {
      private static String name;

      public static String getName() { return name; }

      public static void setName(String name) { Test.name = name; }

      public static void print() { System.out.println(name); }
    }

In my Spark driver, I'm setting the name like this and calling the print() command:

    public final class TestDriver {
      public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("TestApp");
        // ...
        // ...
        Test …
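A static field set on the driver is generally not visible on the executors, since each executor JVM holds its own copy of the class; the common alternative is a broadcast variable. Below is a sketch of that alternative in Scala rather than the question's Java, with illustrative names only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastNameExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TestApp"))
    val nameBroadcast = sc.broadcast("Testname")   // set once on the driver
    sc.parallelize(1 to 3).foreach { _ =>
      // runs on the executors; the broadcast value is shipped to each of them
      println(nameBroadcast.value)
    }
    sc.stop()
  }
}
```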

How to fix “java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord” in Spark Streaming Kafka Consumer?

Submitted by 江枫思渺然 on 2019-12-28 14:57:09

Question: Spark 2.0.0, Apache Kafka 0.10.1.0, Scala 2.11.8. When I use Spark Streaming and Kafka integration with Kafka broker version 0.10.1.0 and the following Scala code, it fails with this exception:

    16/11/13 12:55:20 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord
    Serialization stack:
    - object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = …
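A common workaround is to map each ConsumerRecord to plain key/value pairs immediately after createDirectStream, so the non-serializable record never has to cross a serialization boundary (print, window, shuffle, and so on). A self-contained sketch against the spark-streaming-kafka-0-10 API, with placeholder broker, group, and topic names:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaValueOnly {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaValueOnly"), Seconds(4))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest"
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("input-topic"), kafkaParams))
    // extract plain fields before any operation that would serialize the records
    val pairs = stream.map(record => (record.key, record.value))
    pairs.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```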

reading json file in pyspark

Submitted by 一世执手 on 2019-12-28 03:11:05

Question: I'm new to PySpark. Below is my JSON file format from Kafka:

    {
      "header": { "platform": "atm", "version": "2.0" },
      "details": [
        { "abc": "3", "def": "4" },
        { "abc": "5", "def": "6" },
        { "abc": "7", "def": "8" }
      ]
    }

How can I read through the values of all "abc" and "def" in details and add them to a new list like [(1,2),(3,4),(5,6),(7,8)]? The new list will be used to create a Spark data frame. How can I do this in PySpark? I tried the code below.

    parsed = messages.map(lambda (k,v): json.loads(v))
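For comparison, the same extraction sketched in Scala (the question itself is PySpark): read the document with the DataFrame JSON reader and explode the details array into one row per pair. The file path is a placeholder, and the multiLine option is assumed to be available (Spark 2.2+).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object ReadDetails {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadDetails").getOrCreate()
    // multiLine is needed because the sample document spans several lines
    val df = spark.read.option("multiLine", "true").json("hdfs:///tmp/sample.json")
    val pairs = df.select(explode(col("details")).as("d"))
                  .select(col("d.abc"), col("d.def"))
    pairs.show()   // one row per (abc, def) pair, usable directly as a DataFrame
    spark.stop()
  }
}
```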

how to call oracle stored proc in spark?

Submitted by 荒凉一梦 on 2019-12-26 08:21:54

Question: In my Spark project I am using spark-sql-2.4.1v. As part of my code I need to call Oracle stored procs in my Spark job. How can I call Oracle stored procs?

Answer 1: You can try doing something like this, though I have never tried this personally in any implementation:

    query = "exec SP_NAME"
    empDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:oracle:thin:username/password@//hostname:portnumber/SID") \
        .option("dbtable", query) \
        .option("user", "db_user_name") \
        .option("password", …
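Since the DataFrame JDBC source is designed to read tables or queries rather than execute procedures, another commonly suggested route is to call the procedure directly over JDBC from the driver with a CallableStatement. A sketch in Scala with placeholder URL, credentials, and procedure name; the Oracle JDBC driver is assumed to be on the classpath.

```scala
import java.sql.DriverManager

object CallOracleProc {
  def main(args: Array[String]): Unit = {
    val url = "jdbc:oracle:thin:@//hostname:1521/SID"                 // placeholder connection string
    val conn = DriverManager.getConnection(url, "db_user_name", "db_password")
    try {
      val stmt = conn.prepareCall("{call SP_NAME(?)}")                // hypothetical one-argument proc
      stmt.setString(1, "some_input")
      stmt.execute()
      stmt.close()
    } finally {
      conn.close()
    }
  }
}
```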