spark-streaming

Spark doesn't print outputs on the console within the map function

Submitted by 只谈情不闲聊 on 2019-12-31 01:56:10

Question: I have a simple Spark application running in cluster mode.

    val funcGSSNFilterHeader = (x: String) => {
      println(!x.contains("servedMSISDN"))
      !x.contains("servedMSISDN")
    }

    val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))
    val ggsnFileLines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "C:\\Users\\Mbazarganigilani\\Documents\\RA\\GGSN\\Files1", filterF, false)
    val ggsnArrays = ggsnFileLines
      .map(x => x._2.toString()).filter(x => funcGSSNFilterHeader(x))
    ggsnArrays …
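The short explanation is that the closure passed to map/filter runs on the executors, so its println output lands in the executors' stdout logs, not the driver console. A minimal Scala sketch (not the asker's exact code; the input path is a placeholder) of one way to see sampled values on the driver:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PrintOnDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PrintOnDriver")
    val ssc = new StreamingContext(conf, Seconds(4))
    // any text source would do; a watched directory keeps the sketch simple (hypothetical path)
    val lines = ssc.textFileStream("hdfs:///tmp/ggsn-input")
    val filtered = lines.filter(x => !x.contains("servedMSISDN"))
    filtered.foreachRDD { rdd =>
      // take() moves a small sample back to the driver, where println writes to the driver console
      rdd.take(10).foreach(println)
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```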

Spark stateful streaming job hangs at checkpointing to S3 after long uptime

Submitted by 坚强是说给别人听的谎言 on 2019-12-31 00:41:09

Question: I've been recently stress testing our Spark Streaming app. The stress test ingests about 20,000 messages/sec into Kafka, with message sizes varying between 200 bytes and 1 KB, and Spark Streaming reads batches every 4 seconds. Our Spark cluster runs version 1.6.1 with the standalone cluster manager, and we're using Scala 2.10.6 for our code. After about a 15-20 hour run, one of the executors that is initiating a checkpoint (done at a 40-second interval) gets stuck with the following stack …
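For context, a stateful job of this shape is typically wired roughly as in the sketch below. This is an assumed reconstruction, not the poster's code: the S3 bucket is a placeholder, a socket source stands in for Kafka, and S3A credentials are assumed to be configured elsewhere.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StatefulJobSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("StatefulJobSketch"), Seconds(4))
    ssc.checkpoint("s3a://some-bucket/checkpoints")   // hypothetical bucket; checkpoint data goes to S3

    // running count per key, kept in Spark's state store
    val updateFunc = (key: String, value: Option[Int], state: State[Int]) => {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (key, sum)
    }

    val words = ssc.socketTextStream("localhost", 9999)   // placeholder source standing in for Kafka
    val counts = words.map(w => (w, 1)).mapWithState(StateSpec.function(updateFunc))
    counts.checkpoint(Seconds(40))                        // 40-second checkpoint interval as described above
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```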

How can I write results of JavaPairDStream into output kafka topic on Spark Streaming?

Submitted by ε祈祈猫儿з on 2019-12-30 07:41:42

Question: I'm looking for a way to write a DStream into an output Kafka topic, only when the micro-batch RDDs actually spit out something. I'm using Spark Streaming and the spark-streaming-kafka connector in Java 8 (both latest versions), and I cannot figure it out. Thanks for the help.

Answer 1: If dStream contains data that you want to send to Kafka:

    dStream.foreachRDD(rdd -> {
      rdd.foreachPartition(iter -> {
        Producer producer = createKafkaProducer();
        while (iter.hasNext()) {
          sendToKafka(producer, iter.next());
        }
      });
    });

So, you …
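The answer above is Java; here is a sketch of the same idea in Scala (the dominant language on this page), with an added rdd.isEmpty guard so that empty micro-batches write nothing. The broker address and topic name are placeholders, not values from the question.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Write each non-empty micro-batch to Kafka, creating one producer per partition.
def writeToKafka(dStream: DStream[String]): Unit = {
  dStream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {                       // skip batches that produced nothing
      rdd.foreachPartition { iter =>
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        iter.foreach(msg => producer.send(new ProducerRecord[String, String]("output-topic", msg)))
        producer.close()
      }
    }
  }
}
```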

Process Spark Streaming rdd and store to single HDFS file

Submitted by 主宰稳场 on 2019-12-30 07:28:31

Question: I am using Kafka Spark Streaming to get streaming data.

    val lines = KafkaUtils.createDirectStream[Array[Byte], String, DefaultDecoder, StringDecoder](
      ssc, kafkaConf, Set(topic)).map(_._2)

I am using this DStream and processing RDDs:

    val output = lines.foreachRDD(rdd =>
      rdd.foreachPartition { partition =>
        partition.foreach { file => runConfigParser(file) }
      }
    )

runConfigParser is a Java method which parses a file and produces an output that I have to save in HDFS. So multiple nodes will process …
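One way to end up with a single HDFS file per batch is to run the parser inside a map (so its results stay in the RDD rather than being discarded by foreach) and collapse to one partition before saving. This is a sketch, not the asker's code: the output path is assumed and a trivial placeholder stands in for the Java runConfigParser.

```scala
import org.apache.spark.streaming.dstream.DStream

object SingleFilePerBatch {
  // placeholder standing in for the asker's Java runConfigParser
  def runConfigParser(file: String): String = file

  def saveParsedBatches(lines: DStream[String]): Unit = {
    lines.foreachRDD { (rdd, time) =>
      val parsed = rdd.map(runConfigParser)   // parsing stays distributed across the nodes
      if (!parsed.isEmpty()) {
        parsed.coalesce(1)                    // one partition => a single part- file for this batch
              .saveAsTextFile(s"hdfs:///output/parsed-${time.milliseconds}")   // assumed output path
      }
    }
  }
}
```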

how to set and get static variables from spark?

Submitted by 元气小坏坏 on 2019-12-29 06:22:53

Question: I have a class like this:

    public class Test {
      private static String name;

      public static String getName() { return name; }

      public static void setName(String name) { Test.name = name; }

      public static void print() { System.out.println(name); }
    }

In my Spark driver, I'm setting the name like this and calling the print() command:

    public final class TestDriver {
      public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("TestApp");
        // ...
        // ...
        Test …
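A static field set on the driver is generally not visible on the executors, since each executor JVM holds its own copy of the class; the common alternative is a broadcast variable. Below is a sketch of that alternative in Scala rather than the question's Java, with illustrative names only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastNameExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TestApp"))
    val nameBroadcast = sc.broadcast("Testname")   // set once on the driver
    sc.parallelize(1 to 3).foreach { _ =>
      // runs on the executors; the broadcast value is shipped to each of them
      println(nameBroadcast.value)
    }
    sc.stop()
  }
}
```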

How to fix “java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord” in Spark Streaming Kafka Consumer?

Submitted by 江枫思渺然 on 2019-12-28 14:57:09

Question: Spark 2.0.0, Apache Kafka 0.10.1.0, Scala 2.11.8. When I use Spark Streaming and Kafka integration with Kafka broker version 0.10.1.0 and the following Scala code, it fails with this exception:

    16/11/13 12:55:20 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord
    Serialization stack:
    - object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = …
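A common workaround is to map each ConsumerRecord to plain key/value pairs immediately after createDirectStream, so the non-serializable record never has to cross a serialization boundary (print, window, shuffle, and so on). A self-contained sketch against the spark-streaming-kafka-0-10 API, with placeholder broker, group, and topic names:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaValueOnly {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaValueOnly"), Seconds(4))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest"
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("input-topic"), kafkaParams))
    // extract plain fields before any operation that would serialize the records
    val pairs = stream.map(record => (record.key, record.value))
    pairs.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```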

reading json file in pyspark

Submitted by 一世执手 on 2019-12-28 03:11:05

Question: I'm new to PySpark. Below is my JSON file format from Kafka:

    {
      "header": { "platform": "atm", "version": "2.0" },
      "details": [
        { "abc": "3", "def": "4" },
        { "abc": "5", "def": "6" },
        { "abc": "7", "def": "8" }
      ]
    }

How can I read through the values of all "abc" and "def" in details and add them to a new list like [(1,2),(3,4),(5,6),(7,8)]? The new list will be used to create a Spark data frame. How can I do this in PySpark? I tried the code below.

    parsed = messages.map(lambda (k,v): json.loads(v))
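For comparison, the same extraction sketched in Scala (the question itself is PySpark): read the document with the DataFrame JSON reader and explode the details array into one row per pair. The file path is a placeholder, and the multiLine option is assumed to be available (Spark 2.2+).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object ReadDetails {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadDetails").getOrCreate()
    // multiLine is needed because the sample document spans several lines
    val df = spark.read.option("multiLine", "true").json("hdfs:///tmp/sample.json")
    val pairs = df.select(explode(col("details")).as("d"))
                  .select(col("d.abc"), col("d.def"))
    pairs.show()   // one row per (abc, def) pair, usable directly as a DataFrame
    spark.stop()
  }
}
```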

how to call oracle stored proc in spark?

Submitted by 荒凉一梦 on 2019-12-26 08:21:54

Question: In my Spark project I am using spark-sql-2.4.1v. As part of my code I need to call Oracle stored procs in my Spark job. How can I call Oracle stored procs?

Answer 1: You can try doing something like this, though I have never tried this personally in any implementation:

    query = "exec SP_NAME"
    empDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:oracle:thin:username/password@//hostname:portnumber/SID") \
        .option("dbtable", query) \
        .option("user", "db_user_name") \
        .option("password", …
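Since the DataFrame JDBC source is designed to read tables or queries rather than execute procedures, another commonly suggested route is to call the procedure directly over JDBC from the driver with a CallableStatement. A sketch in Scala with placeholder URL, credentials, and procedure name; the Oracle JDBC driver is assumed to be on the classpath.

```scala
import java.sql.DriverManager

object CallOracleProc {
  def main(args: Array[String]): Unit = {
    val url = "jdbc:oracle:thin:@//hostname:1521/SID"                 // placeholder connection string
    val conn = DriverManager.getConnection(url, "db_user_name", "db_password")
    try {
      val stmt = conn.prepareCall("{call SP_NAME(?)}")                // hypothetical one-argument proc
      stmt.setString(1, "some_input")
      stmt.execute()
      stmt.close()
    } finally {
      conn.close()
    }
  }
}
```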