spark-streaming

How to connect Spark Streaming with Cassandra?

一曲冷凌霜 submitted on 2020-01-01 15:32:07
Question: I'm using Cassandra v2.1.12, Spark v1.4.1, and Scala 2.10, and Cassandra is listening on rpc_address 127.0.1.1, rpc_port 9160. For example, to connect Kafka and Spark Streaming, polling Kafka every 4 seconds, I have the following Spark job:

    sc = SparkContext(conf=conf)
    stream = StreamingContext(sc, 4)
    map1 = {'topic_name': 1}
    kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)

Spark Streaming then listens to the Kafka broker every 4 seconds and outputs the contents. How can I do the same against Cassandra?
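
A minimal sketch of one common approach, using the DataStax spark-cassandra-connector in Scala (the keyspace, table, and column names are hypothetical; note the connector speaks Cassandra's native protocol on port 9042, not the Thrift rpc_port 9160):

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.streaming._
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Point the connector at the Cassandra node from the question.
    val conf = new SparkConf()
      .setAppName("kafka-to-cassandra")
      .set("spark.cassandra.connection.host", "127.0.1.1")

    val ssc = new StreamingContext(conf, Seconds(4))
    val stream = KafkaUtils.createStream(ssc, "localhost:2181", "name", Map("topic_name" -> 1))

    // Write each micro-batch of (key, value) pairs; "my_keyspace",
    // "my_table", and the column names are placeholders.
    stream.saveToCassandra("my_keyspace", "my_table", SomeColumns("key", "value"))

    ssc.start()
    ssc.awaitTermination()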

Why can I only see one Spark Streaming KafkaReceiver?

ⅰ亾dé卋堺 submitted on 2020-01-01 12:41:31
Question: I'm confused about why I can only see one KafkaReceiver in the Spark web UI (port 8080), even though I have 10 partitions in Kafka and I'm using 10 cores in the Spark cluster. My code, in Python, is as follows:

    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 10})

I would expect the number of KafkaReceivers to be 10 rather than 1. I'm so confused. Thank you in advance!

Answer 1:

    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 10})

That code creates 1 receiver
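
The answer is cut off above, but the key point is that a single createStream call always yields a single receiver; {topic: 10} merely gives that one receiver 10 consumer threads. A hedged Scala sketch of the usual fix (the ZooKeeper quorum and topic name are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("multi-receiver"), Seconds(4))
    val zkQuorum = "localhost:2181"  // placeholder ZooKeeper quorum
    val topic = "topic"              // placeholder topic name

    // One createStream call = one receiver. To occupy 10 cores with 10
    // receivers, create 10 streams (1 consumer thread each) and union them.
    val streams = (1 to 10).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", Map(topic -> 1))
    }
    val unified = ssc.union(streams)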

Change output file name in Spark Streaming

让人想犯罪 __ submitted on 2020-01-01 12:10:49
Question: I am running a Spark job that performs extremely well as far as the logic goes. However, my output files are named part-00000, part-00001, etc. when I use saveAsTextFile to save them to an S3 bucket. Is there a way to change the output filename? Thank you.

Answer 1: In Spark, you can use saveAsNewAPIHadoopFile and set the mapreduce.output.basename parameter in the Hadoop configuration to change the prefix (just the "part" prefix):

    val hadoopConf = new Configuration()
    hadoopConf.set(
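
The answer's snippet is truncated; a fuller sketch of the same approach follows (the bucket path and "customName" prefix are hypothetical, and rdd is assumed to be an RDD[String]):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    val hadoopConf = new Configuration()
    // Output files become customName-r-00000, customName-r-00001, ...
    hadoopConf.set("mapreduce.output.basename", "customName")

    rdd.map(line => (NullWritable.get(), new Text(line)))
      .saveAsNewAPIHadoopFile(
        "s3://my-bucket/output",  // hypothetical destination
        classOf[NullWritable],
        classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]],
        hadoopConf)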

com.fasterxml.jackson.databind.JsonMappingException: Jackson version is too old 2.5.3

六月ゝ 毕业季﹏ submitted on 2020-01-01 09:22:38
Question: My OS is OS X 10.11.6, and I'm running Spark 2.0, Zeppelin 0.6, and Scala 2.11. When I run this code in Zeppelin I get an exception from Jackson; when I run the same code in spark-shell there is no exception.

    val filestream = ssc.textFileStream("/Users/davidlaxer/first-edition/ch06")

    com.fasterxml.jackson.databind.JsonMappingException: Jackson version is too old 2.5.3
      at com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:56)
      at com.fasterxml.jackson.module.scala
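
This error means jackson-module-scala found a jackson-databind older than it expects on the classpath (here 2.5.3). One common remedy, sketched for an sbt 1.x build (2.6.5 matches what Spark 2.0 ships, but treat the exact number as an assumption), is to force a single Jackson version everywhere:

    // build.sbt — pin all Jackson artifacts to one version so
    // jackson-module-scala never pairs with an older jackson-databind.
    dependencyOverrides ++= Seq(
      "com.fasterxml.jackson.core"   %  "jackson-core"         % "2.6.5",
      "com.fasterxml.jackson.core"   %  "jackson-databind"     % "2.6.5",
      "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.5"
    )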

Spark Streaming: Reading data from Kafka that has multiple schemas

五迷三道 submitted on 2020-01-01 06:32:33
Question: I am struggling with the implementation in Spark Streaming. The messages from Kafka look like this, but with more fields:

    {"event":"sensordata", "source":"sensors", "payload": {"actual data as a json}}
    {"event":"databasedata", "mysql":"sensors", "payload": {"actual data as a json}}
    {"event":"eventApi", "source":"event1", "payload": {"actual data as a json}}
    {"event":"eventapi", "source":"event2", "payload": {"actual data as a json}}

I am trying to read the messages from a Kafka topic
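
A hedged Scala sketch of one common pattern for this: parse only the shared envelope first, keep the payload as raw JSON, and branch on the "event" discriminator. The Envelope case class and the messages DStream[String] are assumptions, and json4s is just one JSON library choice:

    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse

    // Hypothetical envelope mirroring the sample messages; "source" is
    // optional because one message carries "mysql" instead.
    case class Envelope(event: String, source: Option[String], payload: JValue)

    // messages is assumed to be a DStream[String] of raw Kafka values.
    val parsed = messages.map { raw =>
      implicit val formats: Formats = DefaultFormats
      parse(raw).extract[Envelope]
    }

    // Route each record by its event type; payload stays as untouched JSON
    // that each branch can decode against its own schema.
    val sensorData = parsed.filter(_.event == "sensordata")
    val apiEvents  = parsed.filter(_.event.equalsIgnoreCase("eventapi"))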

Spark Streaming + Kafka: SparkException: Couldn't find leader offsets for Set

久未见 submitted on 2019-12-31 13:15:34
Question: I'm trying to set up Spark Streaming to get messages from a Kafka queue, but I'm getting the following error:

    py4j.protocol.Py4JJavaError: An error occurred while calling o30.createDirectStream.
    : org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
    org.apache.spark.SparkException: Couldn't find leader offsets for Set([test-topic,0])
      at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
      at org.apache.spark.streaming.kafka
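
"Couldn't find leader offsets" usually means the driver could not reach the partition leader, typically because metadata.broker.list points at ZooKeeper instead of a broker, or the broker advertises an unreachable host. A Scala sketch of the direct-stream setup (the broker address is illustrative, and ssc is assumed to be an existing StreamingContext):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // metadata.broker.list must name a Kafka broker (default port 9092),
    // never ZooKeeper's 2181.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("test-topic")

    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)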

Yarn: yarn-site.xml changes not taking effect

点点圈 submitted on 2019-12-31 04:13:07
Question: We have a Spark Streaming application running on HDFS 2.7.3 with Yarn as the resource manager. While the application runs, these two folders

    /tmp/hadoop/data/nm-local-dir/filecache
    /tmp/hadoop/data/nm-local-dir/filecache

are filling up, and hence the disk. From my research I found that configuring these two properties in yarn-site.xml should help:

    <property>
      <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
      <value>2000</value>
    </property>
    <property>
      <name>yarn
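
The snippet is truncated; the property usually paired with the cleanup interval is yarn.nodemanager.localizer.cache.target-size-mb, since the cache is only trimmed once it exceeds that size. A sketch of the pair (the 1024 value is illustrative; the Hadoop default is 10240), keeping in mind that NodeManagers must be restarted before yarn-site.xml changes take effect:

    <property>
      <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
      <value>2000</value>
    </property>
    <property>
      <!-- Trim the local cache once it grows past this size (MB). -->
      <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
      <value>1024</value>
    </property>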

Convert a Spark dataframe column with an array of JSON objects to multiple rows

纵饮孤独 submitted on 2019-12-31 04:06:54
Question: I have streaming JSON data whose structure can be described with the case class below:

    case class Hello(A: String, B: Array[Map[String, String]])

Sample data for the same is as below:

    | A   | B                                      |
    |-----|----------------------------------------|
    | ABC | [{C:1, D:1}, {C:2, D:4}]               |
    | XYZ | [{C:3, D:6}, {C:9, D:11}, {C:5, D:12}] |

I want to transform it to:

    | A   | C | D  |
    |-----|---|----|
    | ABC | 1 | 1  |
    | ABC | 2 | 4  |
    | XYZ | 3 | 6  |
    | XYZ | 9 | 11 |
    | XYZ | 5 | 12 |

Any help will be appreciated.
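
A sketch of one standard way to do this with explode (assuming ds is a Dataset[Hello] or a DataFrame with the same schema):

    import org.apache.spark.sql.functions.{col, explode}

    // Explode the array so each Map becomes its own row, then lift the
    // C and D map entries into top-level columns.
    val flattened = ds
      .withColumn("B", explode(col("B")))
      .select(
        col("A"),
        col("B").getItem("C").as("C"),
        col("B").getItem("D").as("D"))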

Spark Streaming StreamingContext.start() - Error starting receiver 0

混江龙づ霸主 submitted on 2019-12-31 03:32:26
Question: I have a project that uses Spark Streaming, and I'm running it with spark-submit, but I'm hitting this error:

    15/01/14 10:34:18 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.lang.AbstractMethodError
      at org.apache.spark.Logging$class.log(Logging.scala:52)
      at org.apache.spark.streaming.kafka.KafkaReceiver.log(KafkaInputDStream.scala:66)
      at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
      at org.apache.spark.streaming.kafka.KafkaReceiver
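
An AbstractMethodError at runtime almost always means the application was compiled against a different Spark version than the cluster runs. A hedged sbt sketch of the usual fix (the version number is illustrative and must match the cluster):

    // build.sbt — compile against the cluster's exact Spark version and
    // mark core modules "provided" so spark-submit supplies them at runtime.
    val sparkVersion = "1.2.0"  // illustrative; must match the cluster
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % sparkVersion % "provided",
      "org.apache.spark" %% "spark-streaming"       % sparkVersion % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
    )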