spark-streaming

How to connect Spark Streaming with Cassandra?

一曲冷凌霜 submitted on 2020-01-01 15:32:07
Question: I'm using Cassandra v2.1.12, Spark v1.4.1, and Scala 2.10, and Cassandra is listening on rpc_address 127.0.1.1, rpc_port 9160. For example, to connect Kafka and Spark Streaming, polling Kafka every 4 seconds, I have the following Spark job:

    sc = SparkContext(conf=conf)
    stream = StreamingContext(sc, 4)
    map1 = {'topic_name': 1}
    kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)

Spark Streaming then listens to the Kafka broker every 4 seconds and outputs the contents. How can I do the same against Cassandra?
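
A minimal sketch of one common approach, using the DataStax spark-cassandra-connector in Scala (the keyspace, table, and column names are hypothetical; note the connector speaks Cassandra's native protocol on port 9042, not the Thrift rpc_port 9160):

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.streaming._
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Point the connector at the Cassandra node from the question.
    val conf = new SparkConf()
      .setAppName("kafka-to-cassandra")
      .set("spark.cassandra.connection.host", "127.0.1.1")

    val ssc = new StreamingContext(conf, Seconds(4))
    val stream = KafkaUtils.createStream(ssc, "localhost:2181", "name", Map("topic_name" -> 1))

    // Write each micro-batch of (key, value) pairs; "my_keyspace",
    // "my_table", and the column names are placeholders.
    stream.saveToCassandra("my_keyspace", "my_table", SomeColumns("key", "value"))

    ssc.start()
    ssc.awaitTermination()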

Why can I only see one Spark Streaming KafkaReceiver?

ⅰ亾dé卋堺 submitted on 2020-01-01 12:41:31
Question: I'm confused about why I can only see one KafkaReceiver in the Spark web UI (port 8080), even though I have 10 partitions in Kafka and I'm using 10 cores in the Spark cluster. My code, in Python, is as follows:

    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 10})

I would expect the number of KafkaReceivers to be 10 rather than 1. I'm so confused. Thank you in advance!

Answer 1:

    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 10})

That code creates 1 receiver
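
The answer is cut off above, but the key point is that a single createStream call always yields a single receiver; {topic: 10} merely gives that one receiver 10 consumer threads. A hedged Scala sketch of the usual fix (the ZooKeeper quorum and topic name are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("multi-receiver"), Seconds(4))
    val zkQuorum = "localhost:2181"  // placeholder ZooKeeper quorum
    val topic = "topic"              // placeholder topic name

    // One createStream call = one receiver. To occupy 10 cores with 10
    // receivers, create 10 streams (1 consumer thread each) and union them.
    val streams = (1 to 10).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", Map(topic -> 1))
    }
    val unified = ssc.union(streams)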

Change output file name in Spark Streaming

让人想犯罪 __ submitted on 2020-01-01 12:10:49
Question: I am running a Spark job that performs extremely well as far as the logic goes. However, my output files are named part-00000, part-00001, etc. when I use saveAsTextFile to save them to an S3 bucket. Is there a way to change the output filename? Thank you.

Answer 1: In Spark, you can use saveAsNewAPIHadoopFile and set the mapreduce.output.basename parameter in the Hadoop configuration to change the prefix (just the "part" prefix):

    val hadoopConf = new Configuration()
    hadoopConf.set(
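
The answer's snippet is truncated; a fuller sketch of the same approach follows (the bucket path and "customName" prefix are hypothetical, and rdd is assumed to be an RDD[String]):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    val hadoopConf = new Configuration()
    // Output files become customName-r-00000, customName-r-00001, ...
    hadoopConf.set("mapreduce.output.basename", "customName")

    rdd.map(line => (NullWritable.get(), new Text(line)))
      .saveAsNewAPIHadoopFile(
        "s3://my-bucket/output",  // hypothetical destination
        classOf[NullWritable],
        classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]],
        hadoopConf)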

com.fasterxml.jackson.databind.JsonMappingException: Jackson version is too old 2.5.3

六月ゝ 毕业季﹏ submitted on 2020-01-01 09:22:38
Question: My OS is OS X 10.11.6, and I'm running Spark 2.0, Zeppelin 0.6, and Scala 2.11. When I run this code in Zeppelin I get an exception from Jackson; when I run the same code in spark-shell there is no exception.

    val filestream = ssc.textFileStream("/Users/davidlaxer/first-edition/ch06")

    com.fasterxml.jackson.databind.JsonMappingException: Jackson version is too old 2.5.3
      at com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:56)
      at com.fasterxml.jackson.module.scala
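
This error means jackson-module-scala found a jackson-databind older than it expects on the classpath (here 2.5.3). One common remedy, sketched for an sbt 1.x build (2.6.5 matches what Spark 2.0 ships, but treat the exact number as an assumption), is to force a single Jackson version everywhere:

    // build.sbt — pin all Jackson artifacts to one version so
    // jackson-module-scala never pairs with an older jackson-databind.
    dependencyOverrides ++= Seq(
      "com.fasterxml.jackson.core"   %  "jackson-core"         % "2.6.5",
      "com.fasterxml.jackson.core"   %  "jackson-databind"     % "2.6.5",
      "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.5"
    )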

Spark Streaming: Reading data from Kafka that has multiple schemas

五迷三道 submitted on 2020-01-01 06:32:33
Question: I am struggling with the implementation in Spark Streaming. The messages from Kafka look like this, but with more fields:

    {"event":"sensordata", "source":"sensors", "payload": {"actual data as a json}}
    {"event":"databasedata", "mysql":"sensors", "payload": {"actual data as a json}}
    {"event":"eventApi", "source":"event1", "payload": {"actual data as a json}}
    {"event":"eventapi", "source":"event2", "payload": {"actual data as a json}}

I am trying to read the messages from a Kafka topic
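
A hedged Scala sketch of one common pattern for this: parse only the shared envelope first, keep the payload as raw JSON, and branch on the "event" discriminator. The Envelope case class and the messages DStream[String] are assumptions, and json4s is just one JSON library choice:

    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse

    // Hypothetical envelope mirroring the sample messages; "source" is
    // optional because one message carries "mysql" instead.
    case class Envelope(event: String, source: Option[String], payload: JValue)

    // messages is assumed to be a DStream[String] of raw Kafka values.
    val parsed = messages.map { raw =>
      implicit val formats: Formats = DefaultFormats
      parse(raw).extract[Envelope]
    }

    // Route each record by its event type; payload stays as untouched JSON
    // that each branch can decode against its own schema.
    val sensorData = parsed.filter(_.event == "sensordata")
    val apiEvents  = parsed.filter(_.event.equalsIgnoreCase("eventapi"))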

Spark Streaming + Kafka: SparkException: Couldn't find leader offsets for Set

久未见 submitted on 2019-12-31 13:15:34
Question: I'm trying to set up Spark Streaming to get messages from a Kafka queue, but I'm getting the following error:

    py4j.protocol.Py4JJavaError: An error occurred while calling o30.createDirectStream.
    : org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
    org.apache.spark.SparkException: Couldn't find leader offsets for Set([test-topic,0])
      at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
      at org.apache.spark.streaming.kafka
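
"Couldn't find leader offsets" usually means the driver could not reach the partition leader, typically because metadata.broker.list points at ZooKeeper instead of a broker, or the broker advertises an unreachable host. A Scala sketch of the direct-stream setup (the broker address is illustrative, and ssc is assumed to be an existing StreamingContext):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // metadata.broker.list must name a Kafka broker (default port 9092),
    // never ZooKeeper's 2181.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("test-topic")

    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)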

Yarn: yarn-site.xml changes not taking effect

点点圈 submitted on 2019-12-31 04:13:07
Question: We have a Spark Streaming application running on HDFS 2.7.3 with Yarn as the resource manager. While the application runs, these two folders

    /tmp/hadoop/data/nm-local-dir/filecache
    /tmp/hadoop/data/nm-local-dir/filecache

are filling up, and hence the disk. From my research I found that configuring these two properties in yarn-site.xml should help:

    <property>
      <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
      <value>2000</value>
    </property>
    <property>
      <name>yarn
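
The snippet is truncated; the property usually paired with the cleanup interval is yarn.nodemanager.localizer.cache.target-size-mb, since the cache is only trimmed once it exceeds that size. A sketch of the pair (the 1024 value is illustrative; the Hadoop default is 10240), keeping in mind that NodeManagers must be restarted before yarn-site.xml changes take effect:

    <property>
      <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
      <value>2000</value>
    </property>
    <property>
      <!-- Trim the local cache once it grows past this size (MB). -->
      <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
      <value>1024</value>
    </property>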

Convert a Spark dataframe column with an array of JSON objects to multiple rows

纵饮孤独 submitted on 2019-12-31 04:06:54
Question: I have streaming JSON data whose structure can be described with the case class below:

    case class Hello(A: String, B: Array[Map[String, String]])

Sample data for the same is as below:

    | A   | B                                      |
    |-----|----------------------------------------|
    | ABC | [{C:1, D:1}, {C:2, D:4}]               |
    | XYZ | [{C:3, D:6}, {C:9, D:11}, {C:5, D:12}] |

I want to transform it to:

    | A   | C | D  |
    |-----|---|----|
    | ABC | 1 | 1  |
    | ABC | 2 | 4  |
    | XYZ | 3 | 6  |
    | XYZ | 9 | 11 |
    | XYZ | 5 | 12 |

Any help will be appreciated.
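
A sketch of one standard way to do this with explode (assuming ds is a Dataset[Hello] or a DataFrame with the same schema):

    import org.apache.spark.sql.functions.{col, explode}

    // Explode the array so each Map becomes its own row, then lift the
    // C and D map entries into top-level columns.
    val flattened = ds
      .withColumn("B", explode(col("B")))
      .select(
        col("A"),
        col("B").getItem("C").as("C"),
        col("B").getItem("D").as("D"))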

Spark Streaming StreamingContext.start() - Error starting receiver 0

混江龙づ霸主 submitted on 2019-12-31 03:32:26
Question: I have a project that uses Spark Streaming, and I'm running it with spark-submit, but I'm hitting this error:

    15/01/14 10:34:18 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.lang.AbstractMethodError
      at org.apache.spark.Logging$class.log(Logging.scala:52)
      at org.apache.spark.streaming.kafka.KafkaReceiver.log(KafkaInputDStream.scala:66)
      at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
      at org.apache.spark.streaming.kafka.KafkaReceiver
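
An AbstractMethodError at runtime almost always means the application was compiled against a different Spark version than the cluster runs. A hedged sbt sketch of the usual fix (the version number is illustrative and must match the cluster):

    // build.sbt — compile against the cluster's exact Spark version and
    // mark core modules "provided" so spark-submit supplies them at runtime.
    val sparkVersion = "1.2.0"  // illustrative; must match the cluster
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % sparkVersion % "provided",
      "org.apache.spark" %% "spark-streaming"       % sparkVersion % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
    )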