spark-streaming

Exception org.apache.spark.rdd.RDD[(scala.collection.immutable.Map[String,Any], Int)] in scala/spark

别等时光非礼了梦想. Submitted on 2019-12-12 01:49:55

Question: Using the code below I am getting tweets for a particular filter:

val topCounts60 = tweetMap.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60*60))

One sample output of topCounts60, if I do topCounts60.println(), is in the following format:

(Map(UserLang -> en, UserName -> Harmeet Singh, UserScreenName -> harmeetsingh060, HashTags -> , UserVerification -> false, Spam -> true, UserFollowersCount -> 44, UserLocation -> भारत, UserStatusCount -> 50, UserCreated -> 2016-07-04T06:32:49.000+0530,
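A minimal sketch of a common variant, keying the windowed count by a single field pulled out of the map rather than by the whole Map[String, Any] (the field name UserScreenName is taken from the sample output above; everything else is an assumption):

// Assumed: tweetMap is a DStream[Map[String, Any]] as in the question.
val screenNameCounts60 = tweetMap
  .map(m => (m.getOrElse("UserScreenName", "unknown").toString, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60 * 60))
screenNameCounts60.print()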

How to count new elements from a stream using spark-streaming

狂风中的少年 Submitted on 2019-12-12 01:39:47

Question: I have implemented a daily computation. Here is some pseudo-code ("newUser" could be called a first-time activated user):

// Get today's log from HBase or somewhere else
val log = getRddFromHbase(todayDate)
// Compute active users
val activeUser = log.map(line => ((line.uid, line.appId), line)).reduceByKey(distinctStrategyMethod)
// Get history users from HDFS
val historyUser = loadFromHdfs(path + yesterdayDate)
// Compute new users from active users and historyUser
val newUser = activeUser.subtractByKey
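A hedged sketch of how that computation is usually completed (getRddFromHbase, loadFromHdfs, distinctStrategyMethod, path and the dates are the question's own placeholders; the final union/save step is an assumption about how the history set is carried forward to the next day):

val activeUser = log
  .map(line => ((line.uid, line.appId), line))
  .reduceByKey(distinctStrategyMethod)

val historyUser = loadFromHdfs(path + yesterdayDate)

// New users: active today, never seen in the history set
val newUser = activeUser.subtractByKey(historyUser)

// Assumed: persist the merged history so tomorrow's run subtracts against it
activeUser.union(historyUser).saveAsObjectFile(path + todayDate)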

Spark Streaming to Cassandra, not persisting

て烟熏妆下的殇ゞ Submitted on 2019-12-12 01:38:10

Question: I am trying to persist a Spark stream to Cassandra. Here is my code:

JavaDStream<BusinessPointNYCT> studentFileDStream = m_JavaStreamingContext
    .textFileStream(new File(fileDir, "BUSINESSPOINTS_NY_CT.csv").getAbsolutePath())
    .map(new BusinessPointMapFunction());

// Save it to Cassandra
CassandraStreamingJavaUtil.javaFunctions(studentFileDStream)
    .writerBuilder("spatial_keyspace", "businesspoints_ny_ct", mapToRow(BusinessPointNYCT.class))
    .saveToCassandra();

My application is started without any
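For reference, a minimal Scala sketch of the same write path with the spark-cassandra-connector streaming API (the keyspace and table names are copied from the question; the case class, its fields and the parser are assumptions, and nothing is written unless the streaming context is actually started):

import com.datastax.spark.connector.streaming._

// Assumed case class matching the target table's columns.
case class BusinessPointNYCT(id: Long, name: String)

// Assumed parser from a CSV line to the case class.
def parseBusinessPoint(line: String): BusinessPointNYCT = {
  val fields = line.split(",")
  BusinessPointNYCT(fields(0).toLong, fields(1))
}

val points = ssc.textFileStream("/path/to/csv/dir").map(parseBusinessPoint)
points.saveToCassandra("spatial_keyspace", "businesspoints_ny_ct")

ssc.start()
ssc.awaitTermination()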

How to use math.sqrt for DStream[(Double,Double)]?

一曲冷凌霜 Submitted on 2019-12-12 01:36:21

Question: For the streaming data DStream[(Double, Double)], how do I estimate the root mean squared error? See my code below. The line math.sqrt(summse) is where I have a problem (the code does not compile):

def calculateRMSE(output: DStream[(Double, Double)], n: DStream[Long]): Double = {
  val summse = output.foreachRDD { rdd =>
    rdd.map { case pair: (Double, Double) =>
      val err = math.abs(pair._1 - pair._2)
      err * err
    }.reduce(_ + _)
  }
  math.sqrt(summse)
}

UPDATE: The code doesn't compile: Cannot resolve
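A minimal sketch of one way around the compile error, assuming a per-batch RMSE printed as a side effect is acceptable: foreachRDD returns Unit, so the square root has to be computed inside the per-batch action rather than on its result:

def printRMSE(output: DStream[(Double, Double)]): Unit =
  output.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val n = rdd.count()
      // Sum of squared errors for this batch
      val sse = rdd.map { case (predicted, actual) =>
        val err = predicted - actual
        err * err
      }.reduce(_ + _)
      println(s"batch RMSE = ${math.sqrt(sse / n)}")
    }
  }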

Avoid write files for empty partitions in Spark Streaming

无人久伴 Submitted on 2019-12-12 01:26:19

Question: I have a Spark Streaming job which reads data from Kafka partitions (one executor per partition). I need to save the transformed values to HDFS, but I need to avoid creating empty files. I tried to use isEmpty, but this doesn't help when not all partitions are empty. P.S. repartition is not an acceptable solution due to performance degradation.

Answer 1: The code works for PairRDD only. Code for text:

val conf = ssc.sparkContext.hadoopConfiguration
conf.setClass("mapreduce.output.lazyoutputformat
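A hedged sketch of the idea the (truncated) answer points at: wrap the real output format in Hadoop's LazyOutputFormat, which only creates a part file once a partition actually writes a record, so empty partitions leave no files behind. The DStream name, value type and output path below are placeholders:

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.OutputFormat
import org.apache.hadoop.mapreduce.lib.output.{LazyOutputFormat, TextOutputFormat}

val hadoopConf = ssc.sparkContext.hadoopConfiguration
// Tell LazyOutputFormat which real output format to delegate to.
hadoopConf.setClass("mapreduce.output.lazyoutputformat.outputformat",
  classOf[TextOutputFormat[NullWritable, Text]], classOf[OutputFormat[_, _]])

transformed.foreachRDD { (rdd, time) =>
  rdd.map(v => (NullWritable.get(), new Text(v.toString)))
    .saveAsNewAPIHadoopFile(
      s"/output/dir/batch-${time.milliseconds}",
      classOf[NullWritable],
      classOf[Text],
      classOf[LazyOutputFormat[NullWritable, Text]],
      hadoopConf)
}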

java.lang.ClassNotFoundException: java.time.temporal.TemporalField when running Spark code

痴心易碎 Submitted on 2019-12-12 01:20:36

Question: This question is related to the previous thread. I am extracting sessions from the stream of users' click events. For validation purposes, I always wait for a timeout of 2 minutes, and if the user was inactive during these 2 minutes (no click events), then I assume that the session has finished. These finished sessions should be saved in finishedSessions. The code below produces the error (see below):

settings = ssc.sparkContext.broadcast(Map(
  "metadataBrokerList_OutputQueue" ->
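As a side note, a minimal sketch of the inactivity-timeout pattern described above using mapWithState (the clicks stream, its element types and the session representation are assumptions, not the question's code):

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Track one session (a list of click timestamps) per user and let Spark
// time the state out after 2 minutes without new events.
def trackSession(userId: String,
                 click: Option[Long],
                 state: State[List[Long]]): Option[(String, List[Long])] = {
  if (state.isTimingOut()) {
    Some((userId, state.get()))   // no events for 2 minutes: session finished
  } else {
    state.update(click.toList ++ state.getOption().getOrElse(Nil))
    None
  }
}

// clicks: DStream[(String, Long)] of (userId, clickTimestamp)
val finishedSessions = clicks
  .mapWithState(StateSpec.function(trackSession _).timeout(Minutes(2)))
  .flatMap(_.toSeq)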

PySpark Streaming process failed with await termination

丶灬走出姿态 Submitted on 2019-12-12 01:16:17

Question: Here is the streaming code that I run. After running for two days it stops automatically; did I miss something?

def streaming_setup():
    stream = StreamingContext(sc.sparkContext, 10)
    stream.checkpoint(config['checkpointPath'])
    lines_data = stream.textFileStream(monitor_directory)
    lines_data.foreachRDD(persist_file)
    return stream

The Spark Streaming session is started here:

ssc = StreamingContext.getOrCreate(config['checkpointPath'], lambda: streaming_setup())
ssc = streaming_setup()
ssc.start()
ssc

spark streaming not able to use spark sql

孤者浪人 Submitted on 2019-12-12 01:09:48

Question: I am facing an issue with Spark Streaming: I get empty records after the stream is read and passed to my "parse" method. My code:

import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Encoders
import org.apache.spark.streaming._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import spark.implicits._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql

Spark DStream from Kafka always starts at beginning

一个人想着一个人 Submitted on 2019-12-11 21:44:03

Question: Look at my last comment on the accepted answer for the solution. I configured a DStream like so:

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka1.example.com:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[KafkaAvroDeserializer],
  "group.id" -> "mygroup",
  "specific.avro.reader" -> true,
  "schema.registry.url" -> "http://schema.example.com:8081"
)
val stream = KafkaUtils.createDirectStream(
  ssc,
  PreferConsistent,
  Subscribe[String,
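A hedged sketch of one common fix for a direct stream that replays from the beginning on every restart: disable the consumer's auto-commit and commit each batch's offset ranges back to Kafka yourself, so the consumer group resumes from where it stopped. The two extra parameters and the processing placeholder below are assumptions layered on the question's configuration:

// Added on top of the question's kafkaParams (assumed values):
//   "auto.offset.reset"  -> "latest"                    // only used when the group has no committed offset
//   "enable.auto.commit" -> (false: java.lang.Boolean)  // commit manually per batch instead

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  // Commit this batch's offsets so the next restart resumes here.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}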

Found nothing in _spark_metadata

爷,独闯天下 Submitted on 2019-12-11 21:36:34

Question: I am trying to read CSV files from a specific folder and write the same contents to another CSV file in a different location on the local PC, for learning purposes. I can read the files and show their contents on the console. However, when I write them to another CSV file in the specified output directory, I get a folder named "_spark_metadata" which contains nothing. I paste the whole code here step by step. Creating the Spark session:

spark = SparkSession \
    .builder \
    .appName('csv01') \
    .master(
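For comparison, a minimal Scala sketch of a file-to-file Structured Streaming pipeline (the schema, paths and column names are assumptions): the streaming CSV source needs an explicit schema and the file sink needs a checkpointLocation; _spark_metadata is only the sink's commit log, while the data itself lands in part files next to it:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().appName("csv01").master("local[*]").getOrCreate()

// Assumed schema for the input CSV files.
val schema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)

val input = spark.readStream
  .option("header", "true")
  .schema(schema)
  .csv("/path/to/input/folder")

val query = input.writeStream
  .format("csv")
  .option("path", "/path/to/output/folder")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()

query.awaitTermination()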