spark-streaming

Spark - Non-time-based windows are not supported on streaming DataFrames/Datasets;

Question: I need to write a Spark SQL query with an inner select and PARTITION BY. The problem is that I get an AnalysisException. I have already spent a few hours on this, but other approaches have brought no success. Exception: Exception in thread "main" org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;; Window [sum(cast(_w0#41 as bigint)) windowspecdefinition(deviceId#28, timestamp#30 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS …
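
The error comes from applying an analytic window function (over partitionBy/orderBy) to a streaming Dataset, which Structured Streaming does not support; only time-based windows via the window() function inside a groupBy are allowed. A minimal sketch of both the rejected and the accepted pattern, using a toy rate source and column names (deviceId, eventTime, value) taken loosely from the error message:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("window-sketch").master("local[2]").getOrCreate()

// A toy rate source standing in for the real device feed; column names are illustrative.
val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
  .select((col("value") % 5).as("deviceId"),
          col("timestamp").as("eventTime"),
          lit(1L).as("value"))

// Rejected on a streaming Dataset: an analytic (row/range) window function.
//   val w = Window.partitionBy("deviceId").orderBy("eventTime")
//   events.withColumn("running", sum("value").over(w))   // -> AnalysisException

// Accepted: a time-based window inside groupBy, ideally with a watermark.
val perDevice = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(col("deviceId"), window(col("eventTime"), "5 minutes"))
  .agg(sum("value").as("sumValue"))
```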

How to load history data when starting a Spark Streaming process, and calculate running aggregations

Question: I have some sales-related JSON data in my ElasticSearch cluster, and I would like to use Spark Streaming (with Spark 1.4.1) to dynamically aggregate incoming sales events from my eCommerce website via Kafka, so that the user has a current view of their total sales (in terms of revenue and products). What's not really clear to me from the docs I read is how I can load the history data from ElasticSearch when the Spark application starts, and then calculate, for example, the overall revenue per user …
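
One pattern that fits Spark 1.4 is updateStateByKey with an initial-state RDD: load the historical totals once at startup, seed the streaming state with them, and let each micro-batch update the running totals. A minimal sketch, assuming the history can be materialized as an RDD[(userId, revenue)]; the Elasticsearch load and the Kafka source are replaced by stand-ins here:

```scala
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("running-revenue").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/revenue-checkpoint") // required by updateStateByKey

// 1) Seed state from history. Hard-coded stand-in; in the real job this
//    would come from the elasticsearch-hadoop connector.
val history = ssc.sparkContext.parallelize(Seq(("alice", 120.0), ("bob", 45.5)))

// 2) Incoming events as (userId, saleAmount); a Kafka DStream in the real job.
val sales = ssc.socketTextStream("localhost", 9999)
  .map(_.split(","))
  .map(a => (a(0), a(1).toDouble))

// 3) Running total = previous state (seeded from history) + new batch.
val updateRevenue = (batch: Seq[Double], state: Option[Double]) =>
  Some(state.getOrElse(0.0) + batch.sum)

val totals = sales.updateStateByKey(
  updateRevenue, new HashPartitioner(ssc.sparkContext.defaultParallelism), history)

totals.print()
ssc.start(); ssc.awaitTermination()
```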

Spark serialization error when inserting Spark Streaming data into HBase

Question: I'm confused about how Spark interacts with HBase in terms of data format. For instance, when I omit the 'ERROR' line in the following code snippet, it runs well, but when I add the line, I get an error saying 'Task not serializable', related to a serialization issue. How do I change the code, and why does the error happen? My code is the following: // HBase Configuration hconfig = HBaseConfiguration.create(); hconfig.set("hbase.zookeeper.property.clientPort", "2222"); hconfig.set …
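
The usual cause is that the closure shipped to the executors captures a non-serializable object, here the HBase Configuration (and anything built from it) created on the driver. A common fix is to create the configuration and connection inside foreachPartition, so they live entirely on the executor. A minimal sketch against the HBase 1.x client API, with illustrative table and column names:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

def writeToHBase(stream: DStream[(String, String)]): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Created on the executor, once per partition -- never serialized.
      val hconfig = HBaseConfiguration.create()
      hconfig.set("hbase.zookeeper.property.clientPort", "2222")
      val conn  = ConnectionFactory.createConnection(hconfig)
      val table = conn.getTable(TableName.valueOf("events"))
      records.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(value))
        table.put(put)
      }
      table.close(); conn.close()
    }
  }
```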

Can a model be created in Spark batch and used in Spark Streaming?

Question: Can I create a model in Spark batch and use it in Spark Streaming for real-time processing? I have seen the various examples on the Apache Spark site where both training and prediction are built on the same type of processing (linear regression). Answer 1: Can I create a model in Spark batch and use it in Spark Streaming for real-time processing? Of course, yes. In the Spark community this is called offline training, online prediction. Many training algorithms in Spark allow you to save the model to a file …
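
A minimal sketch of the "offline training, online prediction" split with the RDD-based MLlib API (matching the era of the question); the paths and the CSV parsing are illustrative:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
import org.apache.spark.streaming.StreamingContext

// --- Batch job: train once, persist the model. ---
def train(sc: SparkContext): Unit = {
  val data = sc.textFile("hdfs:///training.csv").map { line =>
    val cols = line.split(',').map(_.toDouble)
    LabeledPoint(cols.head, Vectors.dense(cols.tail))
  }
  val model = LinearRegressionWithSGD.train(data, numIterations = 100)
  model.save(sc, "hdfs:///models/linreg")
}

// --- Streaming job: load once on the driver, score each micro-batch. ---
def serve(ssc: StreamingContext): Unit = {
  val model = LinearRegressionModel.load(ssc.sparkContext, "hdfs:///models/linreg")
  ssc.socketTextStream("localhost", 9999)
    .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
    .map(v => (v, model.predict(v))) // the model is serializable, shipped in the closure
    .print()
}
```

The same split works with the newer spark.ml Pipeline API (PipelineModel.save/load) if the job is on Spark 2.x.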

How to convert RDD to DataFrame in Spark Streaming, not just Spark

Question: How can I convert an RDD to a DataFrame in Spark Streaming, not just in Spark? I saw this example, but it requires a SparkContext: val sqlContext = new SQLContext(sc) import sqlContext.implicits._ rdd.toDF() In my case I have a StreamingContext. Should I then create a SparkContext inside foreach? That looks too crazy... So how do I deal with this issue? My final goal (if it might be useful) is to save the DataFrame to Amazon S3 using rdd.toDF.write.format("json").saveAsTextFile("s3://iiiii/ttttt.json"); …
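
The usual answer is that you don't create a new SparkContext: every RDD handed to foreachRDD already carries one, and SQLContext.getOrCreate (Spark 1.5+) returns a singleton built from it. A minimal sketch with an illustrative case class and a placeholder S3 path; note that write.json already writes JSON files, so saveAsTextFile is not needed:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Event(id: String, value: Double)

def save(stream: DStream[Event]): Unit =
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // The RDD knows its SparkContext; getOrCreate reuses one SQLContext.
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      rdd.toDF().write.json(s"s3://bucket/path/batch-${System.currentTimeMillis}")
    }
  }
```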

How can I make Spark Streaming count the words in a file in a unit test?

Question: I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala. When I submit this application to my local Spark, it waits for a file to be written to a given directory, and when I create that file it successfully prints the number of words. I terminate the application by pressing Ctrl+C. Now I've tried to create a very basic unit test for this functionality, but in the test I was not able to print the same information, that is, the …
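
A common testing pattern is to replace the file-based source with queueStream, push known RDDs through the same transformations, and collect the per-batch output into a driver-side buffer that the test can assert on. A minimal self-contained sketch (written as a plain main rather than a JUnit test, and in Scala rather than the question's Java):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object WordCountSpec {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("wordcount-test")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Known input instead of watching a directory.
    val input   = mutable.Queue(ssc.sparkContext.parallelize(Seq("a b a")))
    val results = mutable.ArrayBuffer.empty[(String, Int)]

    ssc.queueStream(input)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .foreachRDD(rdd => results.synchronized { results ++= rdd.collect() })

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // run a few batches, then stop
    ssc.stop(stopSparkContext = true, stopGracefully = false)

    assert(results.toMap == Map("a" -> 2, "b" -> 1), s"unexpected: $results")
    println("test passed")
  }
}
```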

Dynamic Allocation for Spark Streaming

Question: I have a Spark Streaming job running on our cluster alongside other jobs (Spark Core jobs). I want to use dynamic resource allocation for these jobs, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark Streaming (in version 1.6.1), but it is fixed in 2.0.0. JIRA link According to the PDF in this issue, there should be a configuration field called spark.streaming.dynamicAllocation.enabled=true, but I don't see this configuration in the documentation. …
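
A hedged sketch: in Spark 2.0 the streaming-specific allocation manager added by SPARK-12133 is driven by spark.streaming.dynamicAllocation.* properties that never made it into the official docs, so the key names and defaults below are assumptions to verify against your Spark version's source. Core dynamic allocation must stay disabled for it to take effect.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-with-dynamic-allocation")
  // The batch mechanism conflicts with the streaming one and must be off.
  .set("spark.dynamicAllocation.enabled", "false")
  // Undocumented streaming allocation keys (assumed, per SPARK-12133).
  .set("spark.streaming.dynamicAllocation.enabled", "true")
  .set("spark.streaming.dynamicAllocation.minExecutors", "2")
  .set("spark.streaming.dynamicAllocation.maxExecutors", "10")
  .set("spark.streaming.dynamicAllocation.scalingInterval", "60") // seconds
```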

Kafka + Spark Streaming: constant delay of 1 second

Question: EDIT 2: Finally I have made my own producer using Java and it works well, so the problem is in the kafka-console-producer. The kafka-console-consumer works well. EDIT: I have already tried with version 0.9.0.1, and it has the same behaviour. I am working on my bachelor's final project, a comparison between Spark Streaming and Flink. In front of both frameworks I am using Kafka and a script to generate the data (explained below). My first test is to compare the latency between both frameworks with …
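
For reference, the "own producer" route that EDIT 2 describes can be as small as the sketch below. Pinning linger.ms to 0 is an assumption about where a console producer's roughly one-second batching delay could come from, and the broker address and topic are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("linger.ms", "0") // send immediately, do not wait to fill a batch
props.put("acks", "1")

val producer = new KafkaProducer[String, String](props)
for (i <- 1 to 1000) {
  // Embed a send timestamp so the consumer side can measure latency.
  producer.send(new ProducerRecord[String, String]("latency-test", i.toString,
    s"${System.currentTimeMillis()},payload-$i"))
}
producer.close()
```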

Scala Spark Filter RDD using Cassandra

Question: I am new to Spark-Cassandra and Scala. I have an existing RDD, let's say ((url_hash, url, created_timestamp)). I want to filter this RDD based on url_hash: if the url_hash exists in the Cassandra table, then I want to filter it out of the RDD, so I can process only the new URLs. The Cassandra table looks like the following: url_hash | url | created_timestamp | updated_timestamp. Any pointers would be great. I tried something like this: case class UrlInfoT(url_sha256: String, full_url: String, …
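
A hedged sketch using the spark-cassandra-connector: join the RDD against the table on the hash column to find rows that already exist, then subtract them, which avoids pulling the whole table into Spark. Keyspace, table, and column names are taken loosely from the question and may not match the real schema; it assumes url_hash is the partition key:

```scala
import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

def newUrlsOnly(urls: RDD[(String, String, Long)]): RDD[(String, String, Long)] = {
  val keyed = urls.map { case (hash, url, ts) => (hash, (url, ts)) }

  // Hashes that already exist in Cassandra (one targeted lookup per key).
  val existing = keyed
    .map { case (hash, _) => Tuple1(hash) }
    .joinWithCassandraTable("mykeyspace", "urls")
    .on(SomeColumns("url_hash"))
    .map { case (Tuple1(hash), _) => (hash, ()) }

  // Keep only rows whose hash was not found in the table.
  keyed.subtractByKey(existing)
    .map { case (hash, (url, ts)) => (hash, url, ts) }
}
```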

Why is my Spark streaming app so slow?

Question: I have a cluster with 4 nodes: 3 Spark nodes and 1 Solr node. Each CPU has 8 cores, memory is 32 GB, and the disks are SSDs. I use Cassandra as my database. My data volume is 22 GB after 6 hours, and I now have around 3.4 million rows, which should be read in under 5 minutes, but already it cannot complete the task in that amount of time. My future plan is to read 100 million rows in under 5 minutes. I am not sure what I can increase or do better to achieve this result now, as well as to achieve my …
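
As a hedged starting point rather than a diagnosis: the usual first knobs for a slow Cassandra scan are read parallelism and fetch size on the connector side, plus executor sizing on the Spark side. The values below are illustrative, and the spark.cassandra.* key names should be checked against the connector version in use:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("cassandra-read-tuning")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "8g")
  // Smaller splits -> more Spark partitions -> more parallel token-range scans.
  .set("spark.cassandra.input.split.size_in_mb", "32")
  // Rows fetched from Cassandra per round trip.
  .set("spark.cassandra.input.fetch.size_in_rows", "5000")
```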