spark-streaming

Spark - Non-time-based windows are not supported on streaming DataFrames/Datasets;

Question: I need to write a Spark SQL query with an inner select and PARTITION BY. The problem is that I get an AnalysisException. I have already spent a few hours on this, but other approaches have brought no success. Exception: Exception in thread "main" org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;; Window [sum(cast(_w0#41 as bigint)) windowspecdefinition(deviceId#28, timestamp#30 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS …
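
The error comes from applying an analytic window function (over partitionBy/orderBy) to a streaming Dataset, which Structured Streaming does not support; only time-based windows via the window() function inside a groupBy are allowed. A minimal sketch of both the rejected and the accepted pattern, using a toy rate source and column names (deviceId, eventTime, value) taken loosely from the error message:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("window-sketch").master("local[2]").getOrCreate()

// A toy rate source standing in for the real device feed; column names are illustrative.
val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
  .select((col("value") % 5).as("deviceId"),
          col("timestamp").as("eventTime"),
          lit(1L).as("value"))

// Rejected on a streaming Dataset: an analytic (row/range) window function.
//   val w = Window.partitionBy("deviceId").orderBy("eventTime")
//   events.withColumn("running", sum("value").over(w))   // -> AnalysisException

// Accepted: a time-based window inside groupBy, ideally with a watermark.
val perDevice = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(col("deviceId"), window(col("eventTime"), "5 minutes"))
  .agg(sum("value").as("sumValue"))
```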

How to load history data when starting a Spark Streaming process, and calculate running aggregations

Question: I have some sales-related JSON data in my ElasticSearch cluster, and I would like to use Spark Streaming (with Spark 1.4.1) to dynamically aggregate incoming sales events from my eCommerce website via Kafka, so that the user has a current view of their total sales (in terms of revenue and products). What's not really clear to me from the docs I read is how I can load the history data from ElasticSearch when the Spark application starts, and then calculate, for example, the overall revenue per user …
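
One pattern that fits Spark 1.4 is updateStateByKey with an initial-state RDD: load the historical totals once at startup, seed the streaming state with them, and let each micro-batch update the running totals. A minimal sketch, assuming the history can be materialized as an RDD[(userId, revenue)]; the Elasticsearch load and the Kafka source are replaced by stand-ins here:

```scala
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("running-revenue").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/revenue-checkpoint") // required by updateStateByKey

// 1) Seed state from history. Hard-coded stand-in; in the real job this
//    would come from the elasticsearch-hadoop connector.
val history = ssc.sparkContext.parallelize(Seq(("alice", 120.0), ("bob", 45.5)))

// 2) Incoming events as (userId, saleAmount); a Kafka DStream in the real job.
val sales = ssc.socketTextStream("localhost", 9999)
  .map(_.split(","))
  .map(a => (a(0), a(1).toDouble))

// 3) Running total = previous state (seeded from history) + new batch.
val updateRevenue = (batch: Seq[Double], state: Option[Double]) =>
  Some(state.getOrElse(0.0) + batch.sum)

val totals = sales.updateStateByKey(
  updateRevenue, new HashPartitioner(ssc.sparkContext.defaultParallelism), history)

totals.print()
ssc.start(); ssc.awaitTermination()
```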

Spark serialization error when inserting Spark Streaming data into HBase

Question: I'm confused about how Spark interacts with HBase in terms of data format. For instance, when I omit the 'ERROR' line in the following code snippet, it runs well, but when I add the line, I get an error saying 'Task not serializable', related to a serialization issue. How do I change the code, and why does the error happen? My code is the following: // HBase Configuration hconfig = HBaseConfiguration.create(); hconfig.set("hbase.zookeeper.property.clientPort", "2222"); hconfig.set …
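
The usual cause is that the closure shipped to the executors captures a non-serializable object, here the HBase Configuration (and anything built from it) created on the driver. A common fix is to create the configuration and connection inside foreachPartition, so they live entirely on the executor. A minimal sketch against the HBase 1.x client API, with illustrative table and column names:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

def writeToHBase(stream: DStream[(String, String)]): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Created on the executor, once per partition -- never serialized.
      val hconfig = HBaseConfiguration.create()
      hconfig.set("hbase.zookeeper.property.clientPort", "2222")
      val conn  = ConnectionFactory.createConnection(hconfig)
      val table = conn.getTable(TableName.valueOf("events"))
      records.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(value))
        table.put(put)
      }
      table.close(); conn.close()
    }
  }
```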

Can a model be created in Spark batch and used in Spark Streaming?

Question: Can I create a model in Spark batch and use it in Spark Streaming for real-time processing? I have seen the various examples on the Apache Spark site where both training and prediction are built on the same type of processing (linear regression). Answer 1: Can I create a model in Spark batch and use it in Spark Streaming for real-time processing? Of course, yes. In the Spark community this is called offline training, online prediction. Many training algorithms in Spark allow you to save the model to a file …
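
A minimal sketch of the "offline training, online prediction" split with the RDD-based MLlib API (matching the era of the question); the paths and the CSV parsing are illustrative:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
import org.apache.spark.streaming.StreamingContext

// --- Batch job: train once, persist the model. ---
def train(sc: SparkContext): Unit = {
  val data = sc.textFile("hdfs:///training.csv").map { line =>
    val cols = line.split(',').map(_.toDouble)
    LabeledPoint(cols.head, Vectors.dense(cols.tail))
  }
  val model = LinearRegressionWithSGD.train(data, numIterations = 100)
  model.save(sc, "hdfs:///models/linreg")
}

// --- Streaming job: load once on the driver, score each micro-batch. ---
def serve(ssc: StreamingContext): Unit = {
  val model = LinearRegressionModel.load(ssc.sparkContext, "hdfs:///models/linreg")
  ssc.socketTextStream("localhost", 9999)
    .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
    .map(v => (v, model.predict(v))) // the model is serializable, shipped in the closure
    .print()
}
```

The same split works with the newer spark.ml Pipeline API (PipelineModel.save/load) if the job is on Spark 2.x.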

How to convert RDD to DataFrame in Spark Streaming, not just Spark

Question: How can I convert an RDD to a DataFrame in Spark Streaming, not just in Spark? I saw this example, but it requires a SparkContext: val sqlContext = new SQLContext(sc) import sqlContext.implicits._ rdd.toDF() In my case I have a StreamingContext. Should I then create a SparkContext inside foreach? That looks too crazy... So how do I deal with this issue? My final goal (if it might be useful) is to save the DataFrame to Amazon S3 using rdd.toDF.write.format("json").saveAsTextFile("s3://iiiii/ttttt.json"); …
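
The usual answer is that you don't create a new SparkContext: every RDD handed to foreachRDD already carries one, and SQLContext.getOrCreate (Spark 1.5+) returns a singleton built from it. A minimal sketch with an illustrative case class and a placeholder S3 path; note that write.json already writes JSON files, so saveAsTextFile is not needed:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Event(id: String, value: Double)

def save(stream: DStream[Event]): Unit =
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // The RDD knows its SparkContext; getOrCreate reuses one SQLContext.
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      rdd.toDF().write.json(s"s3://bucket/path/batch-${System.currentTimeMillis}")
    }
  }
```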

How can I make Spark Streaming count the words in a file in a unit test?

Question: I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala. When I submit this application to my local Spark, it waits for a file to be written to a given directory, and when I create that file it successfully prints the number of words. I terminate the application by pressing Ctrl+C. Now I've tried to create a very basic unit test for this functionality, but in the test I was not able to print the same information, that is, the …
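
A common testing pattern is to replace the file-based source with queueStream, push known RDDs through the same transformations, and collect the per-batch output into a driver-side buffer that the test can assert on. A minimal self-contained sketch (written as a plain main rather than a JUnit test, and in Scala rather than the question's Java):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object WordCountSpec {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("wordcount-test")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Known input instead of watching a directory.
    val input   = mutable.Queue(ssc.sparkContext.parallelize(Seq("a b a")))
    val results = mutable.ArrayBuffer.empty[(String, Int)]

    ssc.queueStream(input)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .foreachRDD(rdd => results.synchronized { results ++= rdd.collect() })

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // run a few batches, then stop
    ssc.stop(stopSparkContext = true, stopGracefully = false)

    assert(results.toMap == Map("a" -> 2, "b" -> 1), s"unexpected: $results")
    println("test passed")
  }
}
```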

Dynamic Allocation for Spark Streaming

Question: I have a Spark Streaming job running on our cluster alongside other jobs (Spark Core jobs). I want to use dynamic resource allocation for these jobs, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark Streaming (in version 1.6.1), but it is fixed in 2.0.0. JIRA link According to the PDF in this issue, there should be a configuration field called spark.streaming.dynamicAllocation.enabled=true, but I don't see this configuration in the documentation. …
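
A hedged sketch: in Spark 2.0 the streaming-specific allocation manager added by SPARK-12133 is driven by spark.streaming.dynamicAllocation.* properties that never made it into the official docs, so the key names and defaults below are assumptions to verify against your Spark version's source. Core dynamic allocation must stay disabled for it to take effect.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-with-dynamic-allocation")
  // The batch mechanism conflicts with the streaming one and must be off.
  .set("spark.dynamicAllocation.enabled", "false")
  // Undocumented streaming allocation keys (assumed, per SPARK-12133).
  .set("spark.streaming.dynamicAllocation.enabled", "true")
  .set("spark.streaming.dynamicAllocation.minExecutors", "2")
  .set("spark.streaming.dynamicAllocation.maxExecutors", "10")
  .set("spark.streaming.dynamicAllocation.scalingInterval", "60") // seconds
```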

Kafka + Spark Streaming: constant delay of 1 second

Question: EDIT 2: Finally I have made my own producer using Java and it works well, so the problem is in the kafka-console-producer. The kafka-console-consumer works well. EDIT: I have already tried with version 0.9.0.1, and it has the same behaviour. I am working on my bachelor's final project, a comparison between Spark Streaming and Flink. In front of both frameworks I am using Kafka and a script to generate the data (explained below). My first test is to compare the latency between both frameworks with …
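
For reference, the "own producer" route that EDIT 2 describes can be as small as the sketch below. Pinning linger.ms to 0 is an assumption about where a console producer's roughly one-second batching delay could come from, and the broker address and topic are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("linger.ms", "0") // send immediately, do not wait to fill a batch
props.put("acks", "1")

val producer = new KafkaProducer[String, String](props)
for (i <- 1 to 1000) {
  // Embed a send timestamp so the consumer side can measure latency.
  producer.send(new ProducerRecord[String, String]("latency-test", i.toString,
    s"${System.currentTimeMillis()},payload-$i"))
}
producer.close()
```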

Scala Spark Filter RDD using Cassandra

Question: I am new to Spark-Cassandra and Scala. I have an existing RDD, let's say ((url_hash, url, created_timestamp)). I want to filter this RDD based on url_hash: if the url_hash exists in the Cassandra table, then I want to filter it out of the RDD, so I can process only the new URLs. The Cassandra table looks like the following: url_hash | url | created_timestamp | updated_timestamp. Any pointers would be great. I tried something like this: case class UrlInfoT(url_sha256: String, full_url: String, …
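
A hedged sketch using the spark-cassandra-connector: join the RDD against the table on the hash column to find rows that already exist, then subtract them, which avoids pulling the whole table into Spark. Keyspace, table, and column names are taken loosely from the question and may not match the real schema; it assumes url_hash is the partition key:

```scala
import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

def newUrlsOnly(urls: RDD[(String, String, Long)]): RDD[(String, String, Long)] = {
  val keyed = urls.map { case (hash, url, ts) => (hash, (url, ts)) }

  // Hashes that already exist in Cassandra (one targeted lookup per key).
  val existing = keyed
    .map { case (hash, _) => Tuple1(hash) }
    .joinWithCassandraTable("mykeyspace", "urls")
    .on(SomeColumns("url_hash"))
    .map { case (Tuple1(hash), _) => (hash, ()) }

  // Keep only rows whose hash was not found in the table.
  keyed.subtractByKey(existing)
    .map { case (hash, (url, ts)) => (hash, url, ts) }
}
```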

Why is my Spark streaming app so slow?

Question: I have a cluster with 4 nodes: 3 Spark nodes and 1 Solr node. Each CPU has 8 cores, memory is 32 GB, and the disks are SSDs. I use Cassandra as my database. My data volume is 22 GB after 6 hours, and I now have around 3.4 million rows, which should be read in under 5 minutes, but already it cannot complete the task in that amount of time. My future plan is to read 100 million rows in under 5 minutes. I am not sure what I can increase or do better to achieve this result now, as well as to achieve my …
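
As a hedged starting point rather than a diagnosis: the usual first knobs for a slow Cassandra scan are read parallelism and fetch size on the connector side, plus executor sizing on the Spark side. The values below are illustrative, and the spark.cassandra.* key names should be checked against the connector version in use:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("cassandra-read-tuning")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "8g")
  // Smaller splits -> more Spark partitions -> more parallel token-range scans.
  .set("spark.cassandra.input.split.size_in_mb", "32")
  // Rows fetched from Cassandra per round trip.
  .set("spark.cassandra.input.fetch.size_in_rows", "5000")
```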