apache-spark-2.0

Dynamic Allocation for Spark Streaming

无人久伴 submitted on 2019-12-22 05:16:14
Question: I have a Spark Streaming job running on our cluster alongside other jobs (Spark Core jobs), and I want to use Dynamic Resource Allocation for all of them, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark Streaming (in version 1.6.1), but it is fixed in 2.0.0 [JIRA link]. According to the PDF attached to that issue, there should be a configuration field called spark.streaming.dynamicAllocation.enabled=true, but I don't see this configuration in the documentation.
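The property the question refers to is set like any other Spark configuration, for example on the SparkConf before the streaming context is created. A minimal sketch, assuming Spark 2.0+ and a master supplied by spark-submit; the streaming-specific flag is generally used with the core spark.dynamicAllocation.enabled flag left off:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Enable the streaming-specific allocation backend; the core dynamic
# allocation flag stays disabled, since the two are not combined.
conf = (SparkConf()
        .setAppName("streaming-dynamic-allocation")
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.streaming.dynamicAllocation.enabled", "true"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches (placeholder value)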

Timeout Exception in Apache-Spark during program Execution

孤者浪人 submitted on 2019-12-22 04:12:12
Question: I am running a Bash script on macOS. The script calls a Spark method written in Scala a large number of times; I am currently trying to call it 100,000 times using a for loop. The code exits with the following exception after a small number of iterations, around 3,000:
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout …
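The exception itself names spark.executor.heartbeatInterval (default 10s) as the controlling setting. A common mitigation, though not necessarily a fix for whatever is overloading the driver, is to raise it together with spark.network.timeout, which has to stay larger than the heartbeat interval. A sketch with placeholder values:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("timeout-tuning")
         .config("spark.executor.heartbeatInterval", "60s")  # default is 10s
         .config("spark.network.timeout", "600s")            # must exceed the heartbeat interval
         .getOrCreate())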

java.lang.IllegalStateException: Error reading delta file, spark structured streaming with kafka

微笑、不失礼 submitted on 2019-12-21 02:47:51
Question: I am using Structured Streaming + Kafka for real-time data analytics in our project, on Spark 2.2 and Kafka 0.10.2. I am facing an issue when streaming queries recover from their checkpoints at application startup. There are multiple streaming queries derived from a single Kafka source, and every streaming query has its own checkpoint directory. In case of job failure, when we restart the job, some of the streaming queries fail to recover from their checkpoint …
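For context, the setup described in the question (several queries derived from one Kafka source, each with its own checkpoint directory) looks roughly like the sketch below; the broker, topic, paths, and sinks are placeholders, and the spark-sql-kafka-0-10 package is assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# One Kafka source shared by several streaming queries.
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "events")
            .load())

values = kafka_df.selectExpr("CAST(value AS STRING) AS value")

# Each derived query gets its own checkpoint directory, as in the question.
q1 = (values.writeStream
      .format("parquet")
      .option("path", "/data/out/q1")
      .option("checkpointLocation", "/checkpoints/q1")
      .start())

q2 = (values.writeStream
      .format("console")
      .option("checkpointLocation", "/checkpoints/q2")
      .start())

spark.streams.awaitAnyTermination()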

Spark fails to start in local mode when disconnected [Possible bug in handling IPv6 in Spark??]

丶灬走出姿态 submitted on 2019-12-20 22:07:13
Question: The problem is the same as described in "Error when starting spark-shell local on Mac" ... but I have failed to find a solution. I also used to get the malformed URI error, but now I get "expected hostname". When I am not connected to the internet, spark-shell fails to load in local mode [see the error below]. I am running Apache Spark 2.1.0, downloaded from the internet, on my Mac. I run ./bin/spark-shell and it gives me the error below. I have read the Spark code and it is using …
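The question concerns spark-shell itself, but the usual workaround for running local mode while offline is to pin the driver to the loopback address; the same settings can be passed to a programmatic session, as sketched below. This assumes Spark 2.1+, where spark.driver.bindAddress is available, and it is not confirmed that it resolves the reporter's specific IPv6 case:

from pyspark.sql import SparkSession

# Bind the driver to loopback so Spark does not try to resolve an external
# hostname while the machine has no network connection.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("offline-local-mode")
         .config("spark.driver.bindAddress", "127.0.0.1")
         .config("spark.driver.host", "127.0.0.1")
         .getOrCreate())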

Why does using cache on streaming Datasets fail with “AnalysisException: Queries with streaming sources must be executed with writeStream.start()”?

人走茶凉 submitted on 2019-12-18 03:31:05
Question:
SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/tmp/spark")
  .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
  .appName("my-test")
  .getOrCreate
  .readStream
  .schema(schema)
  .json("src/test/data")
  .cache
  .writeStream
  .start
  .awaitTermination
While executing this sample in Spark 2.1.0 I got an error. Without the .cache option it worked as intended, but with the .cache option I got:
Exception in thread "main" org.apache.spark.sql …
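The question notes that the same pipeline works once .cache is dropped. For reference, a PySpark rendering of that working variant is sketched below; the schema is a placeholder (the question does not show it) and a console sink is added so the snippet is self-contained:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = (SparkSession.builder
         .master("local[*]")
         .appName("my-test")
         .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
         .getOrCreate())

schema = StructType([StructField("value", StringType())])  # placeholder schema

# No .cache() in the chain: caching the streaming Dataset is what triggers
# the AnalysisException reported in the question.
query = (spark.readStream
         .schema(schema)
         .json("src/test/data")
         .writeStream
         .format("console")
         .start())

query.awaitTermination()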

Reading csv files with quoted fields containing embedded commas

纵饮孤独 submitted on 2019-12-18 03:04:24
Question: I am reading a CSV file in PySpark as follows:
df_raw = spark.read.option("header", "true").csv(csv_path)
However, the data file has quoted fields with embedded commas, which should not be treated as delimiters. How can I handle this in PySpark? I know pandas can handle this, but can Spark? The version I am using is Spark 2.0.0. Here is an example that works in pandas but fails using Spark:
In [1]: import pandas as pd
In [2]: pdf = pd.read_csv('malformed_data.csv')
In [3]: sdf = spark.read …
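Spark's CSV reader does honor quoting; the options usually involved here are quote and escape. A sketch, assuming the file wraps fields in double quotes and doubles them to escape embedded quotes (which is what pandas assumes by default):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-quoted-fields").getOrCreate()
csv_path = "malformed_data.csv"  # placeholder, matching the question's pandas example

df_raw = (spark.read
          .option("header", "true")
          .option("quote", '"')   # quote character (this is already the default)
          .option("escape", '"')  # treat a doubled quote inside a field as a literal quote
          .csv(csv_path))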

How to convert RDD of dense vector into DataFrame in pyspark?

江枫思渺然 submitted on 2019-12-17 18:56:36
Question: I have an RDD of DenseVector like this:
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]),
 DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
 DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
 DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]
I want to convert this into a DataFrame. I tried like this:
>>> spark.createDataFrame …
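createDataFrame cannot infer a schema from bare DenseVector elements; wrapping each vector in a one-element tuple (or Row) is the usual way to get a single vector column. A sketch that rebuilds a small stand-in for the question's RDD:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import DenseVector

spark = SparkSession.builder.appName("densevector-to-df").getOrCreate()

# Shortened stand-in for the question's frequencyDenseVectors RDD.
frequencyDenseVectors = spark.sparkContext.parallelize([
    DenseVector([1.0, 0.0, 1.0]),
    DenseVector([0.0, 1.0, 1.0]),
])

# Wrap each vector in a tuple so a single 'features' column can be inferred.
df = frequencyDenseVectors.map(lambda v: (v,)).toDF(["features"])
df.show(truncate=False)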

Spark 2.0 Dataset vs DataFrame

核能气质少年 submitted on 2019-12-16 20:15:32
Question: Starting out with Spark 2.0.1, I have some questions. I have read a lot of documentation but so far could not find sufficient answers. What is the difference between df.select("foo") and df.select($"foo")? Do I understand correctly that myDataSet.map(foo.someVal) is typesafe and will not convert into an RDD but stay in the Dataset representation, with no additional overhead (performance-wise, for 2.0.0)? All the other commands, e.g. select, are just syntactic sugar; they are not typesafe and a map could be used …
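PySpark has no typed Dataset API, so only the select part of the question translates directly; in Scala, $"foo" builds a Column, which corresponds to pyspark.sql.functions.col("foo"), while "foo" is a plain string. A small sketch (the example DataFrame is made up) showing that the two forms select the same column, and that the Column form additionally supports expressions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("select-forms").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["foo", "bar"])

df.select("foo").show()        # string form
df.select(col("foo")).show()   # Column form, the analogue of Scala's $"foo"

# A Column allows expressions that a bare string cannot:
df.select((col("foo") + 1).alias("foo_plus_one")).show()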

PySpark Streaming process failed with await termination

丶灬走出姿态 submitted on 2019-12-12 01:16:17
Question: Here is the streaming code which I run. After running for two days, it stops automatically. Did I miss something?
def streaming_setup():
    stream = StreamingContext(sc.sparkContext, 10)
    stream.checkpoint(config['checkpointPath'])
    lines_data = stream.textFileStream(monitor_directory)
    lines_data.foreachRDD(persist_file)
    return stream
The Spark Streaming session is started here:
ssc = StreamingContext.getOrCreate(config['checkpointPath'], lambda: streaming_setup())
ssc = streaming_setup()
ssc.start()
ssc …
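One detail that stands out in the snippet is that the context returned by getOrCreate is immediately overwritten by a fresh streaming_setup() call, so the checkpoint is never actually reused; whether that explains the stop after two days is not certain. For comparison, a sketch of the usual getOrCreate pattern, with placeholders standing in for the question's config, monitor_directory, and persist_file:

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("file-stream").getOrCreate()

config = {'checkpointPath': '/tmp/stream-checkpoint'}  # placeholder
monitor_directory = '/tmp/incoming'                    # placeholder

def persist_file(rdd):
    # Placeholder for the question's persist_file logic.
    print(rdd.count())

def streaming_setup():
    # Build a fresh context only when no checkpoint exists yet.
    stream = StreamingContext(spark.sparkContext, 10)
    stream.checkpoint(config['checkpointPath'])
    stream.textFileStream(monitor_directory).foreachRDD(persist_file)
    return stream

# Restore from the checkpoint if present, otherwise call streaming_setup() once.
ssc = StreamingContext.getOrCreate(config['checkpointPath'], streaming_setup)
ssc.start()
ssc.awaitTermination()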

Specifying custom profilers for PySpark running Spark 2.0

我怕爱的太早我们不能终老 submitted on 2019-12-12 00:48:42
Question: I would like to know how to specify a custom profiler class in PySpark for Spark version 2+. Under 1.6, I know I can do so like this:
sc = SparkContext('local', 'test', profiler_cls='MyProfiler')
but when I create the SparkSession in 2.0 I don't explicitly have access to the SparkContext. Can someone please advise how to do this for Spark 2.0+?
Answer 1: SparkSession can be initialized with an existing SparkContext, for example:
from pyspark import SparkContext
from pyspark.sql import …
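The answer is cut off above; a sketch of the pattern it starts to describe, with the SparkContext created first (carrying profiler_cls) and then wrapped in a SparkSession, might look like the following. MyCustomProfiler is a hypothetical profiler, and spark.python.profile is enabled so profiles are actually collected:

from pyspark import SparkConf, SparkContext, BasicProfiler
from pyspark.sql import SparkSession

class MyCustomProfiler(BasicProfiler):
    # Hypothetical profiler that only customizes how results are reported.
    def show(self, id):
        print("Custom profile for RDD %s" % id)

conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext('local', 'profiler-test', conf=conf, profiler_cls=MyCustomProfiler)
spark = SparkSession(sc)  # the session reuses the context created above

spark.sparkContext.parallelize(range(100)).count()
sc.show_profiles()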