apache-spark-2.0

Dynamic Allocation for Spark Streaming

无人久伴 submitted on 2019-12-22 05:16:14
Question: I have a Spark Streaming job running on our cluster alongside other jobs (Spark Core jobs), and I want to use Dynamic Resource Allocation for all of them, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark Streaming (in version 1.6.1), but it is fixed in 2.0.0 [JIRA link]. According to the PDF attached to that issue, there should be a configuration field called spark.streaming.dynamicAllocation.enabled=true, but I don't see this configuration in the documentation.
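The property the question refers to is set like any other Spark configuration, for example on the SparkConf before the streaming context is created. A minimal sketch, assuming Spark 2.0+ and a master supplied by spark-submit; the streaming-specific flag is generally used with the core spark.dynamicAllocation.enabled flag left off:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Enable the streaming-specific allocation backend; the core dynamic
# allocation flag stays disabled, since the two are not combined.
conf = (SparkConf()
        .setAppName("streaming-dynamic-allocation")
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.streaming.dynamicAllocation.enabled", "true"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches (placeholder value)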

Timeout Exception in Apache-Spark during program Execution

孤者浪人 submitted on 2019-12-22 04:12:12
Question: I am running a Bash script on macOS. The script calls a Spark method written in Scala a large number of times; I am currently trying to call it 100,000 times using a for loop. The code exits with the following exception after a small number of iterations, around 3,000:
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout …
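The exception itself names spark.executor.heartbeatInterval (default 10s) as the controlling setting. A common mitigation, though not necessarily a fix for whatever is overloading the driver, is to raise it together with spark.network.timeout, which has to stay larger than the heartbeat interval. A sketch with placeholder values:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("timeout-tuning")
         .config("spark.executor.heartbeatInterval", "60s")  # default is 10s
         .config("spark.network.timeout", "600s")            # must exceed the heartbeat interval
         .getOrCreate())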

java.lang.IllegalStateException: Error reading delta file, spark structured streaming with kafka

微笑、不失礼 submitted on 2019-12-21 02:47:51
Question: I am using Structured Streaming + Kafka for real-time data analytics in our project, on Spark 2.2 and Kafka 0.10.2. I am facing an issue when streaming queries recover from their checkpoints at application startup. There are multiple streaming queries derived from a single Kafka source, and every streaming query has its own checkpoint directory. In case of job failure, when we restart the job, some of the streaming queries fail to recover from their checkpoint …
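For context, the setup described in the question (several queries derived from one Kafka source, each with its own checkpoint directory) looks roughly like the sketch below; the broker, topic, paths, and sinks are placeholders, and the spark-sql-kafka-0-10 package is assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# One Kafka source shared by several streaming queries.
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "events")
            .load())

values = kafka_df.selectExpr("CAST(value AS STRING) AS value")

# Each derived query gets its own checkpoint directory, as in the question.
q1 = (values.writeStream
      .format("parquet")
      .option("path", "/data/out/q1")
      .option("checkpointLocation", "/checkpoints/q1")
      .start())

q2 = (values.writeStream
      .format("console")
      .option("checkpointLocation", "/checkpoints/q2")
      .start())

spark.streams.awaitAnyTermination()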

Spark fails to start in local mode when disconnected [Possible bug in handling IPv6 in Spark??]

丶灬走出姿态 submitted on 2019-12-20 22:07:13
Question: The problem is the same as described in "Error when starting spark-shell local on Mac" ... but I have failed to find a solution. I also used to get the malformed URI error, but now I get "expected hostname". When I am not connected to the internet, spark-shell fails to load in local mode [see the error below]. I am running Apache Spark 2.1.0, downloaded from the internet, on my Mac. I run ./bin/spark-shell and it gives me the error below. I have read the Spark code and it is using …
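The question concerns spark-shell itself, but the usual workaround for running local mode while offline is to pin the driver to the loopback address; the same settings can be passed to a programmatic session, as sketched below. This assumes Spark 2.1+, where spark.driver.bindAddress is available, and it is not confirmed that it resolves the reporter's specific IPv6 case:

from pyspark.sql import SparkSession

# Bind the driver to loopback so Spark does not try to resolve an external
# hostname while the machine has no network connection.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("offline-local-mode")
         .config("spark.driver.bindAddress", "127.0.0.1")
         .config("spark.driver.host", "127.0.0.1")
         .getOrCreate())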

Why does using cache on streaming Datasets fail with “AnalysisException: Queries with streaming sources must be executed with writeStream.start()”?

人走茶凉 submitted on 2019-12-18 03:31:05
Question:
SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/tmp/spark")
  .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
  .appName("my-test")
  .getOrCreate
  .readStream
  .schema(schema)
  .json("src/test/data")
  .cache
  .writeStream
  .start
  .awaitTermination
While executing this sample in Spark 2.1.0 I got an error. Without the .cache option it worked as intended, but with the .cache option I got:
Exception in thread "main" org.apache.spark.sql …
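The question notes that the same pipeline works once .cache is dropped. For reference, a PySpark rendering of that working variant is sketched below; the schema is a placeholder (the question does not show it) and a console sink is added so the snippet is self-contained:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = (SparkSession.builder
         .master("local[*]")
         .appName("my-test")
         .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
         .getOrCreate())

schema = StructType([StructField("value", StringType())])  # placeholder schema

# No .cache() in the chain: caching the streaming Dataset is what triggers
# the AnalysisException reported in the question.
query = (spark.readStream
         .schema(schema)
         .json("src/test/data")
         .writeStream
         .format("console")
         .start())

query.awaitTermination()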

Reading csv files with quoted fields containing embedded commas

纵饮孤独 submitted on 2019-12-18 03:04:24
Question: I am reading a CSV file in PySpark as follows:
df_raw = spark.read.option("header", "true").csv(csv_path)
However, the data file has quoted fields with embedded commas, which should not be treated as delimiters. How can I handle this in PySpark? I know pandas can handle this, but can Spark? The version I am using is Spark 2.0.0. Here is an example that works in pandas but fails using Spark:
In [1]: import pandas as pd
In [2]: pdf = pd.read_csv('malformed_data.csv')
In [3]: sdf = spark.read …
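Spark's CSV reader does honor quoting; the options usually involved here are quote and escape. A sketch, assuming the file wraps fields in double quotes and doubles them to escape embedded quotes (which is what pandas assumes by default):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-quoted-fields").getOrCreate()
csv_path = "malformed_data.csv"  # placeholder, matching the question's pandas example

df_raw = (spark.read
          .option("header", "true")
          .option("quote", '"')   # quote character (this is already the default)
          .option("escape", '"')  # treat a doubled quote inside a field as a literal quote
          .csv(csv_path))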

How to convert RDD of dense vector into DataFrame in pyspark?

江枫思渺然 submitted on 2019-12-17 18:56:36
Question: I have an RDD of DenseVector like this:
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]),
 DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
 DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
 DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]
I want to convert this into a DataFrame. I tried like this:
>>> spark.createDataFrame …
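createDataFrame cannot infer a schema from bare DenseVector elements; wrapping each vector in a one-element tuple (or Row) is the usual way to get a single vector column. A sketch that rebuilds a small stand-in for the question's RDD:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import DenseVector

spark = SparkSession.builder.appName("densevector-to-df").getOrCreate()

# Shortened stand-in for the question's frequencyDenseVectors RDD.
frequencyDenseVectors = spark.sparkContext.parallelize([
    DenseVector([1.0, 0.0, 1.0]),
    DenseVector([0.0, 1.0, 1.0]),
])

# Wrap each vector in a tuple so a single 'features' column can be inferred.
df = frequencyDenseVectors.map(lambda v: (v,)).toDF(["features"])
df.show(truncate=False)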

Spark 2.0 Dataset vs DataFrame

核能气质少年 submitted on 2019-12-16 20:15:32
Question: Starting out with Spark 2.0.1, I have some questions. I have read a lot of documentation but so far could not find sufficient answers. What is the difference between df.select("foo") and df.select($"foo")? Do I understand correctly that myDataSet.map(foo.someVal) is typesafe and will not convert into an RDD but stay in the Dataset representation, with no additional overhead (performance-wise, for 2.0.0)? All the other commands, e.g. select, are just syntactic sugar; they are not typesafe and a map could be used …
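PySpark has no typed Dataset API, so only the select part of the question translates directly; in Scala, $"foo" builds a Column, which corresponds to pyspark.sql.functions.col("foo"), while "foo" is a plain string. A small sketch (the example DataFrame is made up) showing that the two forms select the same column, and that the Column form additionally supports expressions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("select-forms").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["foo", "bar"])

df.select("foo").show()        # string form
df.select(col("foo")).show()   # Column form, the analogue of Scala's $"foo"

# A Column allows expressions that a bare string cannot:
df.select((col("foo") + 1).alias("foo_plus_one")).show()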

PySpark Streaming process failed with await termination

丶灬走出姿态 submitted on 2019-12-12 01:16:17
Question: Here is the streaming code which I run. After running for two days, it stops automatically. Did I miss something?
def streaming_setup():
    stream = StreamingContext(sc.sparkContext, 10)
    stream.checkpoint(config['checkpointPath'])
    lines_data = stream.textFileStream(monitor_directory)
    lines_data.foreachRDD(persist_file)
    return stream
The Spark Streaming session is started here:
ssc = StreamingContext.getOrCreate(config['checkpointPath'], lambda: streaming_setup())
ssc = streaming_setup()
ssc.start()
ssc …
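One detail that stands out in the snippet is that the context returned by getOrCreate is immediately overwritten by a fresh streaming_setup() call, so the checkpoint is never actually reused; whether that explains the stop after two days is not certain. For comparison, a sketch of the usual getOrCreate pattern, with placeholders standing in for the question's config, monitor_directory, and persist_file:

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("file-stream").getOrCreate()

config = {'checkpointPath': '/tmp/stream-checkpoint'}  # placeholder
monitor_directory = '/tmp/incoming'                    # placeholder

def persist_file(rdd):
    # Placeholder for the question's persist_file logic.
    print(rdd.count())

def streaming_setup():
    # Build a fresh context only when no checkpoint exists yet.
    stream = StreamingContext(spark.sparkContext, 10)
    stream.checkpoint(config['checkpointPath'])
    stream.textFileStream(monitor_directory).foreachRDD(persist_file)
    return stream

# Restore from the checkpoint if present, otherwise call streaming_setup() once.
ssc = StreamingContext.getOrCreate(config['checkpointPath'], streaming_setup)
ssc.start()
ssc.awaitTermination()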

Specifying custom profilers for PySpark running Spark 2.0

我怕爱的太早我们不能终老 submitted on 2019-12-12 00:48:42
Question: I would like to know how to specify a custom profiler class in PySpark for Spark version 2+. Under 1.6, I know I can do so like this:
sc = SparkContext('local', 'test', profiler_cls='MyProfiler')
but when I create the SparkSession in 2.0 I don't explicitly have access to the SparkContext. Can someone please advise how to do this for Spark 2.0+?
Answer 1: SparkSession can be initialized with an existing SparkContext, for example:
from pyspark import SparkContext
from pyspark.sql import …
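The answer is cut off above; a sketch of the pattern it starts to describe, with the SparkContext created first (carrying profiler_cls) and then wrapped in a SparkSession, might look like the following. MyCustomProfiler is a hypothetical profiler, and spark.python.profile is enabled so profiles are actually collected:

from pyspark import SparkConf, SparkContext, BasicProfiler
from pyspark.sql import SparkSession

class MyCustomProfiler(BasicProfiler):
    # Hypothetical profiler that only customizes how results are reported.
    def show(self, id):
        print("Custom profile for RDD %s" % id)

conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext('local', 'profiler-test', conf=conf, profiler_cls=MyCustomProfiler)
spark = SparkSession(sc)  # the session reuses the context created above

spark.sparkContext.parallelize(range(100)).count()
sc.show_profiles()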