apache-spark-1.3

Running tasks in parallel on separate Hive partitions using Scala and Spark to speed up loading from Hive and writing results to Hive or Parquet

Submitted by 左心房为你撑大大i on 2019-12-24 16:12:18
Question: This question is a spin-off from [this one] (saving a list of rows to a Hive table in pyspark). EDIT: please see my update edits at the bottom of this post. I have used both Scala and now PySpark to do the same task, but I am having problems with very slow saves of a DataFrame to Parquet or CSV, or with converting a DataFrame to a list or array-type data structure. Below is the relevant Python/PySpark code and info: #Table is a List of Rows from small Hive table I loaded using #query = "SELECT *
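
As a point of reference, below is a minimal Scala sketch of the kind of per-partition parallelism the title describes: one Spark job per Hive partition, each writing its own Parquet output. The table name my_db.my_table, the partition column part_col, the partition values, and the output path are all made up for illustration; this is not the asker's code.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc) // sc: the SparkContext already available in spark-shell

// Hypothetical partition values; in practice they could come from SHOW PARTITIONS.
val partitions = Seq("2015-01-01", "2015-01-02", "2015-01-03")

// Launch one concurrent job per Hive partition; each filters its partition and writes Parquet.
val jobs = partitions.map { p =>
  Future {
    val df = hc.sql(s"SELECT * FROM my_db.my_table WHERE part_col = '$p'")
    df.saveAsParquetFile(s"/tmp/output/part_col=$p") // Spark 1.3 DataFrame API
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)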

How to get an Iterator of Rows using Dataframe in SparkSQL

Submitted by 若如初见. on 2019-12-18 05:57:08
Question: I have an application in SparkSQL which returns a large number of rows that are very difficult to fit in memory, so I will not be able to use the collect function on the DataFrame. Is there a way by which I can get all these rows as an Iterable instead of the entire rows as a list? Note: I am executing this SparkSQL application using yarn-client. Answer 1: Generally speaking, transferring all the data to the driver looks like a pretty bad idea, and most of the time there is a better solution out there, but if you
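
A sketch of the approach usually suggested here, assuming a DataFrame named df and that fetching one partition at a time to the driver is acceptable: go through the underlying RDD's local iterator instead of collect.

// toLocalIterator materializes one partition at a time on the driver,
// so only a single partition has to fit in driver memory at once.
// It runs one Spark job per partition as the iterator is consumed.
val rows: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator

rows.take(5).foreach(println)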

Spark 1.3.0: ExecutorLostFailure depending on input file size

Submitted by 谁说胖子不能爱 on 2019-12-10 10:32:47
Question: I'm trying to run a simple Python application on a 2-node cluster I set up in standalone mode: a master and a worker, where the master also takes on the role of a worker. In the following code I'm trying to count the number of times "cakes" occurs in a 500MB text file, and it fails with an ExecutorLostFailure. Interestingly, the application runs through if I use a 100MB input file. I used the package version of CDH 5.4.4 with YARN and I'm running Spark 1.3.0. Each node has 8GB of memory and these
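
For context, a minimal Scala equivalent of the kind of count being described (the asker's actual code is Python and is not reproduced here; the input path is a placeholder):

// Count how many whitespace-separated tokens in the file equal "cakes".
val lines = sc.textFile("hdfs:///data/input_500mb.txt")
val cakeCount = lines.flatMap(_.split("\\s+"))
                     .filter(_.equalsIgnoreCase("cakes"))
                     .count()
println(s"cakes: $cakeCount")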

Spark 1.3.0: ExecutorLostFailure depending on input file size

Submitted by 我与影子孤独终老i on 2019-12-06 14:26:08
I'm trying to run a simple Python application on a 2-node cluster I set up in standalone mode: a master and a worker, where the master also takes on the role of a worker. In the following code I'm trying to count the number of times "cakes" occurs in a 500MB text file, and it fails with an ExecutorLostFailure. Interestingly, the application runs through if I use a 100MB input file. I used the package version of CDH 5.4.4 with YARN and I'm running Spark 1.3.0. Each node has 8GB of memory and these are some of my configurations: executor memory: 4g, driver memory: 2g, number of cores per worker: 1
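
The configuration values listed above would normally be passed either to spark-submit or set on the SparkConf; a hedged Scala sketch of the latter (the property keys are standard Spark settings, the application name is made up):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cake-count")               // hypothetical application name
  .set("spark.executor.memory", "4g")     // executor memory: 4g
  .set("spark.driver.memory", "2g")       // driver memory: 2g (only effective before the driver JVM starts)
  .set("spark.executor.cores", "1")       // one core per executor
val sc = new SparkContext(conf)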

Spark SQL + Window + Streaming Issue - Spark SQL query is taking a long time to execute when running with Spark Streaming

Submitted by 妖精的绣舞 on 2019-12-06 07:19:50
Question: We are looking to implement a use case using Spark Streaming (with Flume) and Spark SQL with windowing that allows us to perform CEP calculations over a set of data (see below for how the data is captured and used). The idea is to use SQL to perform some action which matches certain conditions. Executing the query on each incoming event batch seems to be very slow (and it gets slower as it progresses). Here, slow means, say, I have configured a window size of 600 seconds and a batch interval of 20 seconds.
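
A rough sketch of the general pattern being described: a windowed DStream whose contents are registered as a temporary table and queried with Spark SQL on every slide. The socket source, the Event schema, and the WHERE condition are stand-ins (the question uses Flume, e.g. via FlumeUtils.createStream); this is not the poster's code.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.SQLContext

case class Event(id: String, value: Double)

val ssc = new StreamingContext(sc, Seconds(20))                 // 20-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)             // stand-in for the Flume stream
val events = lines.map(_.split(",")).map(a => Event(a(0), a(1).toDouble))

// Keep the last 600 seconds of events and run a SQL query on every 20-second slide.
events.window(Seconds(600), Seconds(20)).foreachRDD { rdd =>
  val sqlContext = new SQLContext(rdd.sparkContext)
  import sqlContext.implicits._
  rdd.toDF().registerTempTable("events_window")
  sqlContext.sql("SELECT id, value FROM events_window WHERE value > 100") // hypothetical condition
    .collect().foreach(println)
}
ssc.start()
ssc.awaitTermination()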

How to view the logs of a spark job after it has completed and the context is closed?

Submitted by 旧街凉风 on 2019-12-06 02:47:53
Question: I am running PySpark, Spark 1.3, standalone mode, client mode. I am trying to investigate my Spark jobs by looking at jobs from the past and comparing them. I want to view their logs, the configuration settings under which the jobs were submitted, etc. But I'm running into trouble viewing the logs of jobs after the context is closed. When I submit a job, of course, I open a Spark context. While the job is running, I'm able to open the Spark web UI using SSH tunneling. And, I can access
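
The usual route here is to enable the event log so that a history server (or the standalone master's UI) can replay a finished application; a hedged sketch of the relevant settings on the SparkConf, with a placeholder log directory:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-job")                                      // hypothetical name
  .set("spark.eventLog.enabled", "true")                     // keep UI events after the context closes
  .set("spark.eventLog.dir", "hdfs:///user/spark/eventlog")  // placeholder directory; must already exist
val sc = new SparkContext(conf)

// After the application finishes, a Spark History Server whose
// spark.history.fs.logDirectory points at the same directory can serve its web UI.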

Spark SQL + Window + Streaming Issue - Spark SQL query is taking a long time to execute when running with Spark Streaming

Submitted by 梦想与她 on 2019-12-04 10:51:43
We are looking to implement a use case using Spark Streaming (with Flume) and Spark SQL with windowing that allows us to perform CEP calculations over a set of data (see below for how the data is captured and used). The idea is to use SQL to perform some action which matches certain conditions. Executing the query on each incoming event batch seems to be very slow (and it gets slower as it progresses). Here, slow means, say, I have configured a window size of 600 seconds and a batch interval of 20 seconds (pumping the data at a rate of 1 input per 2 seconds). So, say, at the point after 10 minutes, where incoming input
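
To put the numbers above together: a 600-second window over a 20-second batch interval re-processes up to 600 / 20 = 30 batches on every slide, and at one input every 2 seconds the window holds on the order of 300 events once it fills, which is consistent with per-batch query time growing as the window fills. A minimal sketch of that windowing call, assuming a DStream named events built with a 20-second batch interval as in the sketch further up:

import org.apache.spark.streaming.Seconds

// Every 20 seconds, re-evaluate the last 600 seconds of data (up to 30 accumulated batches).
val windowed = events.window(Seconds(600), Seconds(20))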

GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table

Submitted by 浪子不回头ぞ on 2019-12-01 20:53:18
Question: I have a Hive table in Parquet format that was generated using create table myTable (var1 int, var2 string, var3 int, var4 string, var5 array<struct<a:int,b:string>>) stored as parquet; I am able to verify that it was filled -- here is a sample value: [1, "abcdef", 2, "ghijkl", ArrayBuffer([1, "hello"])] I wish to put this into a Spark RDD of the form ((1,"abcdef"), ((2,"ghijkl"), Set((1,"hello")))). Now, using spark-shell (I get the same problem in spark-submit), I made a test RDD with these
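
When this table is read back through a HiveContext, the elements of var5 come back as Rows (GenericRowWithSchema) rather than as Scala tuples, which is typically what makes a direct cast to a tuple collection fail; a hedged sketch of one way to unpack them field by field before building a Set, assuming a HiveContext named hc:

import org.apache.spark.sql.Row

val df = hc.sql("SELECT var1, var2, var3, var4, var5 FROM myTable")

// Turn each nested struct Row into a plain (Int, String) tuple, then collect them into a Set.
val shaped = df.rdd.map { r =>
  val structs = r.getAs[Seq[Row]](4).map(s => (s.getInt(0), s.getString(1))).toSet
  ((r.getInt(0), r.getString(1)), ((r.getInt(2), r.getString(3)), structs))
}
shaped.take(3).foreach(println)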

GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table

Submitted by 五迷三道 on 2019-12-01 19:26:13
I have a Hive table in Parquet format that was generated using create table myTable (var1 int, var2 string, var3 int, var4 string, var5 array<struct<a:int,b:string>>) stored as parquet; I am able to verify that it was filled -- here is a sample value: [1, "abcdef", 2, "ghijkl", ArrayBuffer([1, "hello"])] I wish to put this into a Spark RDD of the form ((1,"abcdef"), ((2,"ghijkl"), Set((1,"hello")))). Now, using spark-shell (I get the same problem in spark-submit), I made a test RDD with these values: scala> val tempRDD = sc.parallelize(Seq(((1,"abcdef"),((2,"ghijkl"), ArrayBuffer[(Int,String)]((1
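
For comparison, a small self-contained spark-shell sketch (illustrative only; the asker's actual test RDD is truncated above) that builds such a pair RDD from the quoted sample values and converts the ArrayBuffer into the desired Set:

import scala.collection.mutable.ArrayBuffer

val tempRDD = sc.parallelize(Seq(
  ((1, "abcdef"), ((2, "ghijkl"), ArrayBuffer[(Int, String)]((1, "hello"))))
))

// mapValues keeps the (Int, String) key and converts the mutable buffer into an immutable Set.
val withSets = tempRDD.mapValues { case (k, buf) => (k, buf.toSet) }
withSets.collect().foreach(println)
// prints: ((1,abcdef),((2,ghijkl),Set((1,hello))))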