apache-spark-1.3

Running tasks in parallel on separate Hive partitions using Scala and Spark to speed up loading from Hive and writing results to Hive or Parquet

Submitted by 左心房为你撑大大i on 2019-12-24 16:12:18
Question: This question is a spin-off from [this one] (saving a list of rows to a Hive table in pyspark). EDIT: please see my update edits at the bottom of this post. I have used both Scala and now PySpark to do the same task, but I am having problems with very slow saves of a DataFrame to Parquet or CSV, or with converting a DataFrame to a list or array-type data structure. Below is the relevant Python/PySpark code and info: #Table is a List of Rows from small Hive table I loaded using #query = "SELECT *
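
As a point of reference, below is a minimal Scala sketch of the kind of per-partition parallelism the title describes: one Spark job per Hive partition, each writing its own Parquet output. The table name my_db.my_table, the partition column part_col, the partition values, and the output path are all made up for illustration; this is not the asker's code.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc) // sc: the SparkContext already available in spark-shell

// Hypothetical partition values; in practice they could come from SHOW PARTITIONS.
val partitions = Seq("2015-01-01", "2015-01-02", "2015-01-03")

// Launch one concurrent job per Hive partition; each filters its partition and writes Parquet.
val jobs = partitions.map { p =>
  Future {
    val df = hc.sql(s"SELECT * FROM my_db.my_table WHERE part_col = '$p'")
    df.saveAsParquetFile(s"/tmp/output/part_col=$p") // Spark 1.3 DataFrame API
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)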

How to get an Iterator of Rows using Dataframe in SparkSQL

Submitted by 若如初见. on 2019-12-18 05:57:08
Question: I have an application in SparkSQL which returns a large number of rows that are very difficult to fit in memory, so I will not be able to use the collect function on the DataFrame. Is there a way by which I can get all these rows as an Iterable instead of the entire rows as a list? Note: I am executing this SparkSQL application using yarn-client. Answer 1: Generally speaking, transferring all the data to the driver looks like a pretty bad idea, and most of the time there is a better solution out there, but if you
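
A sketch of the approach usually suggested here, assuming a DataFrame named df and that fetching one partition at a time to the driver is acceptable: go through the underlying RDD's local iterator instead of collect.

// toLocalIterator materializes one partition at a time on the driver,
// so only a single partition has to fit in driver memory at once.
// It runs one Spark job per partition as the iterator is consumed.
val rows: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator

rows.take(5).foreach(println)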

Spark 1.3.0: ExecutorLostFailure depending on input file size

Submitted by 谁说胖子不能爱 on 2019-12-10 10:32:47
Question: I'm trying to run a simple Python application on a 2-node cluster I set up in standalone mode: a master and a worker, where the master also takes on the role of a worker. In the following code I'm trying to count the number of times "cakes" occurs in a 500MB text file, and it fails with an ExecutorLostFailure. Interestingly, the application runs through if I use a 100MB input file. I used the package version of CDH 5.4.4 with YARN and I'm running Spark 1.3.0. Each node has 8GB of memory and these
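
For context, a minimal Scala equivalent of the kind of count being described (the asker's actual code is Python and is not reproduced here; the input path is a placeholder):

// Count how many whitespace-separated tokens in the file equal "cakes".
val lines = sc.textFile("hdfs:///data/input_500mb.txt")
val cakeCount = lines.flatMap(_.split("\\s+"))
                     .filter(_.equalsIgnoreCase("cakes"))
                     .count()
println(s"cakes: $cakeCount")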

Spark 1.3.0: ExecutorLostFailure depending on input file size

Submitted by 我与影子孤独终老i on 2019-12-06 14:26:08
I'm trying to run a simple Python application on a 2-node cluster I set up in standalone mode: a master and a worker, where the master also takes on the role of a worker. In the following code I'm trying to count the number of times "cakes" occurs in a 500MB text file, and it fails with an ExecutorLostFailure. Interestingly, the application runs through if I use a 100MB input file. I used the package version of CDH 5.4.4 with YARN and I'm running Spark 1.3.0. Each node has 8GB of memory and these are some of my configurations: executor memory: 4g, driver memory: 2g, number of cores per worker: 1
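
The configuration values listed above would normally be passed either to spark-submit or set on the SparkConf; a hedged Scala sketch of the latter (the property keys are standard Spark settings, the application name is made up):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cake-count")               // hypothetical application name
  .set("spark.executor.memory", "4g")     // executor memory: 4g
  .set("spark.driver.memory", "2g")       // driver memory: 2g (only effective before the driver JVM starts)
  .set("spark.executor.cores", "1")       // one core per executor
val sc = new SparkContext(conf)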

Spark SQL + Window + Streaming Issue - Spark SQL query is taking a long time to execute when running with Spark Streaming

Submitted by 妖精的绣舞 on 2019-12-06 07:19:50
Question: We are looking to implement a use case using Spark Streaming (with Flume) and Spark SQL with windowing that allows us to perform CEP calculations over a set of data (see below for how the data is captured and used). The idea is to use SQL to perform some action which matches certain conditions. Executing the query on each incoming event batch seems to be very slow (and it gets slower as it progresses). Here, slow means, say, I have configured a window size of 600 seconds and a batch interval of 20 seconds.
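
A rough sketch of the general pattern being described: a windowed DStream whose contents are registered as a temporary table and queried with Spark SQL on every slide. The socket source, the Event schema, and the WHERE condition are stand-ins (the question uses Flume, e.g. via FlumeUtils.createStream); this is not the poster's code.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.SQLContext

case class Event(id: String, value: Double)

val ssc = new StreamingContext(sc, Seconds(20))                 // 20-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)             // stand-in for the Flume stream
val events = lines.map(_.split(",")).map(a => Event(a(0), a(1).toDouble))

// Keep the last 600 seconds of events and run a SQL query on every 20-second slide.
events.window(Seconds(600), Seconds(20)).foreachRDD { rdd =>
  val sqlContext = new SQLContext(rdd.sparkContext)
  import sqlContext.implicits._
  rdd.toDF().registerTempTable("events_window")
  sqlContext.sql("SELECT id, value FROM events_window WHERE value > 100") // hypothetical condition
    .collect().foreach(println)
}
ssc.start()
ssc.awaitTermination()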

How to view the logs of a spark job after it has completed and the context is closed?

Submitted by 旧街凉风 on 2019-12-06 02:47:53
Question: I am running PySpark, Spark 1.3, standalone mode, client mode. I am trying to investigate my Spark jobs by looking at jobs from the past and comparing them. I want to view their logs, the configuration settings under which the jobs were submitted, etc. But I'm running into trouble viewing the logs of jobs after the context is closed. When I submit a job, of course, I open a Spark context. While the job is running, I'm able to open the Spark web UI using SSH tunneling. And, I can access
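
The usual route here is to enable the event log so that a history server (or the standalone master's UI) can replay a finished application; a hedged sketch of the relevant settings on the SparkConf, with a placeholder log directory:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-job")                                      // hypothetical name
  .set("spark.eventLog.enabled", "true")                     // keep UI events after the context closes
  .set("spark.eventLog.dir", "hdfs:///user/spark/eventlog")  // placeholder directory; must already exist
val sc = new SparkContext(conf)

// After the application finishes, a Spark History Server whose
// spark.history.fs.logDirectory points at the same directory can serve its web UI.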

Spark SQL + Window + Streaming Issue - Spark SQL query is taking a long time to execute when running with Spark Streaming

Submitted by 梦想与她 on 2019-12-04 10:51:43
We are looking to implement a use case using Spark Streaming (with Flume) and Spark SQL with windowing that allows us to perform CEP calculations over a set of data (see below for how the data is captured and used). The idea is to use SQL to perform some action which matches certain conditions. Executing the query on each incoming event batch seems to be very slow (and it gets slower as it progresses). Here, slow means, say, I have configured a window size of 600 seconds and a batch interval of 20 seconds (pumping the data at a rate of 1 input per 2 seconds). So, say, at the point after 10 minutes, where incoming input
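
To put the numbers above together: a 600-second window over a 20-second batch interval re-processes up to 600 / 20 = 30 batches on every slide, and at one input every 2 seconds the window holds on the order of 300 events once it fills, which is consistent with per-batch query time growing as the window fills. A minimal sketch of that windowing call, assuming a DStream named events built with a 20-second batch interval as in the sketch further up:

import org.apache.spark.streaming.Seconds

// Every 20 seconds, re-evaluate the last 600 seconds of data (up to 30 accumulated batches).
val windowed = events.window(Seconds(600), Seconds(20))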

GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table

Submitted by 浪子不回头ぞ on 2019-12-01 20:53:18
Question: I have a Hive table in Parquet format that was generated using create table myTable (var1 int, var2 string, var3 int, var4 string, var5 array<struct<a:int,b:string>>) stored as parquet; I am able to verify that it was filled -- here is a sample value: [1, "abcdef", 2, "ghijkl", ArrayBuffer([1, "hello"])] I wish to put this into a Spark RDD of the form ((1,"abcdef"), ((2,"ghijkl"), Set((1,"hello")))). Now, using spark-shell (I get the same problem in spark-submit), I made a test RDD with these
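
When this table is read back through a HiveContext, the elements of var5 come back as Rows (GenericRowWithSchema) rather than as Scala tuples, which is typically what makes a direct cast to a tuple collection fail; a hedged sketch of one way to unpack them field by field before building a Set, assuming a HiveContext named hc:

import org.apache.spark.sql.Row

val df = hc.sql("SELECT var1, var2, var3, var4, var5 FROM myTable")

// Turn each nested struct Row into a plain (Int, String) tuple, then collect them into a Set.
val shaped = df.rdd.map { r =>
  val structs = r.getAs[Seq[Row]](4).map(s => (s.getInt(0), s.getString(1))).toSet
  ((r.getInt(0), r.getString(1)), ((r.getInt(2), r.getString(3)), structs))
}
shaped.take(3).foreach(println)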

GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table

Submitted by 五迷三道 on 2019-12-01 19:26:13
I have a Hive table in Parquet format that was generated using create table myTable (var1 int, var2 string, var3 int, var4 string, var5 array<struct<a:int,b:string>>) stored as parquet; I am able to verify that it was filled -- here is a sample value: [1, "abcdef", 2, "ghijkl", ArrayBuffer([1, "hello"])] I wish to put this into a Spark RDD of the form ((1,"abcdef"), ((2,"ghijkl"), Set((1,"hello")))). Now, using spark-shell (I get the same problem in spark-submit), I made a test RDD with these values: scala> val tempRDD = sc.parallelize(Seq(((1,"abcdef"),((2,"ghijkl"), ArrayBuffer[(Int,String)]((1
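
For comparison, a small self-contained spark-shell sketch (illustrative only; the asker's actual test RDD is truncated above) that builds such a pair RDD from the quoted sample values and converts the ArrayBuffer into the desired Set:

import scala.collection.mutable.ArrayBuffer

val tempRDD = sc.parallelize(Seq(
  ((1, "abcdef"), ((2, "ghijkl"), ArrayBuffer[(Int, String)]((1, "hello"))))
))

// mapValues keeps the (Int, String) key and converts the mutable buffer into an immutable Set.
val withSets = tempRDD.mapValues { case (k, buf) => (k, buf.toSet) }
withSets.collect().foreach(println)
// prints: ((1,abcdef),((2,ghijkl),Set((1,hello))))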