pyspark-sql

Difference between Caching mechanism in Spark SQL

守給你的承諾、 submitted on 2019-12-22 11:23:15
Question: I am trying to wrap my head around the various caching mechanisms in Spark SQL. Is there any difference between the following code snippets? Method 1: cache table test_cache AS select a, b, c from x inner join y on x.a = y.a; Method 2: create temporary view test_cache AS select a, b, c from x inner join y on x.a = y.a; cache table test_cache; Since computations in Spark are lazy, will Spark cache the results the very first time the temp table is created in Method 2? Or will it wait for any
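
A minimal sketch of the two approaches, assuming a SparkSession named spark and existing tables x and y; the column qualifiers (x.a, x.b, y.c) are an assumption added only to keep the SQL unambiguous. CACHE TABLE ... AS SELECT caches eagerly in one statement; caching an existing temporary view is also eager unless LAZY is specified.

# Method 1: create and cache in a single statement (eager)
spark.sql("""
    CACHE TABLE test_cache AS
    SELECT x.a, x.b, y.c FROM x INNER JOIN y ON x.a = y.a
""")

# Method 2: create the view first, then cache it
spark.sql("""
    CREATE TEMPORARY VIEW test_cache_2 AS
    SELECT x.a, x.b, y.c FROM x INNER JOIN y ON x.a = y.a
""")
spark.sql("CACHE TABLE test_cache_2")          # eager by default
# spark.sql("CACHE LAZY TABLE test_cache_2")   # lazy variant: materializes on first use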

Using Python's reduce() to join multiple PySpark DataFrames

人盡茶涼 submitted on 2019-12-22 10:40:04
Question: Does anyone know why using Python 3's functools.reduce() leads to worse performance when joining multiple PySpark DataFrames than iteratively joining the same DataFrames with a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error: def join_dataframes(list_of_join_columns, left_df, right_df): return left_df.join(right_df, on=list_of_join_columns) joined_df = functools.reduce( functools.partial(join_dataframes, list_of
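
For comparison, here is an illustrative sketch of both forms, assuming a list dfs of DataFrames that all share the columns in list_of_join_columns (dfs is a hypothetical placeholder; the other names come from the question). Both forms build the same chained-join plan, so the growing lineage, rather than reduce() itself, is usually the thing to investigate; df.checkpoint() (after setting a checkpoint directory) is one way to truncate the plan between joins.

import functools

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

# reduce-based chaining, as in the question
joined_reduce = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), dfs
)

# the equivalent explicit for loop
joined_loop = dfs[0]
for right_df in dfs[1:]:
    joined_loop = joined_loop.join(right_df, on=list_of_join_columns)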

E-num / get Dummies in pyspark

雨燕双飞 submitted on 2019-12-21 17:52:09
Question: I would like to create a function in PySpark that takes a DataFrame and a list of parameters (codes/categorical features) and returns the DataFrame with additional dummy columns for the categories of the features in the list. PFA the before-and-after DataFrame example. The code in Python looks like this: enum = ['column1','column2'] for e in enum: print e temp = pd.get_dummies(data[e],drop_first=True,prefix=e) data = pd.concat([data,temp], axis=1) data.drop(e,axis=1,inplace
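
A rough PySpark equivalent of the pandas snippet above, assuming a DataFrame df and the same list of categorical column names; the column names in enum are illustrative. One 0/1 column is created per distinct category with F.when, and the first category is skipped to mimic drop_first=True.

from pyspark.sql import functions as F

enum = ['column1', 'column2']   # hypothetical categorical columns

for e in enum:
    # distinct categories for this column; drop the first to mimic drop_first=True
    categories = sorted(row[e] for row in df.select(e).distinct().collect())[1:]
    for cat in categories:
        df = df.withColumn("{}_{}".format(e, cat), F.when(F.col(e) == cat, 1).otherwise(0))
    df = df.drop(e)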

Question about joining dataframes in Spark

柔情痞子 submitted on 2019-12-21 17:28:55
Question: Suppose I have two partitioned dataframes: df1 = spark.createDataFrame( [(x,x,x) for x in range(5)], ['key1', 'key2', 'time'] ).repartition(3, 'key1', 'key2') df2 = spark.createDataFrame( [(x,x,x) for x in range(7)], ['key1', 'key2', 'time'] ).repartition(3, 'key1', 'key2') (Scenario 1) If I join them by [key1, key2], the join operation is performed within each partition without a shuffle (the number of partitions in the result dataframe is the same): x = df1.join(df2, on=['key1', 'key2'], how='left')
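
One way to check what actually happens, assuming df1 and df2 were built and repartitioned exactly as above, is to inspect the physical plan and the partition counts; an extra Exchange step in the plan indicates a shuffle.

x = df1.join(df2, on=['key1', 'key2'], how='left')
x.explain()                         # no extra Exchange if the existing partitioning is reused
print(x.rdd.getNumPartitions())     # compare with df1.rdd.getNumPartitions()

# Joining on only a subset of the partitioning keys generally forces a re-shuffle,
# because hash partitioning on (key1, key2) no longer satisfies the join's requirement.
y = df1.join(df2, on=['key1'], how='left')
y.explain()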

How to calculate rolling median in PySpark using Window()?

北城以北 submitted on 2019-12-21 14:07:29
Question: How do I calculate the rolling median of dollars for a window of the previous 3 values? Input data: dollars timestampGMT 25 2017-03-18 11:27:18 17 2017-03-18 11:27:19 13 2017-03-18 11:27:20 27 2017-03-18 11:27:21 13 2017-03-18 11:27:22 43 2017-03-18 11:27:23 12 2017-03-18 11:27:24 Expected output: dollars timestampGMT rolling_median_dollar 25 2017-03-18 11:27:18 median(25) 17 2017-03-18 11:27:19 median(17,25) 13 2017-03-18 11:27:20 median(13,17,25) 27 2017-03-18 11:27:21 median(27,13,17) 13
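
A sketch of one way to get this with a window, assuming a DataFrame df with the columns shown and Spark 3.1+, where percentile_approx is available as an aggregate usable over a window; the frame rowsBetween(-2, 0) covers the current row plus the two preceding rows. On older versions, collecting the frame with collect_list and taking the median in a UDF is a common fallback.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("timestampGMT").rowsBetween(-2, 0)

df = df.withColumn(
    "rolling_median_dollar",
    F.percentile_approx("dollars", 0.5).over(w),
)
df.show()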

How to load CSV file with records on multiple lines?

白昼怎懂夜的黑 submitted on 2019-12-21 12:03:09
Question: I use Spark 2.3.0. For an Apache Spark project, I am working with this data set. When reading the CSV with Spark, a row in the Spark DataFrame does not correspond to the correct row in the CSV file (see the sample CSV here). The code looks like the following: answer_df = sparkSession.read.csv('./stacksample/Answers_sample.csv', header=True, inferSchema=True, multiLine=True); answer_df.show(2) Output +--------------------+-------------+--------------------+--------+-----+--------------------+ | Id| OwnerUserId|
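
A sketch of the usual fix, assuming the multi-line fields are wrapped in double quotes and use doubled quotes for embedded quotes (typical for Stack Overflow dumps); the explicit quote and escape options are the assumption here, the rest mirrors the question's code.

answer_df = sparkSession.read.csv(
    './stacksample/Answers_sample.csv',
    header=True,
    inferSchema=True,
    multiLine=True,
    quote='"',
    escape='"',
)
answer_df.show(2)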

Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller

久未见 submitted on 2019-12-21 04:02:24
Question: I am trying to use a Spark cluster (running on AWS EMR) to link groups of items that have common elements in them. Essentially, I have groups with some elements, and if some of the elements are in multiple groups, I want to make one group that contains the elements from all of those groups. I know about the GraphX library and I tried to use the graphframes package (the ConnectedComponents algorithm) to solve this task, but it seems that the graphframes package is not yet mature enough and is very wasteful
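
Independent of the algorithm chosen, per-iteration slowdown in such loops often comes from the query plan growing with every iteration. Below is a hedged sketch of breaking the lineage with checkpointing; groups_df, one_iteration, converged, max_iterations and the checkpoint directory are hypothetical placeholders, not part of the original post.

spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")   # assumed location

for i in range(max_iterations):                 # max_iterations is hypothetical
    groups_df = one_iteration(groups_df)        # whatever transformation merges the groups
    groups_df = groups_df.checkpoint()          # materialize and truncate the lineage
    if converged(groups_df):                    # hypothetical convergence check
        break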

replace column values in spark dataframe based on dictionary similar to np.where

一个人想着一个人 submitted on 2019-12-20 04:48:00
Question: My data frame looks like - no city amount 1 Kenora 56% 2 Sudbury 23% 3 Kenora 71% 4 Sudbury 41% 5 Kenora 33% 6 Niagara 22% 7 Hamilton 88% It consists of 92M records. I want my data frame to look like - no city amount new_city 1 Kenora 56% X 2 Niagara 23% X 3 Kenora 71% X 4 Sudbury 41% Sudbury 5 Ottawa 33% Ottawa 6 Niagara 22% X 7 Hamilton 88% Hamilton Using Python I can manage it (using np.where), but I am not getting any results in PySpark. Any help? What I have done so far - #create dictionary city_dict
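
A sketch of an np.where-style expression in PySpark, assuming the intent shown in the expected output: keep a whitelist of cities and map everything else to 'X'. The whitelist below is inferred from the example and is illustrative; a full dictionary lookup can be built the same way with F.create_map.

from pyspark.sql import functions as F

keep_cities = ['Sudbury', 'Ottawa', 'Hamilton']   # inferred from the expected output

df = df.withColumn(
    "new_city",
    F.when(F.col("city").isin(keep_cities), F.col("city")).otherwise("X"),
)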

Read range of files in pySpark

主宰稳场 submitted on 2019-12-20 03:32:24
Question: I need to read contiguous files in PySpark. The following works for me: from pyspark.sql import SQLContext file = "events.parquet/exportDay=2015090[1-7]" df = sqlContext.read.load(file) How do I read files 8-14? Answer 1: Use curly braces. file = "events.parquet/exportDay=201509{08,09,10,11,12,13,14}" Here's a similar question on Stack Overflow: Pyspark select subset of files using regex glob. They suggest either using curly braces, or performing multiple reads and then unioning the objects
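
A short sketch of both suggestions, assuming the same directory layout as in the question and a Spark 2.x DataFrame API (on 1.x, unionAll replaces union); the union-based variant is the "multiple reads" alternative mentioned at the end.

from functools import reduce

# 1) brace expansion in the path glob
df = sqlContext.read.load("events.parquet/exportDay=201509{08,09,10,11,12,13,14}")

# 2) multiple reads followed by a union
days = ["{:02d}".format(d) for d in range(8, 15)]
dfs = [sqlContext.read.load("events.parquet/exportDay=201509" + d) for d in days]
df = reduce(lambda left, right: left.union(right), dfs)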