pyspark-sql

Difference between Caching mechanism in Spark SQL

守給你的承諾、 submitted on 2019-12-22 11:23:15
Question: I am trying to wrap my head around the various caching mechanisms in Spark SQL. Is there any difference between the following code snippets? Method 1: cache table test_cache AS select a, b, c from x inner join y on x.a = y.a; Method 2: create temporary view test_cache AS select a, b, c from x inner join y on x.a = y.a; cache table test_cache; Since computations in Spark are lazy, will Spark cache the results the very first time the temp table is created in Method 2? Or will it wait for any
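
A minimal sketch of the two approaches, assuming a SparkSession named spark and existing tables x and y; the column qualifiers (x.a, x.b, y.c) are an assumption added only to keep the SQL unambiguous. CACHE TABLE ... AS SELECT caches eagerly in one statement; caching an existing temporary view is also eager unless LAZY is specified.

# Method 1: create and cache in a single statement (eager)
spark.sql("""
    CACHE TABLE test_cache AS
    SELECT x.a, x.b, y.c FROM x INNER JOIN y ON x.a = y.a
""")

# Method 2: create the view first, then cache it
spark.sql("""
    CREATE TEMPORARY VIEW test_cache_2 AS
    SELECT x.a, x.b, y.c FROM x INNER JOIN y ON x.a = y.a
""")
spark.sql("CACHE TABLE test_cache_2")          # eager by default
# spark.sql("CACHE LAZY TABLE test_cache_2")   # lazy variant: materializes on first use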

Using Python's reduce() to join multiple PySpark DataFrames

人盡茶涼 submitted on 2019-12-22 10:40:04
Question: Does anyone know why using Python 3's functools.reduce() leads to worse performance when joining multiple PySpark DataFrames than iteratively joining the same DataFrames with a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error: def join_dataframes(list_of_join_columns, left_df, right_df): return left_df.join(right_df, on=list_of_join_columns) joined_df = functools.reduce( functools.partial(join_dataframes, list_of
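
For comparison, here is an illustrative sketch of both forms, assuming a list dfs of DataFrames that all share the columns in list_of_join_columns (dfs is a hypothetical placeholder; the other names come from the question). Both forms build the same chained-join plan, so the growing lineage, rather than reduce() itself, is usually the thing to investigate; df.checkpoint() (after setting a checkpoint directory) is one way to truncate the plan between joins.

import functools

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

# reduce-based chaining, as in the question
joined_reduce = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), dfs
)

# the equivalent explicit for loop
joined_loop = dfs[0]
for right_df in dfs[1:]:
    joined_loop = joined_loop.join(right_df, on=list_of_join_columns)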

E-num / get Dummies in pyspark

雨燕双飞 submitted on 2019-12-21 17:52:09
Question: I would like to create a function in PySpark that takes a DataFrame and a list of parameters (codes/categorical features) and returns the DataFrame with additional dummy columns for the categories of the features in the list. PFA the before-and-after DataFrame example. The code in Python looks like this: enum = ['column1','column2'] for e in enum: print e temp = pd.get_dummies(data[e],drop_first=True,prefix=e) data = pd.concat([data,temp], axis=1) data.drop(e,axis=1,inplace
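
A rough PySpark equivalent of the pandas snippet above, assuming a DataFrame df and the same list of categorical column names; the column names in enum are illustrative. One 0/1 column is created per distinct category with F.when, and the first category is skipped to mimic drop_first=True.

from pyspark.sql import functions as F

enum = ['column1', 'column2']   # hypothetical categorical columns

for e in enum:
    # distinct categories for this column; drop the first to mimic drop_first=True
    categories = sorted(row[e] for row in df.select(e).distinct().collect())[1:]
    for cat in categories:
        df = df.withColumn("{}_{}".format(e, cat), F.when(F.col(e) == cat, 1).otherwise(0))
    df = df.drop(e)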

Question about joining dataframes in Spark

柔情痞子 submitted on 2019-12-21 17:28:55
Question: Suppose I have two partitioned dataframes: df1 = spark.createDataFrame( [(x,x,x) for x in range(5)], ['key1', 'key2', 'time'] ).repartition(3, 'key1', 'key2') df2 = spark.createDataFrame( [(x,x,x) for x in range(7)], ['key1', 'key2', 'time'] ).repartition(3, 'key1', 'key2') (Scenario 1) If I join them by [key1, key2], the join operation is performed within each partition without a shuffle (the number of partitions in the result dataframe is the same): x = df1.join(df2, on=['key1', 'key2'], how='left')
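
One way to check what actually happens, assuming df1 and df2 were built and repartitioned exactly as above, is to inspect the physical plan and the partition counts; an extra Exchange step in the plan indicates a shuffle.

x = df1.join(df2, on=['key1', 'key2'], how='left')
x.explain()                         # no extra Exchange if the existing partitioning is reused
print(x.rdd.getNumPartitions())     # compare with df1.rdd.getNumPartitions()

# Joining on only a subset of the partitioning keys generally forces a re-shuffle,
# because hash partitioning on (key1, key2) no longer satisfies the join's requirement.
y = df1.join(df2, on=['key1'], how='left')
y.explain()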

How to calculate rolling median in PySpark using Window()?

北城以北 submitted on 2019-12-21 14:07:29
Question: How do I calculate the rolling median of dollars for a window of the previous 3 values? Input data: dollars timestampGMT 25 2017-03-18 11:27:18 17 2017-03-18 11:27:19 13 2017-03-18 11:27:20 27 2017-03-18 11:27:21 13 2017-03-18 11:27:22 43 2017-03-18 11:27:23 12 2017-03-18 11:27:24 Expected output: dollars timestampGMT rolling_median_dollar 25 2017-03-18 11:27:18 median(25) 17 2017-03-18 11:27:19 median(17,25) 13 2017-03-18 11:27:20 median(13,17,25) 27 2017-03-18 11:27:21 median(27,13,17) 13
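
A sketch of one way to get this with a window, assuming a DataFrame df with the columns shown and Spark 3.1+, where percentile_approx is available as an aggregate usable over a window; the frame rowsBetween(-2, 0) covers the current row plus the two preceding rows. On older versions, collecting the frame with collect_list and taking the median in a UDF is a common fallback.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("timestampGMT").rowsBetween(-2, 0)

df = df.withColumn(
    "rolling_median_dollar",
    F.percentile_approx("dollars", 0.5).over(w),
)
df.show()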

How to load CSV file with records on multiple lines?

白昼怎懂夜的黑 submitted on 2019-12-21 12:03:09
Question: I use Spark 2.3.0. For an Apache Spark project, I am working with this data set. When reading the CSV with Spark, a row in the Spark DataFrame does not correspond to the correct row in the CSV file (see the sample CSV here). The code looks like the following: answer_df = sparkSession.read.csv('./stacksample/Answers_sample.csv', header=True, inferSchema=True, multiLine=True); answer_df.show(2) Output +--------------------+-------------+--------------------+--------+-----+--------------------+ | Id| OwnerUserId|
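
A sketch of the usual fix, assuming the multi-line fields are wrapped in double quotes and use doubled quotes for embedded quotes (typical for Stack Overflow dumps); the explicit quote and escape options are the assumption here, the rest mirrors the question's code.

answer_df = sparkSession.read.csv(
    './stacksample/Answers_sample.csv',
    header=True,
    inferSchema=True,
    multiLine=True,
    quote='"',
    escape='"',
)
answer_df.show(2)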

Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller

久未见 submitted on 2019-12-21 04:02:24
Question: I am trying to use a Spark cluster (running on AWS EMR) to link groups of items that have common elements in them. Essentially, I have groups with some elements, and if some of the elements are in multiple groups, I want to make one group that contains the elements from all of those groups. I know about the GraphX library and I tried to use the graphframes package (the ConnectedComponents algorithm) to solve this task, but it seems that the graphframes package is not yet mature enough and is very wasteful
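
Independent of the algorithm chosen, per-iteration slowdown in such loops often comes from the query plan growing with every iteration. Below is a hedged sketch of breaking the lineage with checkpointing; groups_df, one_iteration, converged, max_iterations and the checkpoint directory are hypothetical placeholders, not part of the original post.

spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")   # assumed location

for i in range(max_iterations):                 # max_iterations is hypothetical
    groups_df = one_iteration(groups_df)        # whatever transformation merges the groups
    groups_df = groups_df.checkpoint()          # materialize and truncate the lineage
    if converged(groups_df):                    # hypothetical convergence check
        break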

replace column values in spark dataframe based on dictionary similar to np.where

一个人想着一个人 submitted on 2019-12-20 04:48:00
Question: My data frame looks like - no city amount 1 Kenora 56% 2 Sudbury 23% 3 Kenora 71% 4 Sudbury 41% 5 Kenora 33% 6 Niagara 22% 7 Hamilton 88% It consists of 92M records. I want my data frame to look like - no city amount new_city 1 Kenora 56% X 2 Niagara 23% X 3 Kenora 71% X 4 Sudbury 41% Sudbury 5 Ottawa 33% Ottawa 6 Niagara 22% X 7 Hamilton 88% Hamilton Using Python I can manage it (using np.where), but I am not getting any results in PySpark. Any help? What I have done so far - #create dictionary city_dict
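
A sketch of an np.where-style expression in PySpark, assuming the intent shown in the expected output: keep a whitelist of cities and map everything else to 'X'. The whitelist below is inferred from the example and is illustrative; a full dictionary lookup can be built the same way with F.create_map.

from pyspark.sql import functions as F

keep_cities = ['Sudbury', 'Ottawa', 'Hamilton']   # inferred from the expected output

df = df.withColumn(
    "new_city",
    F.when(F.col("city").isin(keep_cities), F.col("city")).otherwise("X"),
)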

Read range of files in pySpark

主宰稳场 submitted on 2019-12-20 03:32:24
Question: I need to read contiguous files in PySpark. The following works for me: from pyspark.sql import SQLContext file = "events.parquet/exportDay=2015090[1-7]" df = sqlContext.read.load(file) How do I read files 8-14? Answer 1: Use curly braces. file = "events.parquet/exportDay=201509{08,09,10,11,12,13,14}" Here's a similar question on Stack Overflow: Pyspark select subset of files using regex glob. They suggest either using curly braces, or performing multiple reads and then unioning the objects
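
A short sketch of both suggestions, assuming the same directory layout as in the question and a Spark 2.x DataFrame API (on 1.x, unionAll replaces union); the union-based variant is the "multiple reads" alternative mentioned at the end.

from functools import reduce

# 1) brace expansion in the path glob
df = sqlContext.read.load("events.parquet/exportDay=201509{08,09,10,11,12,13,14}")

# 2) multiple reads followed by a union
days = ["{:02d}".format(d) for d in range(8, 15)]
dfs = [sqlContext.read.load("events.parquet/exportDay=201509" + d) for d in days]
df = reduce(lambda left, right: left.union(right), dfs)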