pyspark-sql

How to left outer join two big tables effectively

Submitted by 冷暖自知 on 2019-12-11 07:01:05
Question: I have two tables, table_a and table_b. table_a contains 216646500 rows (7155998163 bytes); table_b contains 1462775 rows (2096277141 bytes). table_a's schema is c_1, c_2, c_3, c_4; table_b's schema is c_2, c_5, c_6, ... (about 10 columns). I want to do a left_outer join of the two tables on the shared key col_2, but it has been running for 16 hours and still hasn't finished. The PySpark code is as follows:

combine_table = table_a.join(table_b, table_a.col_2 == table_b.col_2, 'left_outer').collect()

Is there …
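A minimal sketch of one common approach, assuming the join key is the col_2 column used in the snippet and that the goal is the joined table rather than a driver-side list: broadcast the much smaller table_b and write the result out instead of calling collect(). Broadcasting a ~2 GB table only works if the driver and executors have enough memory, and the output path below is purely illustrative.

from pyspark.sql.functions import broadcast

# table_b (~2 GB / 1.4M rows) is much smaller than table_a (~7 GB / 216M rows),
# so broadcasting it avoids shuffling the large table across the cluster.
combine_table = table_a.join(broadcast(table_b), on="col_2", how="left_outer")

# Persist the result instead of collect(): pulling ~216M joined rows into the
# driver is often what makes a job like this appear to hang.
combine_table.write.mode("overwrite").parquet("combined_table.parquet")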

How to plot using matplotlib and pandas in pyspark environment?

Submitted by ε祈祈猫儿з on 2019-12-11 06:46:14
Question: I have a very large PySpark dataframe, so I took a sample and converted it into a pandas dataframe:

sample = heavy_pivot.sample(False, fraction = 0.2, seed = None)
sample_pd = sample.toPandas()

The dataframe looks like this:

sample_pd[['client_id', 'beer_freq']].head(10)

   client_id  beer_freq
0    1000839   0.000000
1    1002185   0.000000
2    1003366   1.000000
3    1005218   1.000000
4    1005483   1.000000
5     100964   0.434783
6     101272   0.166667
7    1017462   0.000000
8    1020561   0.000000
9    1023646   0.000000

I want to plot a …
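A minimal plotting sketch, assuming a histogram of beer_freq is what is wanted; the Agg backend and the output filename are assumptions for a driver machine without a display.

import matplotlib
matplotlib.use("Agg")          # headless driver: render to a file, not a window
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
sample_pd["beer_freq"].plot(kind="hist", bins=20, ax=ax)
ax.set_xlabel("beer_freq")
ax.set_ylabel("number of clients")
fig.savefig("beer_freq_hist.png")   # illustrative output path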

Column values to dynamically define struct

Submitted by 半城伤御伤魂 on 2019-12-11 06:25:00
Question: I have two nested arrays, one of strings and the other of floats. I would like to essentially zip these up and have one (value, var) combo per row. I was trying to do it with just a dataframe and not have to resort to RDDs or UDFs, thinking that this would be cleaner and faster. I can turn the per-row arrays of values and variables into a struct of one (value, variable) per row, but because my array sizes vary I have to run my array comprehension over different ranges. So I thought I could just specify the …
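One DataFrame-only route, sketched under the assumption of Spark 2.4+ and two array columns named vals and vars (names invented here, not taken from the question): arrays_zip pairs the arrays element by element regardless of their per-row length, and explode then yields one (value, var) struct per row.

from pyspark.sql import functions as F

# Zip the two arrays positionally, then explode so each pair becomes a row.
zipped = df.withColumn("pair", F.explode(F.arrays_zip("vals", "vars")))

result = zipped.select(
    F.col("pair.vals").alias("value"),
    F.col("pair.vars").alias("var"),
)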

Combining multiple groupBy functions into 1

Submitted by 纵然是瞬间 on 2019-12-11 06:08:38
Question: Using this code to find the mode:

import numpy as np
np.random.seed(1)
df2 = sc.parallelize([
    (int(x), ) for x in np.random.randint(50, size=10000)
]).toDF(["x"])

cnts = df2.groupBy("x").count()
mode = cnts.join(
    cnts.agg(max("count").alias("max_")), col("count") == col("max_")
).limit(1).select("x")
mode.first()[0]

from Calculate the mode of a PySpark DataFrame column? returns this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most …
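The traceback is cut short, but a frequent cause of this exact pattern failing is that max resolves to Python's builtin and col is not imported at all, so neither produces a Spark Column. A sketch of the same mode computation with explicit imports:

from pyspark.sql import functions as F

cnts = df2.groupBy("x").count()
mode = cnts.join(
    cnts.agg(F.max("count").alias("max_")),   # Spark's max, not Python's builtin
    F.col("count") == F.col("max_")
).limit(1).select("x")
mode.first()[0]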

How to load only first n files in pyspark spark.read.csv from a single directory

Submitted by 可紊 on 2019-12-11 05:07:59
Question: I have a scenario where I am loading and processing 4 TB of data, which is about 15000 .csv files in a folder. Since I have limited resources, I am planning to process them in two batches and then union them. I am trying to understand whether I can load only 50% (or the first n files in batch 1 and the rest in batch 2) using spark.read.csv. I cannot use a regular expression, as these files are generated from multiple sources and their counts are uneven (from some sources they are few and from …
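A sketch of one way to do the split, assuming the files sit in HDFS under a directory such as /data/input (the path, and the header/inferSchema options, are invented here): list the paths yourself, slice the list, and hand each half to spark.read.csv, which accepts a list of paths.

import subprocess

# `hdfs dfs -ls -C` prints the paths only, one per line.
out = subprocess.check_output(["hdfs", "dfs", "-ls", "-C", "/data/input"])
all_files = sorted(p for p in out.decode().splitlines() if p.endswith(".csv"))

half = len(all_files) // 2
batch1 = spark.read.csv(all_files[:half], header=True, inferSchema=True)
batch2 = spark.read.csv(all_files[half:], header=True, inferSchema=True)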

Can we use keyword arguments in UDF

Submitted by 谁都会走 on 2019-12-11 04:19:54
Question: The question I have is: can we use keyword arguments with a UDF in PySpark, as I did below? The conv method has a keyword argument conv_type which by default is assigned to a specific formatter, but I want to specify a different format in some places. That is not getting through to the UDF because it is a keyword argument. Is there a different approach to using a keyword argument here?

from datetime import datetime as dt, timedelta as td, date
tpid_date_dict = {'69': '%d/%m/%Y', '62': '%Y/%m/%d' …
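A sketch of one workaround: bind the keyword argument before wrapping the function, so the UDF itself only ever receives column values. The conv body, return type, and column name below are placeholders, not the code from the question.

from functools import partial
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def conv(value, conv_type="%Y-%m-%d"):
    return value  # placeholder for the real formatting logic

conv_default = F.udf(conv, StringType())                                  # default format
conv_special = F.udf(partial(conv, conv_type="%d/%m/%Y"), StringType())   # format bound up front

df = df.withColumn("formatted", conv_special(F.col("date_str")))

Passing the format as an extra lit(...) column argument is another option, at the cost of making it positional.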

How to set pivotMaxValues in pyspark?

Submitted by 无人久伴 on 2019-12-11 04:08:20
Question: I am trying to pivot a column which has more than 10000 distinct values. The default limit in Spark for the maximum number of distinct values is 10000, and I am receiving this error:

The pivot column COLUMN_NUM_2 has more than 10000 distinct values, this could indicate an error. If this was intended, set spark.sql.pivotMaxValues to at least the number of distinct values of the pivot column

How do I set this in PySpark?

Answer 1: You have to add/set this parameter in the Spark interpreter. I am working …
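A sketch of setting the property from PySpark itself, assuming a value of 20000 is at least the number of distinct values in the pivot column:

from pyspark.sql import SparkSession

# Either configure it when the session is built ...
spark = (SparkSession.builder
         .config("spark.sql.pivotMaxValues", "20000")
         .getOrCreate())

# ... or set it at runtime on an existing session.
spark.conf.set("spark.sql.pivotMaxValues", "20000")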

combining multiple rows in Spark dataframe column based on condition

Submitted by 允我心安 on 2019-12-11 03:06:49
Question: I am trying to combine multiple rows in a Spark dataframe based on a condition. This is the dataframe I have (df):

|username | qid | row_no | text |
---------------------------------
| a       | 1   | 1      | this |
| a       | 1   | 2      | is   |
| d       | 2   | 1      | the  |
| a       | 1   | 3      | text |
| d       | 2   | 2      | ball |

I want it to look like this:

|username | qid | row_no | text         |
-----------------------------------------
| a       | 1   | 1,2,3  | This is text |
| b       | 2   | 1,2    | The ball     |

I am using Spark 1.5.2; it does not have collect_list …
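A minimal RDD-based sketch that avoids collect_list entirely (it only appeared in the DataFrame API in Spark 1.6): group by (username, qid), sort the fragments by row_no, and rebuild the rows.

grouped = (df.rdd
           # key by (username, qid), keep (row_no, text) so we can sort later
           .map(lambda r: ((r.username, r.qid), [(r.row_no, r.text)]))
           .reduceByKey(lambda a, b: a + b)
           # join the row numbers and the text fragments in row_no order
           .map(lambda kv: (kv[0][0],
                            kv[0][1],
                            ",".join(str(n) for n, _ in sorted(kv[1])),
                            " ".join(t for _, t in sorted(kv[1])))))

combined = sqlContext.createDataFrame(grouped, ["username", "qid", "row_no", "text"])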

Create a column in a PySpark dataframe using a list whose indices are present in one column of the dataframe

Submitted by 半世苍凉 on 2019-12-11 01:22:31
Question: I'm new to Python and PySpark. I have a dataframe in PySpark like the following:

## +---+---+------+
## | x1| x2|   x3 |
## +---+---+------+
## |  0|  a| 13.0 |
## |  2|  B| -33.0|
## |  1|  B| -63.0|
## +---+---+------+

I have an array:

arr = [10, 12, 13]

I want to create a column x4 in the dataframe such that it has the corresponding values from the list, using the values of x1 as indices. The final dataset should look like:

## +---+---+------+-----+
## | x1| x2|   x3 |  x4 |
## +---+---+- …
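A minimal sketch using a plain Python UDF that closes over the list; building a small (index, value) DataFrame from arr and joining it on x1 would also work and scales better for long lists.

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

arr = [10, 12, 13]

# arr is captured in the lambda's closure and shipped to the executors.
lookup = F.udf(lambda i: arr[i] if i is not None and 0 <= i < len(arr) else None,
               IntegerType())

df = df.withColumn("x4", lookup(F.col("x1")))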

Can I use Spark DataFrame inside regular Spark map operation?

Submitted by 妖精的绣舞 on 2019-12-11 00:29:42
Question: I tried to use a previously defined Spark DataFrame from inside a regular Spark map operation, like below:

businessJSON = os.path.join(targetDir, 'business.json')
businessDF = sqlContext.read.json(businessJSON)

reviewsJSON = os.path.join(targetDir, 'review.json')
reviewsDF = sqlContext.read.json(reviewsJSON)

contains = udf(lambda xs, val: val in xs, BooleanType())

def selectReviews(category):
    businessesByCategory = businessDF[contains(businessDF.categories, lit(category))]
    selectedReviewsDF = reviewsDF …
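A DataFrame only exists on the driver, so referencing businessDF from inside a map (or a UDF) running on the executors does not work. A sketch of a common rework, assuming a business_id join key and an illustrative category list and output path, keeps everything as driver-side DataFrame operations:

from pyspark.sql.functions import lit

def selectReviews(category):
    businessesByCategory = businessDF[contains(businessDF.categories, lit(category))]
    # Join instead of looking businesses up from inside a map on the executors.
    return reviewsDF.join(
        businessesByCategory,
        reviewsDF.business_id == businessesByCategory.business_id,
        "inner")

# Plain Python loop on the driver, not an RDD map over categories.
for category in ["Restaurants", "Bars"]:
    selectReviews(category).write.mode("overwrite").json("reviews_" + category)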