pyspark-sql

Checking if a String contains a sub-string across different DataFrames

旧街凉风 submitted on 2019-12-18 07:14:46
Question: I have 2 dataframes. In df_1, the column id_normalized contains only characters and numbers (normalized), alongside id_no_normalized. Example:

id_normalized | id_no_normalized
--------------|-----------------
ABC           | A_B.C
ERFD          | E.R_FD
12ZED         | 12_Z.ED

In df_2, the column name contains only characters and numbers, with the normalized ids embedded. Example:

name
----------------------------
googleisa12ZEDgoodnavigator
----------------------------
internetABCexplorer
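
A hedged sketch of one way to match the two DataFrames, assuming the goal is to find which id_normalized values appear inside name: a cross join filtered with contains. DataFrame and column names follow the question; this is only one plausible reading of the truncated excerpt.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df_1 = spark.createDataFrame(
    [("ABC", "A_B.C"), ("ERFD", "E.R_FD"), ("12ZED", "12_Z.ED")],
    ["id_normalized", "id_no_normalized"],
)
df_2 = spark.createDataFrame(
    [("googleisa12ZEDgoodnavigator",), ("internetABCexplorer",)], ["name"]
)

# Cross join, then keep rows where the normalized id appears inside name.
matched = (
    df_2.crossJoin(df_1)
        .where(F.col("name").contains(F.col("id_normalized")))
        .select("name", "id_normalized", "id_no_normalized")
)
matched.show(truncate=False)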

PySpark Dataframe from Python Dictionary without Pandas

邮差的信 submitted on 2019-12-18 06:49:49
Question: I am trying to convert the following Python dict into a PySpark DataFrame, but I am not getting the expected output.

dict_lst = {'letters': ['a', 'b', 'c'], 'numbers': [10, 20, 30]}
df_dict = sc.parallelize([dict_lst]).toDF()  # Result not as expected
df_dict.show()

Is there a way to do this without using Pandas?

Answer 1: Quoting myself: I find it's useful to think of the argument to createDataFrame() as a list of tuples, where each entry in the list corresponds to a row in the DataFrame and each element …
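
A minimal sketch of the list-of-tuples approach the answer describes, assuming a SparkSession named spark is available; the zip step is one convenient way to get from the dict to row tuples.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dict_lst = {'letters': ['a', 'b', 'c'], 'numbers': [10, 20, 30]}

# Zip the value lists into one tuple per row, and pass the keys as column names.
rows = list(zip(*dict_lst.values()))          # [('a', 10), ('b', 20), ('c', 30)]
df = spark.createDataFrame(rows, list(dict_lst.keys()))
df.show()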

How to create a DataFrame out of rows while retaining existing schema?

老子叫甜甜 submitted on 2019-12-18 05:24:07
Question: If I call map or mapPartitions and my function receives rows from PySpark, what is the natural way to create either a local PySpark or Pandas DataFrame? Something that combines the rows and retains the schema? Currently I do something like:

def combine(partition):
    rows = [x for x in partition]
    dfpart = pd.DataFrame(rows, columns=rows[0].keys())
    pandafunc(dfpart)

mydf.rdd.mapPartitions(combine)

Answer 1: Spark >= 2.3.0: Since Spark 2.3.0 it is possible to use a Pandas Series or DataFrame by partition or group …
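
A minimal sketch of the Spark >= 2.3 approach the answer alludes to, here via the grouped-map form DataFrame.groupBy(...).applyInPandas (the Spark 3.x spelling). The column names and the doubling operation are purely illustrative.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
mydf = spark.createDataFrame([(1, 10.0), (1, 20.0), (2, 5.0)], ["key", "value"])

def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf is a Pandas DataFrame holding one whole group, with the schema preserved.
    pdf["value"] = pdf["value"] * 2
    return pdf

result = mydf.groupBy("key").applyInPandas(per_group, schema=mydf.schema)
result.show()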

Trouble With Pyspark Round Function

北慕城南 submitted on 2019-12-18 04:48:07
Question: Having some trouble getting the round function in PySpark to work. I have the block of code below, where I'm trying to round the new_bid column to 2 decimal places and rename the column as bid afterwards. I'm importing pyspark.sql.functions as func for reference, and using the round function contained within it:

output = output.select(
    col("ad").alias("ad_id"),
    col("part").alias("part_id"),
    func.round(col("new_bid"), 2).alias("bid")
)

The new_bid column here is of type float; the resulting …
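
A self-contained sketch of the same pattern. A common cause of trouble here is Python's builtin round shadowing (or being shadowed by) the Spark function, which the func. prefix avoids; the input data below is invented for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
output = spark.createDataFrame(
    [("a1", "p1", 1.2345), ("a2", "p2", 9.8765)],
    ["ad", "part", "new_bid"],
)

# func.round is the Spark SQL round, distinct from Python's builtin round().
rounded = output.select(
    col("ad").alias("ad_id"),
    col("part").alias("part_id"),
    func.round(col("new_bid"), 2).alias("bid"),
)
rounded.show()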

PySpark: when function with multiple outputs [duplicate]

谁说我不能喝 submitted on 2019-12-18 04:46:06
Question: This question already has answers here: Spark Equivalent of IF Then ELSE (4 answers). Closed 2 years ago. I am trying to use a "chained when" function. In other words, I'd like to get more than two outputs. I tried using the same logic as concatenating IF functions in Excel:

df.withColumn("device_id", when(col("device")=="desktop",1)).otherwise(when(col("device")=="mobile",2)).otherwise(null))

But that doesn't work, since I can't put a tuple into the "otherwise" function.

Answer 1: Have you tried …
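
A minimal sketch of the chained form the linked duplicate points toward: when calls chain directly off each other, with a single otherwise at the end. Device values follow the question; the fallback of None is an assumption.

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("desktop",), ("mobile",), ("tablet",)], ["device"])

df = df.withColumn(
    "device_id",
    when(col("device") == "desktop", 1)
    .when(col("device") == "mobile", 2)
    .otherwise(None)          # everything else becomes null
)
df.show()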

How to TRUNCATE and/or use wildcards with Databricks

空扰寡人 submitted on 2019-12-17 21:14:54
Question: I'm trying to write a script in Databricks that will select a file based on certain characters in the name of the file, or just on the datestamp in the file name. For example, a file name looks as follows:

LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31

I have created the following code in Databricks:

import datetime
now1 = datetime.datetime.now()
now = now1.strftime("%Y-%m-%d")

Using the above code I tried to select the file using the following:

LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv' …
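
A hedged sketch of one way to do this: Spark's file readers accept glob wildcards in the path, so the formatted datestamp can be combined with * to match the trailing time-of-day. The directory /mnt/raw/ is purely illustrative and not from the question.

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

now = datetime.datetime.now().strftime("%Y-%m-%d")

# Glob pattern: match today's datestamp, with any time suffix before .csv.
path = "/mnt/raw/LCMS_MRD_Delta_LoyaltyAccount_1992_%s*.csv" % now
df = spark.read.csv(path, header=True)
df.show()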

Spark - Window with recursion? - Conditionally propagating values across rows

余生颓废 submitted on 2019-12-17 16:54:32
Question: I have the following dataframe showing the revenue of purchases.

+-------+--------+-------+
|user_id|visit_id|revenue|
+-------+--------+-------+
|      1|       1|      0|
|      1|       2|      0|
|      1|       3|      0|
|      1|       4|    100|
|      1|       5|      0|
|      1|       6|      0|
|      1|       7|    200|
|      1|       8|      0|
|      1|       9|     10|
+-------+--------+-------+

Ultimately I want the new column purch_revenue to show the revenue generated by the purchase in every row. As a workaround, I have also tried to introduce a purchase identifier purch_id which is incremented each time a …
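
A hedged sketch of the usual window-based workaround for this kind of propagation (no recursion): build a purchase identifier as a running count of non-zero revenues, then spread the purchase revenue across each group. Since the excerpt is truncated, the direction of propagation here is only one plausible reading.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, v, r) for v, r in enumerate([0, 0, 0, 100, 0, 0, 200, 0, 10], start=1)],
    ["user_id", "visit_id", "revenue"],
)

w = Window.partitionBy("user_id").orderBy("visit_id")

df = (
    df.withColumn("purch_id", F.sum((F.col("revenue") > 0).cast("int")).over(w))
      .withColumn(
          "purch_revenue",
          F.max("revenue").over(Window.partitionBy("user_id", "purch_id")),
      )
)
df.show()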

Applying a Window function to calculate differences in pySpark

蓝咒 submitted on 2019-12-17 15:42:44
Question: I am using pySpark, and have set up my dataframe with two columns representing a daily asset price as follows:

ind = sc.parallelize(range(1, 5))
prices = sc.parallelize([33.3, 31.1, 51.2, 21.3])
data = ind.zip(prices)
df = sqlCtx.createDataFrame(data, ["day", "price"])

Upon applying df.show() I get:

+---+-----+
|day|price|
+---+-----+
|  1| 33.3|
|  2| 31.1|
|  3| 51.2|
|  4| 21.3|
+---+-----+

Which is fine and all. I would like to have another column that contains the day-to-day returns of the price …
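
A minimal sketch of the standard answer to this: F.lag over a window ordered by day gives the previous price, from which a return can be computed. The simple ratio-minus-one return formula is an assumption about what the asker wants.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    list(zip(range(1, 5), [33.3, 31.1, 51.2, 21.3])), ["day", "price"]
)

w = Window.orderBy("day")
df = df.withColumn("prev_price", F.lag("price").over(w)) \
       .withColumn("return", F.col("price") / F.col("prev_price") - 1)
df.show()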

TypeError: Column is not iterable - How to iterate over ArrayType()?

眉间皱痕 submitted on 2019-12-17 04:07:09
Question: Consider the following DataFrame:

+------+-----------------------+
|type  |names                  |
+------+-----------------------+
|person|[john, sam, jane]      |
|pet   |[whiskers, rover, fido]|
+------+-----------------------+

Which can be created with the following code:

import pyspark.sql.functions as f
data = [
    ('person', ['john', 'sam', 'jane']),
    ('pet', ['whiskers', 'rover', 'fido'])
]
df = sqlCtx.createDataFrame(data, ["type", "names"])
df.show(truncate=False)

Is there a way to directly modify the …
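
A hedged sketch of one way to operate on each element of an ArrayType column without iterating over the Column object: the transform higher-order function (available via expr in Spark 2.4+; pyspark.sql.functions.transform arrives in Spark 3.1). Upper-casing each name is only an illustrative per-element operation.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
data = [
    ('person', ['john', 'sam', 'jane']),
    ('pet', ['whiskers', 'rover', 'fido']),
]
df = spark.createDataFrame(data, ["type", "names"])

# Apply a per-element expression to the array column (Spark >= 2.4).
df = df.withColumn("names_caps", f.expr("transform(names, x -> upper(x))"))
df.show(truncate=False)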

Effective way to groupBy without using pivot in PySpark

我与影子孤独终老i submitted on 2019-12-14 02:26:13
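A hedged sketch of the usual pivot alternative (see the question excerpt below): conditional aggregation with when inside groupBy spreads kpi_subtype values into columns without calling pivot. The column list is limited to the subtypes visible in the excerpt, and the grouping keys are an assumption.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("2019/08/17 10:01:05", "Server1", "memory", "Total", 100),
        ("2019/08/17 10:01:06", "Server1", "memory", "used", 35),
        ("2019/08/17 10:01:09", "Server1", "memory", "buffer", 8),
    ],
    ["time_stamp", "Hostname", "kpi", "kpi_subtype", "value_current"],
)

# One output column per kpi_subtype, built with when() instead of pivot().
agg = df.groupBy("Hostname", "kpi").agg(
    F.max(F.when(F.col("kpi_subtype") == "Total", F.col("value_current"))).alias("total"),
    F.max(F.when(F.col("kpi_subtype") == "used", F.col("value_current"))).alias("used"),
    F.max(F.when(F.col("kpi_subtype") == "buffer", F.col("value_current"))).alias("buffer"),
)
agg.show()
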
Question: I have a query where I need to calculate memory utilization using pyspark. I had achieved this with Python pandas using pivot, but now I need to do it in pyspark, and pivoting would be an expensive function, so I would like to know if there is any alternative in pyspark for this solution.

time_stamp           Hostname  kpi     kpi_subtype  value_current
2019/08/17 10:01:05  Server1   memory  Total        100
2019/08/17 10:01:06  Server1   memory  used         35
2019/08/17 10:01:09  Server1   memory  buffer       8
2019/08/17 10:02:04  Server1   …