pyspark-sql

Checking if a String contains a sub-string across different DataFrames

旧街凉风 submitted on 2019-12-18 07:14:46
Question: I have 2 dataframes. In df_1, the column id_normalized contains only characters and numbers (normalized), alongside id_no_normalized. Example:

id_normalized | id_no_normalized
--------------|-----------------
ABC           | A_B.C
ERFD          | E.R_FD
12ZED         | 12_Z.ED

In df_2, the column name contains only characters and numbers, with the normalized ids embedded. Example:

name
----------------------------
googleisa12ZEDgoodnavigator
----------------------------
internetABCexplorer
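
A hedged sketch of one way to match the two DataFrames, assuming the goal is to find which id_normalized values appear inside name: a cross join filtered with contains. DataFrame and column names follow the question; this is only one plausible reading of the truncated excerpt.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df_1 = spark.createDataFrame(
    [("ABC", "A_B.C"), ("ERFD", "E.R_FD"), ("12ZED", "12_Z.ED")],
    ["id_normalized", "id_no_normalized"],
)
df_2 = spark.createDataFrame(
    [("googleisa12ZEDgoodnavigator",), ("internetABCexplorer",)], ["name"]
)

# Cross join, then keep rows where the normalized id appears inside name.
matched = (
    df_2.crossJoin(df_1)
        .where(F.col("name").contains(F.col("id_normalized")))
        .select("name", "id_normalized", "id_no_normalized")
)
matched.show(truncate=False)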

PySpark Dataframe from Python Dictionary without Pandas

邮差的信 submitted on 2019-12-18 06:49:49
Question: I am trying to convert the following Python dict into a PySpark DataFrame, but I am not getting the expected output.

dict_lst = {'letters': ['a', 'b', 'c'], 'numbers': [10, 20, 30]}
df_dict = sc.parallelize([dict_lst]).toDF()  # Result not as expected
df_dict.show()

Is there a way to do this without using Pandas?

Answer 1: Quoting myself: I find it's useful to think of the argument to createDataFrame() as a list of tuples, where each entry in the list corresponds to a row in the DataFrame and each element …
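
A minimal sketch of the list-of-tuples approach the answer describes, assuming a SparkSession named spark is available; the zip step is one convenient way to get from the dict to row tuples.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dict_lst = {'letters': ['a', 'b', 'c'], 'numbers': [10, 20, 30]}

# Zip the value lists into one tuple per row, and pass the keys as column names.
rows = list(zip(*dict_lst.values()))          # [('a', 10), ('b', 20), ('c', 30)]
df = spark.createDataFrame(rows, list(dict_lst.keys()))
df.show()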

How to create a DataFrame out of rows while retaining existing schema?

老子叫甜甜 submitted on 2019-12-18 05:24:07
Question: If I call map or mapPartitions and my function receives rows from PySpark, what is the natural way to create either a local PySpark or Pandas DataFrame? Something that combines the rows and retains the schema? Currently I do something like:

def combine(partition):
    rows = [x for x in partition]
    dfpart = pd.DataFrame(rows, columns=rows[0].keys())
    pandafunc(dfpart)

mydf.rdd.mapPartitions(combine)

Answer 1: Spark >= 2.3.0: Since Spark 2.3.0 it is possible to use a Pandas Series or DataFrame by partition or group …
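
A minimal sketch of the Spark >= 2.3 approach the answer alludes to, here via the grouped-map form DataFrame.groupBy(...).applyInPandas (the Spark 3.x spelling). The column names and the doubling operation are purely illustrative.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
mydf = spark.createDataFrame([(1, 10.0), (1, 20.0), (2, 5.0)], ["key", "value"])

def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf is a Pandas DataFrame holding one whole group, with the schema preserved.
    pdf["value"] = pdf["value"] * 2
    return pdf

result = mydf.groupBy("key").applyInPandas(per_group, schema=mydf.schema)
result.show()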

Trouble With Pyspark Round Function

北慕城南 submitted on 2019-12-18 04:48:07
Question: Having some trouble getting the round function in PySpark to work. I have the block of code below, where I'm trying to round the new_bid column to 2 decimal places and rename the column as bid afterwards. I'm importing pyspark.sql.functions as func for reference, and using the round function contained within it:

output = output.select(
    col("ad").alias("ad_id"),
    col("part").alias("part_id"),
    func.round(col("new_bid"), 2).alias("bid")
)

The new_bid column here is of type float; the resulting …
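
A self-contained sketch of the same pattern. A common cause of trouble here is Python's builtin round shadowing (or being shadowed by) the Spark function, which the func. prefix avoids; the input data below is invented for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
output = spark.createDataFrame(
    [("a1", "p1", 1.2345), ("a2", "p2", 9.8765)],
    ["ad", "part", "new_bid"],
)

# func.round is the Spark SQL round, distinct from Python's builtin round().
rounded = output.select(
    col("ad").alias("ad_id"),
    col("part").alias("part_id"),
    func.round(col("new_bid"), 2).alias("bid"),
)
rounded.show()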

PySpark: when function with multiple outputs [duplicate]

谁说我不能喝 submitted on 2019-12-18 04:46:06
Question: This question already has answers here: Spark Equivalent of IF Then ELSE (4 answers). Closed 2 years ago. I am trying to use a "chained when" function. In other words, I'd like to get more than two outputs. I tried using the same logic as concatenating IF functions in Excel:

df.withColumn("device_id", when(col("device")=="desktop",1)).otherwise(when(col("device")=="mobile",2)).otherwise(null))

But that doesn't work, since I can't put a tuple into the "otherwise" function.

Answer 1: Have you tried …
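
A minimal sketch of the chained form the linked duplicate points toward: when calls chain directly off each other, with a single otherwise at the end. Device values follow the question; the fallback of None is an assumption.

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("desktop",), ("mobile",), ("tablet",)], ["device"])

df = df.withColumn(
    "device_id",
    when(col("device") == "desktop", 1)
    .when(col("device") == "mobile", 2)
    .otherwise(None)          # everything else becomes null
)
df.show()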

How to TRUNCATE and/or use wildcards with Databricks

空扰寡人 submitted on 2019-12-17 21:14:54
Question: I'm trying to write a script in Databricks that will select a file based on certain characters in the name of the file, or just on the datestamp in the file name. For example, a file name looks as follows:

LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31

I have created the following code in Databricks:

import datetime
now1 = datetime.datetime.now()
now = now1.strftime("%Y-%m-%d")

Using the above code I tried to select the file using the following:

LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv' …
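
A hedged sketch of one way to do this: Spark's file readers accept glob wildcards in the path, so the formatted datestamp can be combined with * to match the trailing time-of-day. The directory /mnt/raw/ is purely illustrative and not from the question.

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

now = datetime.datetime.now().strftime("%Y-%m-%d")

# Glob pattern: match today's datestamp, with any time suffix before .csv.
path = "/mnt/raw/LCMS_MRD_Delta_LoyaltyAccount_1992_%s*.csv" % now
df = spark.read.csv(path, header=True)
df.show()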

Spark - Window with recursion? - Conditionally propagating values across rows

余生颓废 submitted on 2019-12-17 16:54:32
Question: I have the following dataframe showing the revenue of purchases.

+-------+--------+-------+
|user_id|visit_id|revenue|
+-------+--------+-------+
|      1|       1|      0|
|      1|       2|      0|
|      1|       3|      0|
|      1|       4|    100|
|      1|       5|      0|
|      1|       6|      0|
|      1|       7|    200|
|      1|       8|      0|
|      1|       9|     10|
+-------+--------+-------+

Ultimately I want the new column purch_revenue to show the revenue generated by the purchase in every row. As a workaround, I have also tried to introduce a purchase identifier purch_id which is incremented each time a …
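
A hedged sketch of the usual window-based workaround for this kind of propagation (no recursion): build a purchase identifier as a running count of non-zero revenues, then spread the purchase revenue across each group. Since the excerpt is truncated, the direction of propagation here is only one plausible reading.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, v, r) for v, r in enumerate([0, 0, 0, 100, 0, 0, 200, 0, 10], start=1)],
    ["user_id", "visit_id", "revenue"],
)

w = Window.partitionBy("user_id").orderBy("visit_id")

df = (
    df.withColumn("purch_id", F.sum((F.col("revenue") > 0).cast("int")).over(w))
      .withColumn(
          "purch_revenue",
          F.max("revenue").over(Window.partitionBy("user_id", "purch_id")),
      )
)
df.show()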

Applying a Window function to calculate differences in pySpark

蓝咒 submitted on 2019-12-17 15:42:44
Question: I am using pySpark, and have set up my dataframe with two columns representing a daily asset price as follows:

ind = sc.parallelize(range(1, 5))
prices = sc.parallelize([33.3, 31.1, 51.2, 21.3])
data = ind.zip(prices)
df = sqlCtx.createDataFrame(data, ["day", "price"])

Upon applying df.show() I get:

+---+-----+
|day|price|
+---+-----+
|  1| 33.3|
|  2| 31.1|
|  3| 51.2|
|  4| 21.3|
+---+-----+

Which is fine and all. I would like to have another column that contains the day-to-day returns of the price …
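
A minimal sketch of the standard answer to this: F.lag over a window ordered by day gives the previous price, from which a return can be computed. The simple ratio-minus-one return formula is an assumption about what the asker wants.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    list(zip(range(1, 5), [33.3, 31.1, 51.2, 21.3])), ["day", "price"]
)

w = Window.orderBy("day")
df = df.withColumn("prev_price", F.lag("price").over(w)) \
       .withColumn("return", F.col("price") / F.col("prev_price") - 1)
df.show()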

TypeError: Column is not iterable - How to iterate over ArrayType()?

眉间皱痕 submitted on 2019-12-17 04:07:09
Question: Consider the following DataFrame:

+------+-----------------------+
|type  |names                  |
+------+-----------------------+
|person|[john, sam, jane]      |
|pet   |[whiskers, rover, fido]|
+------+-----------------------+

Which can be created with the following code:

import pyspark.sql.functions as f
data = [
    ('person', ['john', 'sam', 'jane']),
    ('pet', ['whiskers', 'rover', 'fido'])
]
df = sqlCtx.createDataFrame(data, ["type", "names"])
df.show(truncate=False)

Is there a way to directly modify the …
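
A hedged sketch of one way to operate on each element of an ArrayType column without iterating over the Column object: the transform higher-order function (available via expr in Spark 2.4+; pyspark.sql.functions.transform arrives in Spark 3.1). Upper-casing each name is only an illustrative per-element operation.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
data = [
    ('person', ['john', 'sam', 'jane']),
    ('pet', ['whiskers', 'rover', 'fido']),
]
df = spark.createDataFrame(data, ["type", "names"])

# Apply a per-element expression to the array column (Spark >= 2.4).
df = df.withColumn("names_caps", f.expr("transform(names, x -> upper(x))"))
df.show(truncate=False)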

Effective way to groupBy without using pivot in PySpark

我与影子孤独终老i submitted on 2019-12-14 02:26:13
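A hedged sketch of the usual pivot alternative (see the question excerpt below): conditional aggregation with when inside groupBy spreads kpi_subtype values into columns without calling pivot. The column list is limited to the subtypes visible in the excerpt, and the grouping keys are an assumption.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("2019/08/17 10:01:05", "Server1", "memory", "Total", 100),
        ("2019/08/17 10:01:06", "Server1", "memory", "used", 35),
        ("2019/08/17 10:01:09", "Server1", "memory", "buffer", 8),
    ],
    ["time_stamp", "Hostname", "kpi", "kpi_subtype", "value_current"],
)

# One output column per kpi_subtype, built with when() instead of pivot().
agg = df.groupBy("Hostname", "kpi").agg(
    F.max(F.when(F.col("kpi_subtype") == "Total", F.col("value_current"))).alias("total"),
    F.max(F.when(F.col("kpi_subtype") == "used", F.col("value_current"))).alias("used"),
    F.max(F.when(F.col("kpi_subtype") == "buffer", F.col("value_current"))).alias("buffer"),
)
agg.show()
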
Question: I have a query where I need to calculate memory utilization using pyspark. I had achieved this with Python pandas using pivot, but now I need to do it in pyspark, and pivoting would be an expensive function, so I would like to know if there is any alternative in pyspark for this solution.

time_stamp           Hostname  kpi     kpi_subtype  value_current
2019/08/17 10:01:05  Server1   memory  Total        100
2019/08/17 10:01:06  Server1   memory  used         35
2019/08/17 10:01:09  Server1   memory  buffer       8
2019/08/17 10:02:04  Server1   …