pyspark-dataframes

How to explode an array without duplicate records

Submitted by 為{幸葍}努か on 2020-06-17 09:38:06
Question: This is a continuation of the question "pyspark sql Add different Qtr start_date, End_date for exploded rows". Thanks. I have the following dataframe, which has an array as one of its columns:

    +---------------+------------+----------+----------+---+---------+-------+---------+
    |customer_number|sales_target|start_date|end_date  |noq|cf_values|new_sdt|new_edate|
    +---------------+------------+----------+----------+---+---------+-------+---------+
    |A011021        |15          |2020-01-01|2020-12-31|4  |[4,4
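The question asks how to explode the quarter array without producing duplicate rows. A minimal sketch of one common approach, assuming cf_values holds one value per quarter and that per-quarter dates are derived from start_date; the sample row and the 3-month quarter logic are assumptions, not taken from an original answer:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample row mirroring the preview above
    df = spark.createDataFrame(
        [("A011021", 15, "2020-01-01", "2020-12-31", 4, [4, 4, 4, 3])],
        ["customer_number", "sales_target", "start_date", "end_date", "noq", "cf_values"],
    )

    # posexplode keeps the element index, so every array entry maps to exactly
    # one output row and the explode itself introduces no duplicates
    exploded = df.select(
        "customer_number", "sales_target", "start_date", "end_date",
        F.posexplode("cf_values").alias("qtr_index", "qtr_target"),
    )

    # Derive per-quarter start/end dates from the index (assumes consecutive
    # 3-month quarters beginning at start_date)
    exploded = exploded.withColumn(
        "new_sdt", F.expr("add_months(to_date(start_date), qtr_index * 3)")
    ).withColumn(
        "new_edate", F.expr("date_sub(add_months(new_sdt, 3), 1)")
    )
    exploded.show(truncate=False)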

pySpark mapping multiple columns

Submitted by ▼魔方 西西 on 2020-05-19 17:50:33
Question: I need to be able to compare two dataframes using multiple columns. In my pySpark attempt I decided to filter the reference dataframe by one level (reference_df.PrimaryLookupAttributeName compared to df1.LeaseStatus). How can I iterate over primaryLookupAttributeName_List and avoid hardcoding LeaseStatus? The goal is to get the PrimaryLookupAttributeValue values from the reference table into a dictionary, compare them to df1, and output a new dataframe with the found/matched values. I decided to hard-code FOUND because
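A minimal sketch of one way to avoid hardcoding LeaseStatus, assuming a Python list of attribute names and a reference dataframe with PrimaryLookupAttributeName / PrimaryLookupAttributeValue columns; the list contents and the FOUND / NOT FOUND markers are assumptions based on the truncated description:

    from pyspark.sql import functions as F

    # Assumed list of df1 columns that should be looked up in the reference table
    primaryLookupAttributeName_List = ["LeaseStatus", "LeaseRecoveryType"]

    result = df1
    for attr in primaryLookupAttributeName_List:
        # Restrict the reference data to this attribute only
        lookup = (
            reference_df
            .filter(F.col("PrimaryLookupAttributeName") == attr)
            .select(
                F.col("PrimaryLookupAttributeValue").alias(attr),
                F.lit("FOUND").alias(attr + "_status"),
            )
            .dropDuplicates([attr])  # avoid row duplication if values repeat
        )
        # Left join keeps every df1 row; unmatched rows get a NOT FOUND marker
        result = (
            result
            .join(lookup, on=attr, how="left")
            .withColumn(
                attr + "_status",
                F.coalesce(F.col(attr + "_status"), F.lit("NOT FOUND")),
            )
        )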

How to store JSON dataframe with comma separated

Submitted by 心已入冬 on 2020-05-17 08:46:46
Question: I need to write the records of a dataframe to a JSON file. If I write the dataframe to the file directly it is stored like {"a":1} {"b":2}, but I want it written like [{"a":1},{"b":2}]. Can you please help me? Thanks in advance.

Answer 1: Use the to_json function to create an array of JSON objects, then use .saveAsTextFile to save the JSON object. Example:

    # sample dataframe
    from pyspark.sql.functions import *
    df = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "name"])
    df.groupBy(lit("1")).\
        agg(collect_list
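A minimal sketch of how the full to_json / collect_list pattern from the answer might look; the struct column selection and the output path are assumptions:

    from pyspark.sql.functions import lit, collect_list, struct, to_json

    df = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "name"])

    # Collect all rows into a single array of structs, turn that array into one
    # JSON-array string, then save that string as a text file
    json_df = (
        df.groupBy(lit("1"))
          .agg(collect_list(struct("id", "name")).alias("rows"))
          .select(to_json("rows").alias("json"))
    )
    json_df.rdd.map(lambda r: r["json"]).saveAsTextFile("/tmp/json_array_output")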

concatenating two columns in a pyspark data frame in alphabetical order [duplicate]

Submitted by 萝らか妹 on 2020-04-17 22:54:55
Question: This question already has an answer here: how to sort value before concatenate text columns in pyspark (1 answer). Closed 11 days ago. I have a pyspark data frame with 5M rows and I am going to apply fuzzy logic (Levenshtein and Soundex functions) to find duplicates in the first name and last name columns. Before that, I want to re-sequence the first name and last name column values so that I get the correct Levenshtein distance.

    df = df.withColumn('full_name', f.concat(f.col('first'), f
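A minimal sketch of the overall idea, assuming Spark 2.4+ (for array_sort) and an id column to self-join on; the similarity threshold is an assumption:

    from pyspark.sql import functions as f

    # Build an order-independent full_name: sort the two name parts before joining
    df = df.withColumn(
        "full_name",
        f.concat_ws("_", f.array_sort(f.array(f.col("first"), f.col("last")))),
    )

    # Self-join and score candidate pairs with Levenshtein distance and Soundex
    left = df.alias("l")
    right = df.alias("r")
    pairs = (
        left.join(right, f.col("l.id") < f.col("r.id"))
            .withColumn("lev", f.levenshtein(f.col("l.full_name"), f.col("r.full_name")))
            .withColumn(
                "same_soundex",
                f.soundex(f.col("l.full_name")) == f.soundex(f.col("r.full_name")),
            )
    )
    likely_duplicates = pairs.filter((f.col("lev") <= 2) | f.col("same_soundex"))

On 5M rows a plain self-join is effectively a cross join, so in practice a blocking key (for example the Soundex code itself) would normally be added to the join condition first.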

get distinct count from an array in each row using pyspark

Submitted by 孤人 on 2020-04-16 03:31:34
Question: I am looking for the distinct count of the array in each row of a pyspark dataframe.

Input:

    col1
    [1,1,1]
    [3,4,5]
    [1,2,1,2]

Expected output:

    1
    3
    2

I used the code below, but it gives me the length of each array instead (output: 3, 3, 4). Please help me achieve this using a python pyspark dataframe. Thanks in advance!

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    slen = udf(lambda s: len(s), IntegerType())
    count = df.withColumn("Count", slen(df.col1))
    count.show()

Answer 1: For Spark 2.4+ you can use array_distinct and then just get the size of that, to get
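A minimal sketch of the Spark 2.4+ approach the answer starts to describe, combining array_distinct with size:

    from pyspark.sql.functions import array_distinct, size

    # Deduplicate each array first, then count the remaining elements
    count = df.withColumn("Count", size(array_distinct(df.col1)))
    count.show()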

how to sort value before concatenate text columns in pyspark

Submitted by 纵然是瞬间 on 2020-04-07 08:00:13
Question: I need help converting the pandas code below to PySpark (or PySpark SQL) code.

    df["full_name"] = df.apply(lambda x: "_".join(sorted((x["first"], x["last"]))), axis=1)

It basically adds one new column, full_name, which concatenates the values of the first and last columns in sorted order. I have written the code below, but I don't know how to sort the columns' text values before concatenating.

    df = df.withColumn('full_name', f.concat(f.col('first'), f.lit('_'), f.col('last')))

Answer 1: From Spark 2.4+: We can use array
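A minimal sketch of the Spark 2.4+ approach the answer begins to outline, assuming it refers to array_sort combined with concat_ws:

    from pyspark.sql import functions as f

    # Put both name parts into an array, sort it, then join the parts with "_"
    df = df.withColumn(
        "full_name",
        f.concat_ws("_", f.array_sort(f.array(f.col("first"), f.col("last")))),
    )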

How to extract floats from vector columns in PySpark?

Submitted by 风格不统一 on 2020-03-28 06:40:25
Question: printSchema() shows that each column of my Spark DataFrame is of type vector. I tried to get the values out of the [ and ] using the code below (for a single column, col1):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    firstelement = udf(lambda v: float(v[0]), FloatType())
    df.select(firstelement('col1')).show()

However, how can I apply it to all columns of df?

Answer 1: 1. Extract the first element of a single vector column: To get the first
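A minimal sketch of one way to apply the same UDF to every column, assuming all columns of df are vector columns:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    firstelement = udf(lambda v: float(v[0]), FloatType())

    # Apply the UDF to every column and keep the original column names
    df_floats = df.select([firstelement(c).alias(c) for c in df.columns])
    df_floats.show()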

Fill in missing values based on series and populate second row based on previous or next row in pyspark

Submitted by 萝らか妹 on 2020-03-25 17:50:14
Question: I have a CSV with 4 columns. The file is missing some rows of the series in the No column.

Input:

    No  A   B   C
    1   10  50  12
    3   40  50  12
    4   20  60  15
    6   80  80  18

Output:

    No  A   B   C
    1   10  50  12
    2   10  50  12
    3   40  50  12
    4   20  60  15
    5   20  60  15
    6   80  80  18

I need pyspark code to generate the above output.

Source: https://stackoverflow.com/questions/60681807/fill-in-missing-values-based-on-series-and-populate-second-row-based-on-previous
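A minimal sketch of one possible approach, assuming the missing No values should be generated from the observed min/max range and the other columns filled from the previous existing row:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 10, 50, 12), (3, 40, 50, 12), (4, 20, 60, 15), (6, 80, 80, 18)],
        ["No", "A", "B", "C"],
    )

    # Build the complete sequence of No values between the observed min and max
    bounds = df.agg(F.min("No").alias("lo"), F.max("No").alias("hi"))
    all_nos = bounds.select(F.explode(F.sequence("lo", "hi")).alias("No"))

    # Left join so missing rows appear with nulls, then forward-fill each column
    # from the last non-null value in No order (single partition; fine for small data)
    w = Window.orderBy("No").rowsBetween(Window.unboundedPreceding, 0)
    filled = all_nos.join(df, on="No", how="left")
    for c in ["A", "B", "C"]:
        filled = filled.withColumn(c, F.last(c, ignorenulls=True).over(w))

    filled.orderBy("No").show()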