E-num / get Dummies in pyspark

野的像风 2020-12-18 08:58

I would like to create a function in PySpark that takes a DataFrame and a list of parameters (codes / categorical features) and returns the data frame with additional dummy (0/1) columns, one per value of each categorical feature.

4 answers
  • 2020-12-18 09:17

    The solutions provided by Freek Wiemkeijer and Rakesh Kumar are perfectly adequate; however, since I had already coded it up, I thought it was worth posting this generic solution, as it doesn't require hard-coding the column names.

    pivot_cols = ['TYPE','CODE']
    keys = ['ID','TYPE','CODE']
    
    before = sc.parallelize([(1,'A','X1'),
                             (2,'B','X2'),
                             (3,'B','X3'),
                             (1,'B','X3'),
                             (2,'C','X2'),
                             (3,'C','X2'),
                             (1,'C','X1'),
                             (1,'B','X1')]).toDF(['ID','TYPE','CODE'])                         
    
    #Helper function to recursively join a list of dataframes
    #Can be simplified if you only need two columns
    def join_all(dfs,keys):
        if len(dfs) > 1:
            return dfs[0].join(join_all(dfs[1:],keys), on = keys, how = 'inner')
        else:
            return dfs[0]
    
    # Pivot on each categorical column in turn and collect the renamed results
    combined = []
    for pivot_col in pivot_cols:
        pivotDF = before.groupBy(keys).pivot(pivot_col).count()
        new_names = pivotDF.columns[:len(keys)] +  ["e_{0}_{1}".format(pivot_col, c) for c in pivotDF.columns[len(keys):]]        
        df = pivotDF.toDF(*new_names).fillna(0)    
        combined.append(df)
    
    join_all(combined,keys).show()
    

    This gives the following output:

    +---+----+----+--------+--------+--------+---------+---------+---------+
    | ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
    +---+----+----+--------+--------+--------+---------+---------+---------+
    |  1|   A|  X1|       1|       0|       0|        1|        0|        0|
    |  2|   C|  X2|       0|       0|       1|        0|        1|        0|
    |  3|   B|  X3|       0|       1|       0|        0|        0|        1|
    |  2|   B|  X2|       0|       1|       0|        0|        1|        0|
    |  3|   C|  X2|       0|       0|       1|        0|        1|        0|
    |  1|   B|  X3|       0|       1|       0|        0|        0|        1|
    |  1|   B|  X1|       0|       1|       0|        1|        0|        0|
    |  1|   C|  X1|       0|       0|       1|        1|        0|        0|
    +---+----+----+--------+--------+--------+---------+---------+---------+
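
    Since the question asks for a function that takes a DataFrame and a list of categorical columns, the same pivot approach can be wrapped up roughly as follows (a minimal sketch; the name get_dummies and its signature are just illustrative, and it reuses the join_all helper defined above):

    def get_dummies(df, pivot_cols, keys):
        # Build one pivoted dataframe per categorical column, rename the dummy
        # columns, and join everything back together on the key columns
        combined = []
        for pivot_col in pivot_cols:
            pivotDF = df.groupBy(keys).pivot(pivot_col).count()
            new_names = pivotDF.columns[:len(keys)] + ["e_{0}_{1}".format(pivot_col, c) for c in pivotDF.columns[len(keys):]]
            combined.append(pivotDF.toDF(*new_names).fillna(0))
        return join_all(combined, keys)

    get_dummies(before, pivot_cols, keys).show()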
    
  • 2020-12-18 09:19

    First you need to collect the distinct values of TYPE and CODE. Then you can either add a column for each value using withColumn, or build all of them at once with select. Here is sample code using the select approach:

    import pyspark.sql.functions as F
    df = sqlContext.createDataFrame([
        (1, "A", "X1"),
        (2, "B", "X2"),
        (3, "B", "X3"),
        (1, "B", "X3"),
        (2, "C", "X2"),
        (3, "C", "X2"),
        (1, "C", "X1"),
        (1, "B", "X1"),
    ], ["ID", "TYPE", "CODE"])
    
    # Collect the distinct values of each categorical column to the driver
    types = df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect()
    codes = df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect()
    # Build one 0/1 indicator column expression per distinct value
    types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
    codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]
    df = df.select("ID", "TYPE", "CODE", *types_expr + codes_expr)
    df.show()
    

    OUTPUT

    +---+----+----+--------+--------+--------+---------+---------+---------+
    | ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
    +---+----+----+--------+--------+--------+---------+---------+---------+
    |  1|   A|  X1|       1|       0|       0|        1|        0|        0|
    |  2|   B|  X2|       0|       1|       0|        0|        1|        0|
    |  3|   B|  X3|       0|       1|       0|        0|        0|        1|
    |  1|   B|  X3|       0|       1|       0|        0|        0|        1|
    |  2|   C|  X2|       0|       0|       1|        0|        1|        0|
    |  3|   C|  X2|       0|       0|       1|        0|        1|        0|
    |  1|   C|  X1|       0|       0|       1|        1|        0|        0|
    |  1|   B|  X1|       0|       1|       0|        1|        0|        0|
    +---+----+----+--------+--------+--------+---------+---------+---------+
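
    The withColumn alternative mentioned above would look roughly like this (a sketch that reuses the types and codes lists collected earlier, and assumes df still holds only the ID, TYPE and CODE columns, i.e. it is applied instead of the select above):

    # Add one 0/1 indicator column per distinct value, one withColumn call at a time
    df_dummies = df
    for ty in types:
        df_dummies = df_dummies.withColumn("e_TYPE_" + ty, F.when(F.col("TYPE") == ty, 1).otherwise(0))
    for code in codes:
        df_dummies = df_dummies.withColumn("e_CODE_" + code, F.when(F.col("CODE") == code, 1).otherwise(0))
    df_dummies.show()

    The select version builds all the columns in a single projection, so it is usually preferable when there are many distinct values.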
    
  • 2020-12-18 09:19

    The first step is to make a DataFrame from your CSV file.

    See Get CSV to Spark dataframe; the first answer gives a line-by-line example.
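
    For reference, reading the CSV with Spark 2.x could look something like this (a minimal sketch; the file name and options are placeholders for your own data):

    # Read a CSV with a header row into a DataFrame; infer column types from the data
    df = spark.read.csv("your_file.csv", header=True, inferSchema=True)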

    Then you can add the columns. Assume you have a DataFrame object called df, and the columns are: [ID, TYPE, CODE].

    The rest can be done with DataFrame.withColumn() and pyspark.sql.functions.when:

    from pyspark.sql.functions import when
    
    df_with_extra_columns = df.withColumn("e_TYPE_A", when(df.TYPE == "A", 1).otherwise(0)) \
                              .withColumn("e_TYPE_B", when(df.TYPE == "B", 1).otherwise(0))
    

    (This adds the first two columns; you get the point.)

  • 2020-12-18 09:20

    I was looking for the same solution but in Scala; maybe this will help someone:

    import org.apache.spark.sql.functions.{col, when}

    val list = df.select("category").distinct().rdd.map(r => r(0)).collect()
    // Fold over the distinct values, adding one 0/1 indicator column per category
    val oneHotDf = list.foldLeft(df)((acc, category) =>
      acc.withColumn("category_" + category, when(col("category") === category, 1).otherwise(0)))
    