Update a DataFrame column with new values

暖寄归人 asked on 2020-12-17 04:50

df1 has fields id and json; df2 has fields id and json.

df1.count() => 1200; df2.count()

2 Answers
  • 2020-12-17 05:14

    If you want the data from both DataFrames, you can union the two DataFrames:

    import spark.implicits._
    

    First DataFrame

    val df1 = Seq(
      (1, "a"),
      (2, "b"),
      (3, "c")
    ).toDF("id", "value")    
    

    Second DataFrame

    val df2 = Seq(
      (1, "x"), 
      (2, "y")
    ).toDF("id", "value")
    

    To get a result containing the data from both df1 and df2, use union:

    val resultDF = df1.union(df2)
    
    resultDF.show()
    

    Output:

    +---+-----+
    |id |value|
    +---+-----+
    |1  |a    |
    |2  |b    |
    |3  |c    |
    |1  |x    |
    |2  |y    |
    +---+-----+
    
  • 2020-12-17 05:32

    You can achieve this with a single left join.

    Create Example DataFrames

    Using the sample data provided by @Shankar Koirala in his answer.

    data1 = [
      (1, "a"),
      (2, "b"),
      (3, "c")
    ]
    df1 = sqlCtx.createDataFrame(data1, ["id", "value"])
    
    data2 = [
      (1, "x"), 
      (2, "y")
    ]
    
    df2 = sqlCtx.createDataFrame(data2, ["id", "value"])
    

    Do a left join

    Join the two DataFrames using a left join on the id column. This keeps all of the rows from the left DataFrame. For left-side rows whose id has no match in the right DataFrame, the right-side value will be null.

    import pyspark.sql.functions as f
    df1.alias('l').join(df2.alias('r'), on='id', how='left')\
        .select(
            'id',
             f.col('l.value').alias('left_value'),
             f.col('r.value').alias('right_value')
        )\
        .show()
    #+---+----------+-----------+
    #| id|left_value|right_value|
    #+---+----------+-----------+
    #|  1|         a|          x|
    #|  3|         c|       null|
    #|  2|         b|          y|
    #+---+----------+-----------+
    

    Select the desired data

    We will use the fact that the unmatched ids have a null right-side value to select the final columns. Use pyspark.sql.functions.when() to take the right value when it is not null, and otherwise keep the left value.

    df1.alias('l').join(df2.alias('r'), on='id', how='left')\
        .select(
            'id',
            f.when(
                ~f.isnull(f.col('r.value')),
                f.col('r.value')
            ).otherwise(f.col('l.value')).alias('value')
        )\
        .show()
    #+---+-----+
    #| id|value|
    #+---+-----+
    #|  1|    x|
    #|  3|    c|
    #|  2|    y|
    #+---+-----+
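
    Equivalently, pyspark.sql.functions.coalesce() returns the first non-null value among its arguments, so the when()/otherwise() pair can be collapsed into one call (a minimal variant of the same select, not part of the original answer):

    # coalesce picks the right value when the join found a match,
    # and falls back to the left value otherwise.
    df1.alias('l').join(df2.alias('r'), on='id', how='left')\
        .select(
            'id',
            f.coalesce(f.col('r.value'), f.col('l.value')).alias('value')
        )\
        .show()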
    

    You can sort the final output if you want the ids in order, as sketched below.
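
    A minimal sketch (the variable name result_df is illustrative; it just holds the join-and-select result built above):

    # result_df: hypothetical name for the updated DataFrame from the
    # join above; orderBy sorts the rows by the given column.
    result_df = df1.alias('l').join(df2.alias('r'), on='id', how='left')\
        .select(
            'id',
            f.coalesce(f.col('r.value'), f.col('l.value')).alias('value')
        )
    result_df.orderBy('id').show()
    #+---+-----+
    #| id|value|
    #+---+-----+
    #|  1|    x|
    #|  2|    y|
    #|  3|    c|
    #+---+-----+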


    Using Spark SQL

    You can do the same thing with a Spark SQL query:

    df1.registerTempTable('df1')
    df2.registerTempTable('df2')
    
    query = """SELECT l.id, 
    CASE WHEN r.value IS NOT NULL THEN r.value ELSE l.value END AS value 
    FROM df1 l LEFT JOIN df2 r ON l.id = r.id"""
    sqlCtx.sql(query.replace("\n", "")).show()
    #+---+-----+
    #| id|value|
    #+---+-----+
    #|  1|    x|
    #|  3|    c|
    #|  2|    y|
    #+---+-----+
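
    Note: on Spark 2.0+, registerTempTable() is deprecated in favor of createOrReplaceTempView(), and queries are run through the SparkSession. A sketch of the equivalent calls, assuming a SparkSession named spark:

    # createOrReplaceTempView replaces the deprecated registerTempTable.
    df1.createOrReplaceTempView('df1')
    df2.createOrReplaceTempView('df2')
    # spark.sql accepts multi-line strings, so no newline stripping is needed.
    spark.sql(query).show()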
    