Compare two dataframes Pyspark

臣服心动 2021-02-04 22:28

I'm trying to compare two data frames that have the same number of columns (4 columns each), with id as the key column in both data frames.

df1 = spark.read.csv("/path/to/


        
4 answers
  •  深忆病人
    2021-02-04 22:58

    Assuming we can join the two datasets on id, I don't think a UDF is needed. This can be solved with an inner join plus the array and array_remove functions, among others.

    First let's create the two datasets:

    df1 = spark.createDataFrame([
      [1, "ABC", 5000, "US"],
      [2, "DEF", 4000, "UK"],
      [3, "GHI", 3000, "JPN"],
      [4, "JKL", 4500, "CHN"]
    ], ["id", "name", "sal", "Address"])
    
    df2 = spark.createDataFrame([
      [1, "ABC", 5000, "US"],
      [2, "DEF", 4000, "CAN"],
      [3, "GHI", 3500, "JPN"],
      [4, "JKL_M", 4800, "CHN"]
    ], ["id", "name", "sal", "Address"])
    

    Next, we inner-join the two datasets on id and, for each column except id, build the condition df1[col] != df2[col]. When the values differ, the expression returns the column name; otherwise it returns an empty string. These expressions become the elements of an array, from which we finally remove the empty strings:

    from pyspark.sql.functions import col, array, when, array_remove, lit
    
    # get conditions for all columns except id
    conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']
    
    select_expr = [
        col("id"),
        *[df2[c] for c in df2.columns if c != 'id'],
        array_remove(array(*conditions_), "").alias("column_names"),
    ]
    
    df1.join(df2, "id").select(*select_expr).show()
    
    # +---+-----+----+-------+------------+
    # | id| name| sal|Address|column_names|
    # +---+-----+----+-------+------------+
    # |  1|  ABC|5000|     US|          []|
    # |  3|  GHI|3500|    JPN|       [sal]|
    # |  2|  DEF|4000|    CAN|   [Address]|
    # |  4|JKL_M|4800|    CHN| [name, sal]|
    # +---+-----+----+-------+------------+
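
To make the per-row logic easier to follow without a Spark session, here is a minimal plain-Python sketch of the same idea: join on id, compare every non-key column, and keep the names of the columns that differ. The helper name `changed_columns` and the dict-based rows are assumptions made for illustration; they are not part of PySpark or of the answer above.

```python
# Plain-Python sketch of the comparison logic; no Spark required.
# `changed_columns` is a hypothetical helper, not a PySpark function.

def changed_columns(left_rows, right_rows, columns, key="id"):
    """Inner-join two lists of dict rows on `key` and list the differing columns."""
    right_by_key = {row[key]: row for row in right_rows}
    result = {}
    for left in left_rows:
        right = right_by_key.get(left[key])
        if right is None:
            continue  # inner join: skip ids that are missing on the right side
        result[left[key]] = [
            c for c in columns if c != key and left[c] != right[c]
        ]
    return result


columns = ["id", "name", "sal", "Address"]

# Same sample data as df1 and df2 in the answer.
rows1 = [dict(zip(columns, r)) for r in [
    (1, "ABC", 5000, "US"),
    (2, "DEF", 4000, "UK"),
    (3, "GHI", 3000, "JPN"),
    (4, "JKL", 4500, "CHN"),
]]
rows2 = [dict(zip(columns, r)) for r in [
    (1, "ABC", 5000, "US"),
    (2, "DEF", 4000, "CAN"),
    (3, "GHI", 3500, "JPN"),
    (4, "JKL_M", 4800, "CHN"),
]]

diff = changed_columns(rows1, rows2, columns)
print(diff)
# {1: [], 2: ['Address'], 3: ['sal'], 4: ['name', 'sal']}
```

One difference worth keeping in mind: in Spark, `!=` evaluates to null when either side is null, so the `when(...)` expression in the answer falls through to the empty string and a change between null and a value is not reported. `Column.eqNullSafe` is one option if that case matters for your data.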
    
