Compare two DataFrames in PySpark

臣服心动 2021-02-04 22:28

I'm trying to compare two DataFrames that have the same number of columns (4 columns), with id as the key column in both DataFrames.

df1 = "/path/to/

  •  青春惊慌失措
    2021-02-04 23:01

    Here is a solution using a UDF. I renamed the columns of the first DataFrame dynamically (adding an "x_" prefix) so that the column names are not ambiguous when the two DataFrames are compared after the join. Go through the code below and let me know if you have any concerns.

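    >>> from pyspark.sql.functions import *
    >>> df.show()
    +---+----+----+-------+
    | id|name| sal|Address|
    +---+----+----+-------+
    |  1| ABC|5000|     US|
    |  2| DEF|4000|     UK|
    |  3| GHI|3000|    JPN|
    |  4| JKL|4500|    CHN|
    +---+----+----+-------+
    >>> df1.show()
    +---+----+----+-------+
    | id|name| sal|Address|
    +---+----+----+-------+
    |  1| ABC|5000|     US|
    |  2| DEF|4000|    CAN|
    |  3| GHI|3500|    JPN|
    |  4|JKLM|4800|    CHN|
    +---+----+----+-------+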
    >>> df2 = df.select([col(c).alias("x_" + c) for c in df.columns])
    >>> df3 = df1.join(df2, col("id") == col("x_id"), "left")
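    >>> # UDF declaration: collect the names of the columns whose values differ
    >>> from pyspark.sql.types import ArrayType, StringType
    >>> def CheckMatch(Column, r):
    ...     ColList = Column.split(",")
    ...     # compare each column with its "x_"-prefixed counterpart in the same row
    ...     return [cc for cc in ColList if r[cc] != r["x_" + cc]]
    ...
    >>> CheckMatchUDF = udf(CheckMatch, ArrayType(StringType()))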
    >>> # final columns to select: the original columns plus the list of changed columns
    >>> finalCol = df1.columns
    >>> finalCol.append("column_names")
    >>> df3.withColumn("column_names", CheckMatchUDF(lit(','.join(df1.columns)), struct([df3[x] for x in df3.columns]))).select(finalCol).show()
    +---+----+----+-------+------------+
    | id|name| sal|Address|column_names|
    +---+----+----+-------+------------+
    |  1| ABC|5000|     US|          []|
    |  2| DEF|4000|    CAN|   [Address]|
    |  3| GHI|3500|    JPN|       [sal]|
    |  4|JKLM|4800|    CHN| [name, sal]|
    +---+----+----+-------+------------+
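
To make the row-level check easier to follow, here is a minimal plain-Python sketch of what the UDF does for a single joined row. It does not use Spark at all. The function name `changed_columns` and the sample dictionaries are hypothetical and only mirror the tables above; inside Spark the same comparison runs once per row.

```python
# Plain-Python sketch of the per-row comparison; no Spark required.
# `changed_columns` and the sample rows are hypothetical, for illustration only.

def changed_columns(columns, row, prefix="x_"):
    """Return the names in `columns` whose value differs from the prefixed copy."""
    return [c for c in columns if row[c] != row[prefix + c]]


columns = ["id", "name", "sal", "Address"]

# A joined row for id 4: current values plus the "x_"-prefixed values
# taken from the other DataFrame.
joined_row = {
    "id": 4, "name": "JKLM", "sal": 4800, "Address": "CHN",
    "x_id": 4, "x_name": "JKL", "x_sal": 4500, "x_Address": "CHN",
}

# A row with no partner in the left join: every "x_" value is None.
unmatched_row = {
    "id": 5, "name": "MNO", "sal": 1000, "Address": "IND",
    "x_id": None, "x_name": None, "x_sal": None, "x_Address": None,
}

print(changed_columns(columns, joined_row))     # ['name', 'sal']
print(changed_columns(columns, unmatched_row))  # ['id', 'name', 'sal', 'Address']
```

Note the second case: because the join is a left join, an id that exists in only one DataFrame is reported with every column marked as changed, including id itself. If that is not what you want, filter out rows where x_id is null before applying the check.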
