PySpark: dynamic union of DataFrames with different columns

Backend · Unresolved · 3 answers · 877 views
臣服心动 asked 2021-01-07 15:15

Consider the arrays shown here. I have 3 arrays:

Array 1:

C1  C2  C3
1   2   3
9   5   6

Array 2:

C2  C3  C4
11  12  13
10  15  16

Array 3:

C1  C4
111 112
110 115

How can I dynamically union these into a single DataFrame when their columns differ?
3 Answers
  • 2021-01-07 15:44

    There are probably better ways to do it, but the approach below may be useful to someone in the future.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit
    
    spark = SparkSession.builder\
        .appName("DynamicFrame")\
        .getOrCreate()
    
    df01 = spark.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
    df02 = spark.createDataFrame([(11,12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
    df03 = spark.createDataFrame([(111,112), (110, 115)], ("C1", "C4"))
    
    dataframes = [df01, df02, df03]
    
    # Create a list of all the column names and sort them
    cols = set()
    for df in dataframes:
        for x in df.columns:
            cols.add(x)
    cols = sorted(cols)
    
    # Create a dictionary with all the dataframes
    dfs = {}
    for i, d in enumerate(dataframes):
        new_name = 'df' + str(i)  # New name for the key, the dataframe is the value
        dfs[new_name] = d
        # Loop through all column names. Add the missing columns to the dataframe (with value 0)
        for x in cols:
            if x not in d.columns:
                dfs[new_name] = dfs[new_name].withColumn(x, lit(0))
        dfs[new_name] = dfs[new_name].select(cols)  # Use 'select' to get the columns sorted
    
    # Now put it all together with a loop (union)
    result = dfs['df0']            # Take the first dataframe, add the others to it
    dfs_to_add = list(dfs.keys())  # List of all the dataframe keys (list() so we can remove)
    dfs_to_add.remove('df0')       # Remove the first one, because it is already in the result
    for x in dfs_to_add:
        result = result.union(dfs[x])
    result.show()
    

    Output:

    +---+---+---+---+
    | C1| C2| C3| C4|
    +---+---+---+---+
    |  1|  2|  3|  0|
    |  9|  5|  6|  0|
    |  0| 11| 12| 13|
    |  0| 10| 15| 16|
    |111|  0|  0|112|
    |110|  0|  0|115|
    +---+---+---+---+
    
  • 2021-01-07 15:44

    Here's the version in Scala:

    https://stackoverflow.com/a/60702657/9445912

    On the question:

    Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema

  • 2021-01-07 15:47

    I would try

    df = df1.join(df2, ['each', 'shared', 'col'], how='full')
    