Concatenate two PySpark dataframes

独厮守ぢ 2020-12-02 16:28

I'm trying to concatenate two PySpark dataframes that each have some columns the other doesn't:

from pyspark.sql.functions import randn, rand

df_1 = sqlContext.range(0, 10)
df_2 = sqlContext.range(11, 20)
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=27).alias("uniform"), randn(seed=10).alias("normal_2"))

10 Answers
  • 2020-12-02 17:17

    To make this more generic, keeping all columns from both df1 and df2:

    import pyspark.sql.functions as F
    
    # Keep all columns in either df1 or df2
    def outer_union(df1, df2):
    
        # Add missing columns to df1
        left_df = df1
        for column in set(df2.columns) - set(df1.columns):
            left_df = left_df.withColumn(column, F.lit(None))
    
        # Add missing columns to df2
        right_df = df2
        for column in set(df1.columns) - set(df2.columns):
            right_df = right_df.withColumn(column, F.lit(None))
    
        # Make sure columns are ordered the same
        return left_df.union(right_df.select(left_df.columns))
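
    For example (a sketch, assuming the df_1 and df_2 from the question):

    df_concat = outer_union(df_1, df_2)
    df_concat.show()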
    
  • 2020-12-02 17:22

    You can use unionByName to do this:

    df = df_1.unionByName(df_2)
    

    unionByName is available since Spark 2.3.0.
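
    Note that plain unionByName requires both dataframes to have the same set of columns. Since Spark 3.1.0 you can pass allowMissingColumns=True to null-fill columns that exist on only one side (a minimal sketch using the question's df_1 and df_2):

    # Spark 3.1.0+: columns present in only one dataframe are filled with nulls
    df = df_1.unionByName(df_2, allowMissingColumns=True)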

  • 2020-12-02 17:24
    df_concat = df_1.union(df_2)
    

    union() matches columns by position, so both dataframes need identical columns in the same order; you may first need to use withColumn() to create matching normal_1 and normal_2 columns on each side, as sketched below.
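
    A minimal sketch of that alignment, assuming df_1 has a normal column and df_2 has a normal_2 column as in the question:

    from pyspark.sql.functions import lit

    # Null-fill the column each side is missing, then select in the same
    # order so the positional union() lines up correctly
    df_1_aligned = df_1.withColumn("normal_2", lit(None))
    df_2_aligned = df_2.withColumn("normal", lit(None))
    df_concat = df_1_aligned.union(df_2_aligned.select(df_1_aligned.columns))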

  • 2020-12-02 17:25

    To concatenate multiple PySpark dataframes into one:

    from functools import reduce
    
    reduce(lambda x, y: x.union(y), [df_1, df_2])
    

    You can replace [df_1, df_2] with a list of any length.
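
    The same reduce pattern composes with any pairwise union; for instance, combined with the outer_union helper from the first answer it also handles dataframes whose columns differ (a sketch, with df_3 as a hypothetical third dataframe):

    from functools import reduce

    # df_3 is a hypothetical stand-in for any additional dataframe
    df_all = reduce(outer_union, [df_1, df_2, df_3])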
