How to load data in chunks from a pandas DataFrame to a Spark DataFrame

悲&欢浪女 2021-01-20 06:59

I have read data in chunks over a pyodbc connection using something like this:

import pandas as pd
import pyodbc
conn = pyodbc.connect(\"Some connection Det         


        
1 Answer
  • 2021-01-20 07:20

    The documentation for .unionAll() states that it returns a new DataFrame, so you have to assign the result back to df2:

    i = 0
    for chunk in df1:
        if i == 0:
            # first chunk: create the initial Spark DataFrame
            df2 = sqlContext.createDataFrame(chunk)
        else:
            # later chunks: union with what we have so far and reassign
            df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
        i = i + 1
    

    Furthermore, you can use enumerate() instead, which avoids having to manage the i variable yourself:

    for i, chunk in enumerate(df1):
        if i == 0:
            df2 = sqlContext.createDataFrame(chunk)
        else:
            df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
    

    Furthermore, the documentation for .unionAll() states that it is deprecated (as of Spark 2.0) and that you should use .union() instead, which acts like UNION ALL in SQL:

    for i, chunk in enumerate(df1):
        if i == 0:
            df2 = sqlContext.createDataFrame(chunk)
        else:
            df2 = df2.union(sqlContext.createDataFrame(chunk))
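
    Since .union() keeps duplicate rows (that is the UNION ALL semantics), appending chunks this way never silently drops data. A quick illustration, assuming the same sqlContext as above:

    a = sqlContext.createDataFrame([(1,)], ['x'])
    b = sqlContext.createDataFrame([(1,)], ['x'])
    a.union(b).count()             # 2 -- duplicates kept, like UNION ALL
    a.union(b).distinct().count()  # 1 -- chain .distinct() for SQL UNION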
    

    Edit:
    Furthermore (I'll stop saying "furthermore", but not before saying it one last time): as @zero323 points out, don't call .union() in a loop. Each call nests another union into the DataFrame's lineage, so the execution plan grows with every chunk. Instead, collect the chunks in a list and union them all at once:

    def unionAll(*dfs):
        """By @zero323, from http://stackoverflow.com/a/33744540/42346"""
        first, *rest = dfs  # Python 3.x; for 2.x you'll have to unpack manually
        # Union all the underlying RDDs in one step and rebuild a single
        # DataFrame using the first chunk's schema, avoiding a nested plan.
        return first.sql_ctx.createDataFrame(
            first.sql_ctx._sc.union([df.rdd for df in dfs]),
            first.schema
        )
    
    df_list = []
    for chunk in df1:
        df_list.append(sqlContext.createDataFrame(chunk))

    df_all = unionAll(*df_list)  # note the unpacking: unionAll takes *dfs, not a list
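
    If you'd rather stay in the DataFrame API, a functools.reduce one-liner is a compact alternative (a sketch only: it builds the same nested plan as the loop, so the RDD-based unionAll above scales better for many chunks):

    from functools import reduce  # a builtin on Python 2.x
    from pyspark.sql import DataFrame

    # Fold the per-chunk DataFrames into one; each step adds another
    # union to the plan, unlike the single RDD union used above.
    df_all = reduce(DataFrame.union, df_list)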
    