I have read data in chunks over a pyodbc connection using something like this:
import pandas as pd
import pyodbc
conn = pyodbc.connect("Some connection Det
The documentation for .unionAll() states that it returns a new DataFrame, so you have to assign the result back to df2:
i = 0
for chunk in df1:
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
    i = i + 1
Furthermore, you can use enumerate() to avoid having to manage the i variable yourself:
for i, chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
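If you want to see the pattern without a Spark session, here is a minimal sketch using plain lists as hypothetical stand-ins for the chunks. List concatenation with + returns a new list, much like .unionAll() returns a new DataFrame, so the result has to be assigned back in the same way. The names chunks and combined are made up for illustration.

```python
# Hypothetical stand-ins for the chunks; in the real code these are DataFrames.
chunks = [[1, 2], [3, 4], [5]]

combined = None
for i, chunk in enumerate(chunks):  # enumerate yields (index, item) pairs
    if i == 0:
        # First chunk: start the accumulated result.
        combined = list(chunk)
    else:
        # "+" returns a new list, so assign it back (as with .unionAll()).
        combined = combined + chunk
```

Forgetting the assignment in the else branch would leave combined holding only the first chunk, which is the same symptom you get when the result of .unionAll() is discarded.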
Furthermore, the documentation for .unionAll() states that .unionAll() is deprecated and that you should now use .union(), which acts like UNION ALL in SQL:
for i, chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.union(sqlContext.createDataFrame(chunk))
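To make "acts like UNION ALL" concrete, here is a small Spark-free sketch using the standard library's sqlite3 module. The tables a and b and their contents are made up for illustration. UNION ALL keeps duplicate rows while plain UNION removes them; Spark's .union() behaves like the former, so you'd follow it with .distinct() if you want de-duplication.

```python
import sqlite3

# In-memory database with two tiny, hypothetical tables that share the value 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (x INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# UNION ALL concatenates the rows and keeps the duplicate 2.
union_all_rows = conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b"
).fetchall()

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute(
    "SELECT x FROM a UNION SELECT x FROM b ORDER BY x"
).fetchall()

conn.close()
```

Here union_all_rows has four rows and union_rows has three, because the shared value is only collapsed by UNION.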
Edit:
Furthermore, I'll stop saying furthermore, but not before I say furthermore: as @zero323 says, let's not use .union() in a loop, since chaining unions one at a time makes the query plan grow with every chunk. Let's instead do something like:
def unionAll(*dfs):
    ' by @zero323 from here: http://stackoverflow.com/a/33744540/42346 '
    first, *rest = dfs  # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )
df_list = []
for chunk in df1:
    df_list.append(sqlContext.createDataFrame(chunk))
# unionAll takes the DataFrames as separate arguments, so unpack the list with *
df_all = unionAll(*df_list)
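One thing to watch at the call site: because unionAll is declared with *dfs, a list of DataFrames has to be unpacked with * when you call it. Below is a Spark-free sketch of the difference, with strings as hypothetical stand-ins for DataFrames and a made-up helper named describe. It also shows one way to do the manual unpacking that the Python 2.x comment refers to.

```python
def describe(*dfs):
    # Python 2-compatible equivalent of `first, *rest = dfs`.
    first, rest = dfs[0], dfs[1:]
    return first, len(rest)

# Hypothetical stand-ins for the Spark DataFrames built from each chunk.
df_list = ["df_a", "df_b", "df_c"]

# With *, each element becomes its own positional argument.
unpacked = describe(*df_list)

# Without *, the whole list arrives as a single argument.
not_unpacked = describe(df_list)
```

In the second call first is the list itself, and a list has no sql_ctx attribute, so the real unionAll would fail with an AttributeError if it were called as unionAll(df_list).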