I am trying to concatenate DataFrames built from the following two CSV files:
df_a: https://www.dropbox.com/s/slcu7o7yyottujl/df_current.csv?dl=0
df_b: https://www.dropbo
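I believe that this error occurs if the following two conditions are met: the DataFrames have different columns, i.e.
(df1.columns == df2.columns)
is False, and at least one of them has a duplicated column name.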
Basically, if you concat DataFrames with columns [A,B,C] and [B,C,D], pandas can work out how to make one series for each distinct column name. But if I then try to join a third DataFrame with columns [B,B,C], it does not know which column to append to and ends up with fewer distinct columns than it thinks it needs.
If your DataFrames are such that df1.columns == df2.columns, then it will work anyway. So you can join [B,B,C] to [B,B,C], but not to [C,B,B]; when the columns are identical, pandas probably just matches them by integer position.
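The two conditions described above can be checked before calling pd.concat. This is only a sketch: the helper name concat_is_risky is made up for this example, and it encodes the rule of thumb from this answer, not what pandas actually does internally.

```python
import pandas as pd


def concat_is_risky(dfs):
    """Hypothetical helper: True when the rule of thumb predicts a failure.

    Flags the case where the column labels differ between frames AND
    at least one frame carries a duplicated column label.
    """
    first = dfs[0].columns
    # Index.equals compares labels and their order
    columns_differ = any(not first.equals(df.columns) for df in dfs[1:])
    has_duplicates = any(not df.columns.is_unique for df in dfs)
    return columns_differ and has_duplicates


abc = pd.DataFrame([[1, 2, 3]], columns=["A", "B", "C"])
bcd = pd.DataFrame([[4, 5, 6]], columns=["B", "C", "D"])
bbc = pd.DataFrame([[7, 8, 9]], columns=["B", "B", "C"])
cbb = pd.DataFrame([[7, 8, 9]], columns=["C", "B", "B"])

print(concat_is_risky([abc, bcd]))  # False: columns differ, no duplicates
print(concat_is_risky([bbc, bbc]))  # False: duplicates, identical columns
print(concat_is_risky([bbc, cbb]))  # True: duplicates and a different order
print(concat_is_risky([abc, bcd, bbc]))  # True
```

A check like this only tells you when to expect trouble; it does not repair the DataFrames.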
You can get around this issue with a 'manual' concatenation. In this case your list is
list_of_dfs = [df_a, df_b]
And instead of running
giant_concat_df = pd.concat(list_of_dfs, axis=0)
you can turn each of the DataFrames into a list of dictionaries and then make a new DataFrame from these lists (merged with chain):
from itertools import chain
import pandas as pd
# one dict per row, keyed by column name
list_of_dicts = [cur_df.T.to_dict().values() for cur_df in list_of_dfs]
giant_concat_df = pd.DataFrame(list(chain(*list_of_dicts)))
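To see what the manual concatenation does with a duplicated column, here is a small self-contained sketch. The function name manual_concat and the toy frames are my own assumptions for illustration. Note that each row becomes a plain dict, so a duplicated label collapses to a single key and only the last value survives; rows that share an index label within one DataFrame would collapse the same way, so you may want to call reset_index(drop=True) first.

```python
from itertools import chain

import pandas as pd


def manual_concat(dfs):
    """Hypothetical wrapper around the row-dict approach shown above."""
    # One dict per row; duplicated column labels collapse to one key
    rows_per_df = [df.T.to_dict().values() for df in dfs]
    return pd.DataFrame(list(chain(*rows_per_df)))


abc = pd.DataFrame([[1, 2, 3]], columns=["A", "B", "C"])
bbc = pd.DataFrame([[4, 5, 6]], columns=["B", "B", "C"])

result = manual_concat([abc, bbc])
print(result)
# The second row has B == 5: the first B value (4) was overwritten,
# and A is NaN because bbc has no column A.
```

So the workaround avoids the error, but it silently drops data from the duplicated column.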
The answers here did not solve my issue, but this answer did.
The issue was duplicated columns in one or both DataFrames.
Here's a duplicated-column fix (as per the answer above):
df = df.loc[:,~df.columns.duplicated()]
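As a quick illustration of that one-liner, the sketch below applies it to two toy DataFrames (the data is invented for this example) and then concatenates them. columns.duplicated() marks every repeat of a label after its first occurrence, so the first column with each name is kept and the later ones are dropped, along with their data.

```python
import pandas as pd

df1 = pd.DataFrame([["a", "b", "id", 1]], columns=["A", "B", "id", "id"])
df2 = pd.DataFrame([["b", "c", "id", 2]], columns=["B", "C", "id", "id"])

# Keep only the first column for each label in every frame
deduped = [df.loc[:, ~df.columns.duplicated()] for df in (df1, df2)]

result = pd.concat(deduped, ignore_index=True)
print(result)
# Only the first "id" column (holding the string "id") survives;
# the numeric second "id" column is gone.
```

If the dropped column holds data you need, renaming it is the safer choice.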
Unfortunately, the source files are no longer available, so I can't check my solution against your case. In my case the error occurred when a DataFrame had two columns with the same name (I had ID and id columns, which I then converted to lower case, so they became the same). Here is an example which gives me the error in question:
import pandas as pd
df1 = pd.DataFrame(data=[
['a', 'b', 'id', 1],
['a', 'b', 'id', 2]
], columns=['A', 'B', 'id', 'id'])
df2 = pd.DataFrame(data=[
['b', 'c', 'id', 1],
['b', 'c', 'id', 2]
], columns=['B', 'C', 'id', 'id'])
pd.concat([df1, df2])
>>> AssertionError: Number of manager items must equal union of block items
# manager items: 4, # tot_items: 5
Removing or renaming one of the duplicated id columns in each DataFrame makes this code work.
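If you would rather rename than remove, something along these lines should do it. This is a rough sketch: make_unique is a helper I made up for this example, and its naming scheme (appending _1, _2, ...) is an arbitrary choice that does not guard against clashing with a label that already exists.

```python
import pandas as pd


def make_unique(labels):
    """Hypothetical helper: append _1, _2, ... to repeated labels."""
    seen = {}
    unique = []
    for label in labels:
        count = seen.get(label, 0)
        # First occurrence keeps its name; repeats get a numeric suffix
        unique.append(label if count == 0 else f"{label}_{count}")
        seen[label] = count + 1
    return unique


df1 = pd.DataFrame(data=[
    ['a', 'b', 'id', 1],
    ['a', 'b', 'id', 2]
], columns=['A', 'B', 'id', 'id'])
df2 = pd.DataFrame(data=[
    ['b', 'c', 'id', 1],
    ['b', 'c', 'id', 2]
], columns=['B', 'C', 'id', 'id'])

# Rename the duplicates in both frames before concatenating
df1.columns = make_unique(df1.columns)
df2.columns = make_unique(df2.columns)

result = pd.concat([df1, df2], ignore_index=True)
print(result.columns.tolist())
```

Both id columns are kept this way, the second one under the name id_1.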