Question
I am loading several Excel files in a for loop. I would like to drop the NaN rows from the last file loaded, and drop the duplicated Company, Emails and Created columns from every file except the last one loaded.
Here is my current for loop (and the subsequent merge into a single DataFrame):
for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    all_users_sheets_hosts.append(df)
    j = re.search('(\d+)', f)
    df.columns = df.columns.str.replace('.*Hosted Meetings.*', 'Hosted Meetings' + ' ' + j.group(1))
all_users_sheets_hosts = reduce(lambda left,right: pd.merge(left,right,on=['First Name', 'Last Name'], how='outer'), all_users_sheets_hosts)
Here are the first few rows of the resulting DF:
Company_x First Name Last Name Emails_x Created_x Hosted Meetings 03112016 Facilitated Meetings_x Attended Meetings_x Company_y Emails_y ... Created_x Hosted Meetings 04122016 Facilitated Meetings_x Attended Meetings_x Company_y Emails_y Created_y Hosted Meetings 04212016 Facilitated Meetings_y Attended Meetings_y
0 TS X Y X@Y.com 03/10/2016 0.0 0.0 0.0 TS X@Y.com ... 03/10/2016 0.0 0.0 2.0 NaN NaN NaN NaN NaN NaN
1 TS X Y X@Y.com 03/10/2016 0.0 0.0 0.0 TS X@Y.com ... 01/25/2016 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN
2 TS X Y X@Y.com 03/10/2016 0.0 0.0 0.0 TS X@Y.com ... 04/06/2015 9.0 10.0 17.0 NaN NaN NaN NaN NaN NaN
Answer 1:
To prevent multiple Company, Emails, Created, Facilitated Meetings and Attended Meetings columns, drop them from the right DataFrame. To remove rows with all NaN values, use result.dropna(how='all', axis=0):
import functools
import glob
import re

import pandas as pd

all_users_sheets_hosts = []
for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    # Tag the "Hosted Meetings" column with the digits found in the file name
    j = re.search(r'(\d+)', f)
    df.columns = df.columns.str.replace('.*Hosted Meetings.*',
                                        'Hosted Meetings' + ' ' + j.group(1),
                                        regex=True)  # pattern is a regex, so say so explicitly
    all_users_sheets_hosts.append(df)

# Drop rows of all NaNs from the final DataFrame in `all_users_sheets_hosts`
all_users_sheets_hosts[-1] = all_users_sheets_hosts[-1].dropna(how='all', axis=0)

def mergefunc(left, right):
    # Drop the shared columns from the right frame so they are not duplicated
    cols = ['Company', 'Emails', 'Created', 'Facilitated Meetings', 'Attended Meetings']
    right = right.drop(cols, axis=1)
    result = pd.merge(left, right, on=['First Name', 'Last Name'], how='outer')
    return result

all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)
Since the Company et al. columns will only exist in the left DataFrame, there will be no proliferation of those columns. Note, however, that if the left and right DataFrames have different values in those columns, only the values from the first DataFrame in all_users_sheets_hosts will be kept.
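To make that caveat concrete, here is a minimal sketch using two small in-memory DataFrames instead of the Excel files. The names, e-mail addresses and the shortened SHARED list are made up for illustration; only the drop-then-merge pattern comes from the answer above.

```python
import functools

import pandas as pd

KEYS = ["First Name", "Last Name"]
SHARED = ["Company", "Emails"]  # shortened, hypothetical stand-in for the full column list

# Toy stand-ins for two exported sheets (all values are invented)
march = pd.DataFrame({
    "First Name": ["Ann", "Bob"],
    "Last Name": ["Lee", "Ray"],
    "Company": ["TS", "TS"],
    "Emails": ["ann@example.com", "bob@example.com"],
    "Hosted Meetings 03112016": [0.0, 2.0],
})
april = pd.DataFrame({
    "First Name": ["Ann", "Cy"],
    "Last Name": ["Lee", "Fox"],
    "Company": ["TS-renamed", "TS"],  # Ann's company differs from the March sheet
    "Emails": ["ann@example.com", "cy@example.com"],
    "Hosted Meetings 04122016": [1.0, 5.0],
})

def merge_keep_left(left, right):
    # Remove the shared columns from the right frame before merging on the name keys
    right = right.drop(columns=SHARED)
    return pd.merge(left, right, on=KEYS, how="outer")

merged = functools.reduce(merge_keep_left, [march, april])
print(merged)
```

In this sketch Ann keeps the Company value from the first sheet, and no _x/_y columns appear. A person who shows up only in a later sheet (Cy here) ends up with NaN in Company and Emails, because those columns were dropped from the frame that contained them.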
Alternatively, if the left and right DataFrames have the same values for the Company et al. columns, then another option would be to simply merge on those columns too:
def mergefunc(left, right):
    cols = ['First Name', 'Last Name', 'Company', 'Emails', 'Created',
            'Facilitated Meetings', 'Attended Meetings']
    result = pd.merge(left, right, on=cols, how='outer')
    return result

all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)
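The "same values" condition matters here. The following sketch, again with invented toy data and a shortened key list, suggests what happens when one of the extra key columns disagrees between two sheets: the outer merge treats the two records as different keys and emits a separate row for each.

```python
import pandas as pd

COLS = ["First Name", "Last Name", "Company"]  # shortened, hypothetical key list

left = pd.DataFrame({
    "First Name": ["Ann", "Bob"],
    "Last Name": ["Lee", "Ray"],
    "Company": ["TS", "TS"],
    "Hosted Meetings 03112016": [0.0, 2.0],
})
right = pd.DataFrame({
    "First Name": ["Ann", "Bob"],
    "Last Name": ["Lee", "Ray"],
    "Company": ["TS", "Acme"],  # Bob's company differs between the two sheets
    "Hosted Meetings 04122016": [1.0, 3.0],
})

merged = pd.merge(left, right, on=COLS, how="outer")
print(merged)

# Count how many rows each person ends up with after the merge
rows_per_person = merged.groupby(["First Name", "Last Name"]).size()
print(rows_per_person)
```

Ann's rows line up into a single record, while Bob appears twice, each row holding NaN for the sheet that did not match. If the shared columns can drift between exports, the drop-from-right variant is probably the safer choice.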
Source: https://stackoverflow.com/questions/36944960/dropping-nan-rows-certain-columns-in-specific-excel-files-using-glob-merge