Dropping NaN rows, certain columns in specific excel files using glob/merge

Question

I am loading several Excel files in a for loop and merging them into one DataFrame. I would like to drop the NaN rows from the final file that is loaded, and drop the duplicated Company, Emails and Created columns from every file except the final one loaded.

Here is my for loop (and subsequent merging into a single DF), currently:

for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    all_users_sheets_hosts.append(df)
    j = re.search(r'(\d+)', f)
    df.columns = df.columns.str.replace('.*Hosted Meetings.*', 'Hosted Meetings' + ' ' + j.group(1), regex=True)

all_users_sheets_hosts = reduce(lambda left,right: pd.merge(left,right,on=['First Name', 'Last Name'], how='outer'), all_users_sheets_hosts)

Here are the first few rows of the resulting DF:

Company_x   First Name  Last Name   Emails_x    Created_x   Hosted Meetings 03112016    Facilitated Meetings_x  Attended Meetings_x Company_y   Emails_y    ... Created_x   Hosted Meetings 04122016    Facilitated Meetings_x  Attended Meetings_x Company_y   Emails_y    Created_y   Hosted Meetings 04212016    Facilitated Meetings_y  Attended Meetings_y
0   TS  X Y X@Y.com 03/10/2016  0.0 0.0 0.0 TS  X@Y.com ... 03/10/2016  0.0 0.0 2.0 NaN NaN NaN NaN NaN NaN
1   TS  X Y X@Y.com 03/10/2016  0.0 0.0 0.0 TS  X@Y.com ... 01/25/2016  0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN
2   TS  X Y X@Y.com 03/10/2016  0.0 0.0 0.0 TS  X@Y.com ... 04/06/2015  9.0 10.0    17.0    NaN NaN NaN NaN NaN NaN

Answer 1:

To prevent multiple Company, Emails, Created, Facilitated Meetings and Attended Meetings columns, drop them from the right DataFrame. To remove rows with all NaN values, use result.dropna(how='all', axis=0):

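import functools
import glob
import re

import pandas as pd

all_users_sheets_hosts = []
for f in glob.glob("./gowall-users-export-*.xlsx"):
    df = pd.read_excel(f)
    # Tag the Hosted Meetings column with the digits found in the file name
    j = re.search(r'(\d+)', f)
    # regex=True is required on pandas >= 2.0, where the default is a literal match
    df.columns = df.columns.str.replace('.*Hosted Meetings.*',
                                        'Hosted Meetings' + ' ' + j.group(1),
                                        regex=True)
    all_users_sheets_hosts.append(df)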

# Drop rows of all NaNs from the final DataFrame in `all_users_sheets_hosts`
all_users_sheets_hosts[-1] = all_users_sheets_hosts[-1].dropna(how='all', axis=0)

def mergefunc(left, right):
    cols = ['Company', 'Emails', 'Created', 'Facilitated Meetings', 'Attended Meetings']
    right = right.drop(cols, axis=1)
    result = pd.merge(left, right, on=['First Name', 'Last Name'], how='outer')
    return result

all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)

Since the Company et al. columns will only exist in the left DataFrame, there will be no proliferation of those columns. Note, however, that if the left and right DataFrames have different values in those columns, only the values from the first DataFrame in all_users_sheets_hosts will be kept.
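As a rough illustration of that caveat, the sketch below runs the drop-then-merge approach on two tiny, made-up DataFrames. Everything in it is hypothetical (the person, the example.com address, the shortened SHARED list) and only stands in for the real exports; the second frame deliberately carries a different Company value and one all-NaN row.

```python
import functools

import pandas as pd

KEYS = ["First Name", "Last Name"]
SHARED = ["Company", "Emails"]  # shortened, hypothetical list for illustration


def mergefunc(left, right):
    # Drop the shared columns from the right frame so they are not duplicated.
    right = right.drop(columns=SHARED)
    return pd.merge(left, right, on=KEYS, how="outer")


# Hypothetical toy exports; Company differs between the two frames.
first = pd.DataFrame({
    "First Name": ["Ann"], "Last Name": ["Lee"],
    "Company": ["TS"], "Emails": ["ann@example.com"],
    "Hosted Meetings 03112016": [1.0],
})
second = pd.DataFrame({
    "First Name": ["Ann", None], "Last Name": ["Lee", None],
    "Company": ["Other", None], "Emails": ["ann@example.com", None],
    "Hosted Meetings 04122016": [2.0, None],
})

# Remove the row in which every value is missing.
second = second.dropna(how="all", axis=0)

merged = functools.reduce(mergefunc, [first, second])
print(merged.columns.tolist())
print(merged.loc[0, "Company"])
```

With these inputs the merged frame has a single row and a single Company column, and that column holds TS, the value from the first frame; the "Other" value from the second frame is discarded.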

Alternatively, if the left and right DataFrames have the same values for the Company et al. columns, then another option would be to simply merge on those columns too:

def mergefunc(left, right):
    cols = ['First Name', 'Last Name', 'Company', 'Emails', 'Created', 
            'Facilitated Meetings', 'Attended Meetings']
    result = pd.merge(left, right, on=cols, how='outer')
    return result
all_users_sheets_hosts = functools.reduce(mergefunc, all_users_sheets_hosts)
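The condition that the shared columns hold the same values matters here. A minimal sketch with made-up data (the names, companies and the shortened COLS list are assumptions, not taken from the real files) shows what an outer merge on all of those columns does when one value disagrees:

```python
import pandas as pd

# Shortened, hypothetical key list: the name columns plus one shared column.
COLS = ["First Name", "Last Name", "Company"]

left = pd.DataFrame({
    "First Name": ["Ann", "Bob"], "Last Name": ["Lee", "Ray"],
    "Company": ["TS", "TS"],
    "Hosted Meetings 03112016": [1.0, 0.0],
})
right = pd.DataFrame({
    "First Name": ["Ann", "Bob"], "Last Name": ["Lee", "Ray"],
    "Company": ["TS", "Other"],  # Bob's company differs from the left frame
    "Hosted Meetings 04122016": [2.0, 3.0],
})

# Merging on every shared column avoids _x/_y suffixes, but rows only
# match when all key values agree.
merged = pd.merge(left, right, on=COLS, how="outer")
print(merged)
```

Ann matches on every key and ends up in one row with both Hosted Meetings values. Bob does not match on Company, so the outer merge keeps two separate rows for him, each with NaN in one of the Hosted Meetings columns. If that kind of split is possible in the real exports, dropping the columns from the right DataFrame is the safer choice.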

Source: https://stackoverflow.com/questions/36944960/dropping-nan-rows-certain-columns-in-specific-excel-files-using-glob-merge

Tags

python

pandas

glob