Python/Pandas: Loop to append Excel files ONLY if each Excel file contains certain values

问题

I have the following code:

dfs = []
    for f in files_xlsx:
            city_name = pd.read_excel(f, "1. City", nrows=1, parse_cols="C", header=None, skiprows=1)
            country_code = pd.read_excel(f, "1. City", nrows=1, parse_cols="C", header=None, skiprows=2)
            data = pd.read_excel(f, "1. City", parse_cols="B:J", header=None, skiprows=8)
            data['City name'] = city_name.iat[0,0]
            data['City code'] = country_code.iat[0,0]
            dfs.append(data)

    df = pd.concat(dfs, ignore_index=True)

I would like to run the loop if and only if each Excel file contains the values 91, 92, 93, 94, 95, 96, 97 in the following location (column and row combination) D8, E8, F8, G8, H8, I8, J8. The loop should only run if this condition is met across all files.

All my Excel files have the same format in theory. In practice, they often don't so I want to run a check before appending them. It would be great if the code could tell me which file does not meet the above condition. Thank you.

Edit:

In[1]: data

Out[1]:

0   1   2   3   4   5   6   7   8   City name   City code
0   x   x   x   x   x   x   x   x       x           x
1   x   x   x   x   x   x   x   x       x           x
2   x   x   x   x   x   x   x   x       x           x

回答1:

Consider building a helper data frame that resembles Excel data but with values matched to specification (row 8 and columns D-J with those specific values). Then in loop iteratively merge against this helper data frame and if it returns a match, conditionally append to your list of data frames.

NOTE: Adjust columns to actual column names, replacing list('ABCDEFGHIJ') with list of names, ['col1','col2','col3',...], in DataFrame() and merge() calls.

check_df = pd.concat([pd.DataFrame([[0]*9 for _ in range(7)], 
                                    columns=['Heading code','Heading name','91','92','93','94','95','96','97']),
                      pd.DataFrame([[0,0,91,92,93,94,95,96,97]], 
                                   columns=['Heading code','Heading name','91','92','93','94','95','96','97'])],
                      ignore_index=True).reset_index()

print(check_df)    
#    index  Heading code  Heading name  91  92  93  94  95  96  97
# 0      0             0             0   0   0   0   0   0   0   0
# 1      1             0             0   0   0   0   0   0   0   0
# 2      2             0             0   0   0   0   0   0   0   0
# 3      3             0             0   0   0   0   0   0   0   0
# 4      4             0             0   0   0   0   0   0   0   0
# 5      5             0             0   0   0   0   0   0   0   0
# 6      6             0             0   0   0   0   0   0   0   0
# 7      7             0             0  91  92  93  94  95  96  97

dfs = []
for f in files_xlsx:
   city_name = pd.read_excel(f, "1. City", nrows=1, parse_cols="C", header=None, skiprows=1)
   country_code = pd.read_excel(f, "1. City", nrows=1, parse_cols="C", header=None, skiprows=2)
   data = pd.read_excel(f, "1. City", parse_cols="B:J", header=None, skiprows=8)\
            .assign(city_name=city_name.iat[0,0], city_code=country_code.iat[0,0])
   data.columns = ['Heading code','Heading name','91','92','93','94','95','96','97','City name','City code']

   # INNER JOIN MERGE ON INDEX AND COLS, D-J
   tmp = data.reset_index().merge(check_df, on=['index','91','92','93','94','95','96','97'])    

   # CONDITIONALLY APPEND
   if len(tmp) > 0:
        dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

Below demonstrates with random data

np.random.seed(82118)
# LIST OF FIVE DATAFRAMES (TO RESEMBLE EXCEL DFs)
rand_dfs = [pd.DataFrame([np.random.randint(1, 100, 9) for _ in range(10)],
                    columns=['Heading code','Heading name','91','92','93','94','95','96','97'])
            for _ in range(5)]

# UPDATE TWO DATAFRAMES EACH WITH 10 COLS TO INCLUDE MATCHING 8TH ROW
rand_dfs[2].loc[7] = [0, 0, 91, 92, 93, 94, 95, 96, 97]    
rand_dfs[4].loc[7] = [0, 0, 91, 92, 93, 94, 95, 96, 97]

final_dfs = []
for d in rand_dfs:
    tmp = d.reset_index().merge(check_df, on=['index','91','92','93','94','95','96','97'])

    if len(tmp) > 0: 
        final_dfs.append(d)

final_df = pd.concat(final_dfs, ignore_index=True)

Output (see 8th and 17th rows with matching criteria)

print(final_df)

#     Heading code  Heading name  91  92  93  94  95  96  97
# 0             53            98  67   8  86  33  65  56  62
# 1             61             9  40  14  18   9  53  30  24
# 2             89            88  80  91  91  49   8  39  84
# 3             15            99  49  92  63  96  11  95  29
# 4             13            62  82  12  34  92  54  29  47
# 5             44            18  67  61  52  71  52  25  12
# 6             56            25  52  10  82  12  59  63  15
# 7              0             0  91  92  93  94  95  96  97
# 8             51            50  27  38  34  11  57  92   3
# 9             49            99  46  87  46   5  63  24   8
# 10            31            62   8  23  19  66  60  10  66
# 11            51            98  30  44  45  39  32  74  82
# 12            88            19  54  28  38  71   3  31  34
# 13            58            13  89  17  96  35  12  52  85
# 14            93            67  13  13  28  43  24   7   4
# 15            34            26  73  20  44  37  18  17  22
# 16            59             1  99   9  11   6   4  99  95
# 17             0             0  91  92  93  94  95  96  97
# 18            88             6  23  20  35  26  37  56  51
# 19            21            67  19  63  77  98  41   9  22

来源：https://stackoverflow.com/questions/51938430/python-pandas-loop-to-append-excel-files-only-if-each-excel-file-contains-certa

标签

python

pandas

numpy

openpyxl

pyexcel