Iterate through a list of dataframes to drop particular rows Pandas

抹茶落季 2021-01-25 05:13

In my previous question, I asked how to drop particular rows in Pandas.

With help, I was able to drop rows from before 1980. The 'Season' column (that had the years) were i

2 Answers
  •  别那么骄傲
    2021-01-25 05:49

    You need to create a new list of filtered DataFrames, or reassign the old one:

    Note: Don't name the variable list, because that shadows the Python builtin.

    L = [df[df['Season'].str.split('-').str[0].astype(int) > 1980] for df in L]
    

    Loop version:

    output = []
    for df in L:
        df = df[df['Season'].str.split('-').str[0].astype(int) > 1980]
        output.append(df)
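
The reason a new list is needed can be seen in a small, self-contained sketch. The sample frames and the names boston, chicago, frames and start_year are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical sample data standing in for the per-city sheets.
boston = pd.DataFrame({'Season': ['2018-19', '1985-86', '1979-80', '1960-61']})
chicago = pd.DataFrame({'Season': ['2018-19', '2017-18', '1975-76']})
frames = [boston, chicago]

def start_year(df):
    # '2018-19' -> 2018: take the part before the hyphen and convert to int.
    return df['Season'].str.split('-').str[0].astype(int)

# Boolean indexing returns a new DataFrame, so the results are collected
# into a new list; the frames in the original list are left as they were.
filtered = [df[start_year(df) > 1980] for df in frames]

print([len(df) for df in frames])    # [4, 3]
print([len(df) for df in filtered])  # [2, 2]
```

Assigning to df inside a plain for loop only rebinds the loop variable, which is why the loop version appends each result to output.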
    

    If you need to extract only the first 4-digit integer:

    L = [df, df]
    L = [df[df['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980]
         for df in L]
    
    print (L)
    [    Season
    0  2018-19
    1  2017-18,     Season
    0  2018-19
    1  2017-18]
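
The cast to float (rather than int) matters when some rows contain no year at all. A minimal sketch, using a made-up column that mixes seasons with stray text:

```python
import pandas as pd

# Hypothetical column mixing season strings with stray text rows.
df = pd.DataFrame({'Season': ['2018-19', 'This', 'till 1960', '1985-86']})

# Rows without a 4-digit run become NaN, which float can hold and int cannot.
years = df['Season'].str.extract(r'(\d{4})', expand=False).astype(float)

# NaN > 1980 evaluates to False, so rows with no year are dropped as well.
kept = df[years > 1980]
print(kept['Season'].tolist())  # ['2018-19', '1985-86']

# The split-based version fails on the same data, since 'This' is not an integer.
try:
    df['Season'].str.split('-').str[0].astype(int)
except ValueError as exc:
    print('split + astype(int) failed:', type(exc).__name__)
```

So the split version is fine for clean columns, while the extract version also tolerates header-like or free-text rows.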
    

    EDIT:

    If the data have the same structure, I suggest creating one big DataFrame with a new column to distinguish the cities:

    import glob
    import os

    import pandas as pd
    
    files = glob.glob('files/*.csv')
    dfs = [pd.read_csv(fp).assign(City=os.path.basename(fp).split('.')[0]) for fp in files]
    df = pd.concat(dfs, ignore_index=True)
    print (df)
              Season           City
    0        2018-19   Boston_Sheet
    1           This   Boston_Sheet
    2  list would go   Boston_Sheet
    3      till 1960   Boston_Sheet
    4        2018-19  Chicago_Sheet
    5        2017-18  Chicago_Sheet
    6           This  Chicago_Sheet
    
    df1 = df[df['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980]
    print (df1)
         Season           City
    0   2018-19   Boston_Sheet
    4   2018-19  Chicago_Sheet
    5   2017-18  Chicago_Sheet
    
    df2 = df1[df1['City'] == 'Boston_Sheet']
    print (df2)
        Season          City
    0  2018-19  Boston_Sheet
    
    df3 = df1[df1['City'] == 'Chicago_Sheet']
    print (df3)
         Season           City
    4   2018-19  Chicago_Sheet
    5   2017-18  Chicago_Sheet
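
If there are many cities, writing one filter line per name gets tedious. One possible alternative (not part of the original answer) is to let groupby do the split; the frame below is a hand-built stand-in for the filtered result, and by_city is a hypothetical name:

```python
import pandas as pd

# Hypothetical combined frame, shaped like the filtered result above.
df1 = pd.DataFrame({
    'Season': ['2018-19', '2018-19', '2017-18'],
    'City': ['Boston_Sheet', 'Chicago_Sheet', 'Chicago_Sheet'],
})

# Iterating a groupby yields (key, sub-frame) pairs, one per city.
by_city = {city: group for city, group in df1.groupby('City')}

print(sorted(by_city))                              # ['Boston_Sheet', 'Chicago_Sheet']
print(by_city['Chicago_Sheet']['Season'].tolist())  # ['2018-19', '2017-18']
```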
    

    If you need each DataFrame kept separate, you can use a dictionary of DataFrames:

    import glob
    import os

    import pandas as pd
    
    files = glob.glob('files/*.csv')
    dfs_dict = {os.path.basename(fp).split('.')[0] : pd.read_csv(fp) for fp in files}
    
    print (dfs_dict)
    
    print (dfs_dict['Boston_Sheet'])
              Season
    0        2018-19
    1           This
    2  list would go
    3      till 1960
    
    print (dfs_dict['Chicago_Sheet'])
         Season
    0   2018-19
    1   2017-18
    2      This
    

    Then filter each DataFrame in a dictionary comprehension:

    dfs_dict = {k: v[v['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980]
                for k, v in dfs_dict.items()}
    print (dfs_dict)
    {'Boston_Sheet':     Season
    0  2018-19, 'Chicago_Sheet':      Season
    0   2018-19
    1   2017-18}
    
    print (dfs_dict['Boston_Sheet'])
        Season
    0  2018-19
    
    print (dfs_dict['Chicago_Sheet'])
         Season
    0   2018-19
    1   2017-18
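
To try the dictionary approach without CSV files on disk, the sketch below reads the same sample rows from in-memory strings. The raw mapping and the after_year helper are assumptions made for illustration; only the filtering expression comes from the answer:

```python
import io

import pandas as pd

# Hypothetical in-memory CSVs standing in for the files under files/.
raw = {
    'Boston_Sheet': 'Season\n2018-19\nThis\nlist would go\ntill 1960\n',
    'Chicago_Sheet': 'Season\n2018-19\n2017-18\nThis\n',
}
dfs_dict = {name: pd.read_csv(io.StringIO(text)) for name, text in raw.items()}

def after_year(df, year=1980):
    # Keep rows whose first 4-digit number in 'Season' is greater than `year`.
    first_year = df['Season'].str.extract(r'(\d{4})', expand=False).astype(float)
    return df[first_year > year]

filtered = {name: after_year(df) for name, df in dfs_dict.items()}
for name, df in filtered.items():
    print(name, df['Season'].tolist())
```

Pulling the condition into a small function keeps the comprehension short and makes the cutoff year easy to change.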
    
