Iterate through a list of DataFrames to drop particular rows in Pandas

抹茶落季 2021-01-25 05:13

In my previous question, I asked how to drop particular rows in Pandas.

With help, I was able to drop the rows from before 1980. The 'Season' column (which holds the years) stores values like '2018-19'. How can I apply the same filter to every DataFrame in a list?

2 Answers
  • 2021-01-25 05:49

    You need to create a new list of filtered DataFrames or reassign the old one:

    Note: don't name the variable list, because it shadows the Python builtin (a small illustration follows the snippet below).

    L = [df[df['Season'].str.split('-').str[0].astype(int) > 1980] for df in L]
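
    To illustrate the note above, a tiny sketch (hypothetical) of what breaks once the builtin list is shadowed:

    # Shadowing the builtin: after this assignment, the name list refers to this object
    list = [df, df]
    # list('abc')   # would now raise TypeError: 'list' object is not callable
    del list        # removes the shadowing name so the builtin is visible again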
    

    Loop version:

    output = []
    for df in L:
        df = df[df['Season'].str.split('-').str[0].astype(int) > 1980]
        output.append(df)
    

    If you need to extract only the first 4-digit number:

    L = [df, df]
    L = [df[df['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980]
              for df in L]
    
    print (L)
    [    Season
    0  2018-19
    1  2017-18,     Season
    0  2018-19
    1  2017-18]
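
    For reference, a self-contained version of the snippet above; the sample Season values are assumptions pieced together from the outputs printed in this answer:

    import pandas as pd

    # Sample data assumed from the outputs shown above
    df = pd.DataFrame({'Season': ['2018-19', '2017-18', 'This', 'list would go', 'till 1960']})
    L = [df, df]

    # str.extract pulls the first 4-digit number; rows without one become NaN,
    # and NaN > 1980 is False, so those rows are dropped as well
    L = [d[d['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980] for d in L]
    print(L)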
    

    EDIT:

    If the data have the same structure, I suggest creating one big DataFrame with a new column to distinguish the cities:

    import glob
    import os
    import pandas as pd
    
    files = glob.glob('files/*.csv')
    dfs = [pd.read_csv(fp).assign(City=os.path.basename(fp).split('.')[0]) for fp in files]
    df = pd.concat(dfs, ignore_index=True)
    print (df)
              Season           City
    0        2018-19   Boston_Sheet
    1           This   Boston_Sheet
    2  list would go   Boston_Sheet
    3      till 1960   Boston_Sheet
    4        2018-19  Chicago_Sheet
    5        2017-18  Chicago_Sheet
    6           This  Chicago_Sheet
    
    df1 = df[df['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980]
    print (df1)
         Season           City
    0   2018-19   Boston_Sheet
    4   2018-19  Chicago_Sheet
    5   2017-18  Chicago_Sheet
    
    df2 = df1[df1['City'] == 'Boston_Sheet']
    print (df2)
        Season          City
    0  2018-19  Boston_Sheet
    
    df3 = df1[df1['City'] == 'Chicago_Sheet']
    print (df3)
         Season           City
    4   2018-19  Chicago_Sheet
    5   2017-18  Chicago_Sheet
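
    If you then need every per-city subset at once, a short sketch (assuming the df1 and City column from above) is a dictionary comprehension over groupby:

    # One DataFrame per city, keyed by the City value
    by_city = {city: sub for city, sub in df1.groupby('City')}
    print (by_city['Boston_Sheet'])
    print (by_city['Chicago_Sheet'])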
    

    If you need each DataFrame kept separate, you can use a dictionary of DataFrames:

    import glob
    import os
    import pandas as pd
    
    files = glob.glob('files/*.csv')
    dfs_dict = {os.path.basename(fp).split('.')[0] : pd.read_csv(fp) for fp in files}
    
    print (dfs_dict)
    
    print (dfs_dict['Boston_Sheet'])
              Season
    0        2018-19
    1           This
    2  list would go
    3      till 1960
    
    print (dfs_dict['Chicago_Sheet'])
         Season
    0   2018-19
    1   2017-18
    2      This
    

    Then process them in a dictionary comprehension:

    dfs_dict = {k: v[v['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980]
                     for k, v in dfs_dict.items()}
    print (dfs_dict)
    {'Boston_Sheet':     Season
    0  2018-19, 'Chicago_Sheet':      Season
    0   2018-19
    1   2017-18}
    
    print (dfs_dict['Boston_Sheet'])
        Season
    0  2018-19
    
    print (dfs_dict['Chicago_Sheet'])
         Season
    0   2018-19
    1   2017-18
    
  • 2021-01-25 06:01

    If you want to modify the list in place:

    for index in range(len(df_list)):
        df_list[index] = df_list[index].loc[df_list[index]['Season'].str.split('-').str[0].astype(int) > 1980]
    

    When you loop over the list object itself, the loop variable is just a reference to each element; rebinding it inside the loop does not touch the list, so the filtered DataFrame you build is discarded on the next iteration.

    If you instead loop over the indices (using the length of the list) and assign back through the index, you modify the list itself rather than the temporary reference you get with for some_copy_item in df_list.


    Minimal example:

        arr = [1, 2, 3, 4, 5]
        print(arr) # [1, 2, 3, 4, 5]
    
        for number in arr:
            number += 1
        print(arr) # [1, 2, 3, 4, 5]
    
        for idx in range(len(arr)):
            arr[idx] += 1
        print(arr) # [2, 3, 4, 5, 6]
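
    The same index-based idea applied back to the DataFrame list reads a bit more cleanly with enumerate; a sketch assuming the df_list and Season format used above:

        for i, df in enumerate(df_list):
            # Assigning through the index mutates df_list itself,
            # unlike rebinding the loop variable df
            df_list[i] = df[df['Season'].str.split('-').str[0].astype(int) > 1980]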
    