How can I add a new column, without a length ValueError, for rows that match a where/mask condition after groupby in Python/pandas?

悲哀的现实 2021-01-26 08:43

I am trying to filter rows inside a large DataFrame (1,400,000 rows).
Here is a very short, simplified sample of the data:

a      b        c       dt         


        
1 Answer
  • 2021-01-26 09:07

    If you only want a new column that flags the rows matching the mask, as @Quang Hoang suggested, you could try this:

    import pandas as pd
    import io
    s_e='''
    a      b        c       dt                   e
    35   0.1      234   2020/6/15 14:27:00       0
    1    0.1      554   2020/6/15 15:28:00       1
    2    0.2      654   2020/6/15 16:29:00       0
    23   0.4      2345  2020/6/15 17:26:00       0
    34   0.8      245   2020/6/15 18:25:00       0
    8    0.9      123   2020/6/15 18:26:00       0
    7    0.1      22    2020/6/15 18:27:00       0
    2    0.3      99    2020/6/15 18:28:00       0
    219  0.2      17    2020/6/15 19:26:00       0
    '''
    # Raw string for the regex separator: columns are split on 2+ whitespace characters
    df = pd.read_csv(io.StringIO(s_e), sep=r'\s\s+', parse_dates=[3], engine='python')
    print(df)
    # First set of conditions: 10 < a < 25, or b > 0.2, or c > 500
    mask = (df['a'].lt(25) & df['a'].gt(10)) | df['b'].gt(0.2) | df['c'].gt(500)
    # ...and only rows where e == 0
    mask = mask & df['e'].eq(0)
    
    # Quang Hoang's recommendation: cast the boolean mask to 0/1
    df['indicator'] = mask.astype(int)
    
    print(df)
    

    Output:

    df
         a    b     c                  dt  e  indicator
    0   35  0.1   234 2020-06-15 14:27:00  0          0
    1    1  0.1   554 2020-06-15 15:28:00  1          0
    2    2  0.2   654 2020-06-15 16:29:00  0          1
    3   23  0.4  2345 2020-06-15 17:26:00  0          1
    4   34  0.8   245 2020-06-15 18:25:00  0          1
    5    8  0.9   123 2020-06-15 18:26:00  0          1
    6    7  0.1    22 2020-06-15 18:27:00  0          0
    7    2  0.3    99 2020-06-15 18:28:00  0          1
    8  219  0.2    17 2020-06-15 19:26:00  0          0
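
The question's own code is not shown above, so the following is only a guess at where a length-related `ValueError` might come from. It is a minimal sketch on a made-up four-row frame (the column names `bad` and `c_if_match` are hypothetical): assigning the raw values of a filtered subset fails because there are fewer values than rows, while a full-length mask, or a filtered Series that pandas can align on the index, assigns cleanly.

```python
import pandas as pd

# Tiny illustrative frame, not the question's data
df = pd.DataFrame({"a": [35, 1, 2, 23], "c": [234, 554, 654, 2345]})
mask = df["c"].gt(500)  # boolean Series with the same length and index as df

# Raw values of the filtered subset: 3 values for 4 rows, so pandas raises ValueError
try:
    df["bad"] = df.loc[mask, "c"].to_numpy()
    length_error = None
except ValueError as exc:
    length_error = str(exc)

# The full-length mask lines up row by row, so this assignment works
df["indicator"] = mask.astype(int)

# A filtered Series also works: pandas aligns on the index and fills NaN elsewhere
df["c_if_match"] = df.loc[mask, "c"]

print(length_error)
print(df)
```

The design point is that the mask is computed on the whole frame, so anything derived from it keeps the frame's index and length.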
    

    If you want to flag only the rows that match the mask and also hold the minimum `c` value within each 30-minute window, you could try:

    import pandas as pd
    import io
    s_e='''
    a      b        c       dt                   e
    35   0.1      234   2020/6/15 14:27:00       0
    1    0.1      554   2020/6/15 15:28:00       1
    2    0.2      654   2020/6/15 16:29:00       0
    23   0.4      2345  2020/6/15 17:26:00       0
    34   0.8      245   2020/6/15 18:25:00       0
    8    0.9      123   2020/6/15 18:26:00       0
    7    0.1      22    2020/6/15 18:27:00       0
    2    0.3      99    2020/6/15 18:28:00       0
    219  0.2      17    2020/6/15 19:26:00       0
    '''
    # Raw string for the regex separator: columns are split on 2+ whitespace characters
    df = pd.read_csv(io.StringIO(s_e), sep=r'\s\s+', parse_dates=[3], engine='python')
    print(df)
    # First set of conditions: 10 < a < 25, or b > 0.2, or c > 500
    mask = (df['a'].lt(25) & df['a'].gt(10)) | df['b'].gt(0.2) | df['c'].gt(500)
    # ...and only rows where e == 0
    mask = mask & df['e'].eq(0)
    # Start with every row unflagged
    df['indicator'] = 0
    
    temp = df[mask]
    # Index labels of the rows with the minimum `c` in each 30-minute window
    c_min = temp.groupby(temp['dt'].dt.floor('30min'))['c'].idxmin()
    
    # Final df: flag only those rows, using their original index labels
    df.loc[c_min, 'indicator'] = 1
    print(df)
    

    Output:

    df
         a    b     c                  dt  e  indicator
    0   35  0.1   234 2020-06-15 14:27:00  0          0
    1    1  0.1   554 2020-06-15 15:28:00  1          0
    2    2  0.2   654 2020-06-15 16:29:00  0          1
    3   23  0.4  2345 2020-06-15 17:26:00  0          1
    4   34  0.8   245 2020-06-15 18:25:00  0          0
    5    8  0.9   123 2020-06-15 18:26:00  0          0
    6    7  0.1    22 2020-06-15 18:27:00  0          0
    7    2  0.3    99 2020-06-15 18:28:00  0          1
    8  219  0.2    17 2020-06-15 19:26:00  0          0
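
The second approach can also be wrapped in a small helper so it is easy to reuse with other columns or window sizes. This is a sketch under assumptions: the function name `flag_min_per_window` and its parameters are hypothetical, and the five-row frame is a cut-down stand-in for the sample data, with a simpler mask.

```python
import pandas as pd

def flag_min_per_window(df, mask, value_col="c", time_col="dt", freq="30min"):
    """Return a 0/1 Series marking, per time window, the masked row with the smallest value."""
    indicator = pd.Series(0, index=df.index)
    subset = df[mask]
    if subset.empty:
        return indicator  # nothing matches the mask, so nothing is flagged
    # Bucket each matching row into its time window
    windows = subset[time_col].dt.floor(freq)
    # idxmin returns the original index label of the smallest value in each window
    winners = subset.groupby(windows)[value_col].idxmin()
    indicator.loc[winners.to_numpy()] = 1
    return indicator

df = pd.DataFrame({
    "c": [654, 2345, 245, 123, 99],
    "dt": pd.to_datetime([
        "2020-06-15 16:29:00",
        "2020-06-15 17:26:00",
        "2020-06-15 18:25:00",
        "2020-06-15 18:26:00",
        "2020-06-15 18:28:00",
    ]),
})
mask = df["c"].lt(1000)  # illustrative condition only
df["indicator"] = flag_min_per_window(df, mask)
print(df)
```

Because the result is built on `df.index`, it always has the same length as the frame. Note that `idxmin` picks only the first row when several rows tie for the minimum within a window.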
    