Select DataFrame rows between two dates

挽巷 2020-11-22 03:14

I am creating a DataFrame from a csv as follows:

stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)

The DataFra

10 answers
  • 2020-11-22 03:45

    You can also use between:

    df[df.some_date.between(start_date, end_date)]
    
  • 2020-11-22 03:53

    As of pandas 0.22.0 (the version I tested), you can answer this question more simply, and with more readable code, by using between.

    # create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
    df = pd.DataFrame({'dates':pd.date_range('2018-01-01','2019-01-01')})
    

    Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:

    # use the between statement to get a boolean mask
    df['dates'].between('2018-11-27','2019-01-15', inclusive=False)
    
    0    False
    1    False
    2    False
    3    False
    4    False
    
    # you can pass this boolean mask straight to loc
    df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]
    
        dates
    331 2018-11-28
    332 2018-11-29
    333 2018-11-30
    334 2018-12-01
    335 2018-12-02
    

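    Notice the inclusive argument. It is very helpful when you want to be explicit about your range. When it is set to True, Nov 27th 2018 is returned as well. (Since pandas 1.3, inclusive takes the strings 'both', 'neither', 'left' or 'right'; boolean values are deprecated there and no longer accepted in pandas 2.0.)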

    df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
    
        dates
    330 2018-11-27
    331 2018-11-28
    332 2018-11-29
    333 2018-11-30
    334 2018-12-01
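
For readers on newer pandas, here is a minimal sketch of the same selection using the string form of inclusive. It assumes pandas 1.3 or later; the shorter end date and the names closed and open_ are illustrative choices, not part of the answer above.

```python
import pandas as pd

# Same single-column frame as in the answer: one row per day
df = pd.DataFrame({'dates': pd.date_range('2018-01-01', '2019-01-01')})

# inclusive='both' keeps both boundary dates (assumes pandas >= 1.3)
closed = df.loc[df['dates'].between('2018-11-27', '2018-11-30', inclusive='both')]

# inclusive='neither' drops both boundary dates
open_ = df.loc[df['dates'].between('2018-11-27', '2018-11-30', inclusive='neither')]

print(len(closed), len(open_))
```

'left' and 'right' keep only the corresponding boundary, which may be handy for half-open ranges.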
    

    This method is also faster than the previously mentioned isin method:

    %%timeit -n 5
    df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
    868 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
    
    
    %%timeit -n 5
    
    df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
    1.53 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
    

    However, it is not faster than the currently accepted answer, provided by unutbu, when the mask has already been created. If the mask is dynamic and has to be rebuilt over and over, my method may be more efficient:

    # create the mask first, THEN time the selection
    import datetime as dt
    
    start_date = dt.datetime(2018,11,27)
    end_date = dt.datetime(2019,1,15)
    mask = (df['dates'] > start_date) & (df['dates'] <= end_date)
    
    %%timeit -n 5
    df.loc[mask]
    191 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
    
  • 2020-11-22 03:56

    I feel the best option is to use direct comparisons rather than the loc function:

    df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]
    

    It works for me.

    The major issue with slicing through loc is that the limits may have to be present among the actual index values (for example, when the index is not sorted); if they are not, a KeyError is raised.
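
A minimal sketch of two ways around that limitation, on hypothetical data whose date index is unsorted and does not contain the requested bounds: a boolean mask, which does not depend on index order, and a label slice taken after sort_index(). The column name value and the dates are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: unsorted dates, none of which equals a slice bound
idx = pd.to_datetime(['2000-06-09', '2000-06-03', '2000-06-12', '2000-06-05'])
df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=idx)

# Boolean mask: works regardless of index order or missing bounds
by_mask = df[(df.index > '2000-06-01') & (df.index <= '2000-06-10')]

# Label slice: sort the index first so that absent bounds are accepted
by_slice = df.sort_index().loc['2000-06-02':'2000-06-10']

print(by_mask)
print(by_slice)
```

Both selections contain the same three rows; the mask keeps the original row order, while the slice returns them in date order.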

  • 2020-11-22 03:57

    Another option is the pandas.DataFrame.query() method. Here is an example on the following DataFrame called df.

    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
    >>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
    >>> print(df)
          col_1       date
    0  0.015198 2020-01-01
    1  0.638600 2020-01-02
    2  0.348485 2020-01-03
    3  0.247583 2020-01-04
    4  0.581835 2020-01-05
    

    Pass the filter condition as a string argument, like this:

    >>> start_date, end_date = '2020-01-02', '2020-01-04'
    >>> print(df.query('date >= @start_date and date <= @end_date'))
          col_1       date
    1  0.638600 2020-01-02
    2  0.348485 2020-01-03
    3  0.247583 2020-01-04
    

    If you do not want to include the boundaries, change the condition as follows:

    >>> print(df.query('date > @start_date and date < @end_date'))
          col_1       date
    2  0.348485 2020-01-03
    
  • 2020-11-22 03:59

    To keep the solution simple and Pythonic, I suggest trying the following.

    If you are going to do this frequently, the best solution is to first set the date column as the index, which turns it into a DatetimeIndex (provided the column holds datetimes), and then use the following condition to select any range of dates.

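    import pandas as pd
    
    # data_frame is assumed to exist already, with a datetime-typed 'date' column
    data_frame = data_frame.set_index('date')
    
    # keep rows with 2017-08-10 < index <= 2017-08-15
    df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]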
    
  • 2020-11-22 04:00

    There are two possible solutions:

    • Use a boolean mask, then use df.loc[mask]
    • Set the date column as a DatetimeIndex, then use df.loc[start_date : end_date]

    Using a boolean mask:

    Ensure df['date'] is a Series with dtype datetime64[ns]:

    df['date'] = pd.to_datetime(df['date'])  
    

    Make a boolean mask. start_date and end_date can be datetime.datetimes, np.datetime64s, pd.Timestamps, or even datetime strings:

    # greater than the start date and smaller than or equal to the end date
    mask = (df['date'] > start_date) & (df['date'] <= end_date)
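
As a quick check of the claim about bound types, the sketch below builds the same mask from each of the four kinds of bound and counts the selected rows. The frame and the specific dates are illustrative assumptions.

```python
import datetime as dt

import numpy as np
import pandas as pd

# Illustrative frame: 200 consecutive days starting 2000-01-01
df = pd.DataFrame({'date': pd.date_range('2000-01-01', periods=200, freq='D')})

# The same pair of bounds expressed in four different types
bounds = [
    (dt.datetime(2000, 6, 1), dt.datetime(2000, 6, 10)),
    (np.datetime64('2000-06-01'), np.datetime64('2000-06-10')),
    (pd.Timestamp('2000-06-01'), pd.Timestamp('2000-06-10')),
    ('2000-06-01', '2000-06-10'),
]

# Exclusive start, inclusive end, exactly as in the mask above
masks = [(df['date'] > start) & (df['date'] <= end) for start, end in bounds]
counts = [int(m.sum()) for m in masks]
print(counts)
```

Each mask selects the same rows, 2000-06-02 through 2000-06-10.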
    

    Select the sub-DataFrame:

    df.loc[mask]
    

    or reassign the result to df:

    df = df.loc[mask]
    

    For example,

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(np.random.random((200,3)))
    df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
    mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
    print(df.loc[mask])
    

    yields

                0         1         2       date
    153  0.208875  0.727656  0.037787 2000-06-02
    154  0.750800  0.776498  0.237716 2000-06-03
    155  0.812008  0.127338  0.397240 2000-06-04
    156  0.639937  0.207359  0.533527 2000-06-05
    157  0.416998  0.845658  0.872826 2000-06-06
    158  0.440069  0.338690  0.847545 2000-06-07
    159  0.202354  0.624833  0.740254 2000-06-08
    160  0.465746  0.080888  0.155452 2000-06-09
    161  0.858232  0.190321  0.432574 2000-06-10
    

    Using a DatetimeIndex:

    If you are going to do a lot of selections by date, it may be quicker to set the date column as the index first. Then you can select rows by date using df.loc[start_date:end_date].

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(np.random.random((200,3)))
    df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
    df = df.set_index(['date'])
    print(df.loc['2000-6-1':'2000-6-10'])
    

    yields

                       0         1         2
    date                                    
    2000-06-01  0.040457  0.326594  0.492136    # <- includes start_date
    2000-06-02  0.279323  0.877446  0.464523
    2000-06-03  0.328068  0.837669  0.608559
    2000-06-04  0.107959  0.678297  0.517435
    2000-06-05  0.131555  0.418380  0.025725
    2000-06-06  0.999961  0.619517  0.206108
    2000-06-07  0.129270  0.024533  0.154769
    2000-06-08  0.441010  0.741781  0.470402
    2000-06-09  0.682101  0.375660  0.009916
    2000-06-10  0.754488  0.352293  0.339337
    

    Python list indexing, e.g. seq[start:end], includes start but not end. In contrast, pandas df.loc[start_date : end_date] includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.
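
A small sketch of both points, using a hypothetical sorted index that holds only every other day: one slice whose end-points are in the index, and one whose bounds are absent from it.

```python
import pandas as pd

# Hypothetical index: 2000-06-01, 06-03, 06-05, ... (every second day)
idx = pd.date_range('2000-06-01', periods=10, freq='2D')
df = pd.DataFrame({'value': range(10)}, index=idx)

# Both end-points exist in the index and both are returned
both = df.loc['2000-06-03':'2000-06-07']

# Neither bound exists in the index; the slice still works on a sorted index
absent = df.loc['2000-06-02':'2000-06-08']

print(both)
print(absent)
```

Both slices return the rows for 2000-06-03, 2000-06-05 and 2000-06-07.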


    Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
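
A minimal sketch of that route, reading an in-memory CSV so that it is self-contained. The column names date and close and the sample rows are assumptions for illustration, not the asker's actual file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the file on disk
csv_text = "date,close\n2000-06-01,10.0\n2000-06-05,11.5\n2000-06-12,12.0\n"

# parse_dates converts the 'date' column to datetimes while reading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], skipinitialspace=True)

# The boolean mask then works without a separate pd.to_datetime call
mask = (df['date'] > '2000-06-01') & (df['date'] <= '2000-06-10')
selected = df.loc[mask]
print(selected)
```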
