Pandas: how to use pd.cut()

时光取名叫无心 2020-12-02 17:28

Here is the snippet:

import pandas as pd
test = pd.DataFrame({'days': [0,31,45]})
test['range'] = pd.cut(test.days, [0,30,60])

Output:

   days     range
0     0       NaN
1    31  (30, 60]
2    45  (30, 60]

5 answers
  • 2020-12-02 17:42

    You can also pass labels to pd.cut(). The following example uses student grades in the range 0-10 and adds a new column, 'grade_cat', that categorizes them.

    bins defines the interval edges: with the default right=True, [0, 4, 6, 10] yields the intervals (0, 4], (4, 6] and (6, 10]. labels gives the corresponding names: "poor", "normal" and "excellent".

    bins = [0, 4, 6, 10]
    labels = ["poor","normal","excellent"]
    student['grade_cat'] = pd.cut(student['grade'], bins=bins, labels=labels)
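
The snippet above does not show its `student` frame, so here is a minimal runnable sketch with made-up grades (the data and the extra `grade_cat_incl` column are assumptions for illustration only). It also shows that a grade of exactly 0 falls outside the first interval (0, 4] unless `include_lowest=True` is passed.

```python
import pandas as pd

# Hypothetical sample data; the original answer does not show its `student` frame.
student = pd.DataFrame({"grade": [0, 3, 4, 5, 6, 9, 10]})

bins = [0, 4, 6, 10]
labels = ["poor", "normal", "excellent"]

# Default right=True: the intervals are (0, 4], (4, 6], (6, 10].
student["grade_cat"] = pd.cut(student["grade"], bins=bins, labels=labels)

# include_lowest=True also places a grade of exactly 0 in the first bin.
student["grade_cat_incl"] = pd.cut(
    student["grade"], bins=bins, labels=labels, include_lowest=True
)

print(student)
```

Without `include_lowest=True`, the row with grade 0 gets a missing value in `grade_cat`.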
    
  • 2020-12-02 18:04

    A sample of how pd.cut works:

    s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])
    # Split the values into 3 equal-width bins
    pd.cut(s, 3)
    # To add labels to the bins
    pd.cut(s, 3, labels=["Small", "Medium", "Large"])
    

    This can be used directly on a range

  • @jezrael has explained almost all the use-cases of pd.cut()

    One use case I would like to add is the following:

    pd.cut(np.array([1,2,3,4,5,6]),3)

    The number of bins is determined by the second parameter, so we get the following output:

    [(0.995, 2.667], (0.995, 2.667], (2.667, 4.333], (2.667, 4.333], (4.333, 6.0], (4.333, 6.0]]
    Categories (3, interval[float64]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]
    

    Similarly, if we pass 2 as the number of bins (the second parameter), the output is:

    [(0.995, 3.5], (0.995, 3.5], (0.995, 3.5], (3.5, 6.0], (3.5, 6.0], (3.5, 6.0]]
    Categories (2, interval[float64]): [(0.995, 3.5] < (3.5, 6.0]]
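
To see where edges such as 0.995 come from, one option is to ask pd.cut for the edges it computed. The sketch below uses the documented `retbins=True` parameter; the variable names are illustrative. With an integer bin count, pandas builds equal-width bins over the data range and places the first edge slightly below the minimum, so that the minimum falls inside the first right-closed interval.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6])

# retbins=True additionally returns the bin edges pandas computed.
binned, edges = pd.cut(values, 3, retbins=True)

print(edges)         # the four edges delimiting the three bins
print(binned.codes)  # zero-based bin index assigned to each value
```

Because the input is a NumPy array, `binned` is a `Categorical`, whose `codes` attribute holds the bin index for each value.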
    
  • 2020-12-02 18:06

    See the pd.cut documentation.
    Pass the parameter right=False so that each interval is closed on the left and open on the right:

    test = pd.DataFrame({'days': [0,31,45]})
    test['range'] = pd.cut(test.days, [0,30,60], right=False)
    
    test
    
       days     range
    0     0   [0, 30)
    1    31  [30, 60)
    2    45  [30, 60)
    
  • 2020-12-02 18:08
    test['range'] = pd.cut(test.days, [0,30,60], include_lowest=True)
    print (test)
       days           range
    0     0  (-0.001, 30.0]
    1    31    (30.0, 60.0]
    2    45    (30.0, 60.0]
    

    See the difference:

    test = pd.DataFrame({'days': [0,20,30,31,45,60]})
    
    test['range1'] = pd.cut(test.days, [0,30,60], include_lowest=True)
    #30 value is in [30, 60) group
    test['range2'] = pd.cut(test.days, [0,30,60], right=False)
    #30 value is in (0, 30] group
    test['range3'] = pd.cut(test.days, [0,30,60])
    print (test)
       days          range1    range2    range3
    0     0  (-0.001, 30.0]   [0, 30)       NaN
    1    20  (-0.001, 30.0]   [0, 30)   (0, 30]
    2    30  (-0.001, 30.0]  [30, 60)   (0, 30]
    3    31    (30.0, 60.0]  [30, 60)  (30, 60]
    4    45    (30.0, 60.0]  [30, 60)  (30, 60]
    5    60    (30.0, 60.0]       NaN  (30, 60]
    

    Or use numpy.searchsorted; note that the array of bin edges (arr below) has to be sorted in ascending order:

    arr = np.array([0,30,60])
    test['range1'] = arr.searchsorted(test.days)
    test['range2'] = arr.searchsorted(test.days, side='right') - 1
    print (test)
       days  range1  range2
    0     0       0       0
    1    20       1       0
    2    30       1       1
    3    31       2       1
    4    45       2       1
    5    60       2       2
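
As a rough cross-check of the searchsorted approach, the sketch below compares it with pd.cut on deliberately unsorted values (the sample data and variable names are assumptions). With `side='right'` and the `- 1` offset, the result is the index of the left-closed bin, which should match `pd.cut(..., right=False, labels=False)` for values inside the bin range.

```python
import numpy as np
import pandas as pd

edges = np.array([0, 30, 60])          # bin edges, sorted ascending
days = pd.Series([45, 0, 31, 30, 20])  # deliberately unsorted values

# Index i of the left-closed bin [edges[i], edges[i + 1]) for each value.
idx_searchsorted = edges.searchsorted(days.to_numpy(), side="right") - 1

# labels=False makes pd.cut return integer bin indicators instead of intervals.
idx_cut = pd.cut(days, edges, right=False, labels=False)

print(idx_searchsorted.tolist())
print(idx_cut.tolist())
```

The two differ for values outside the bin range: pd.cut yields NaN there, while searchsorted still returns an integer index.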
    