Question
I have a dataframe that I want to bin (i.e., group into sub-ranges) by one column, and take the mean of the second column for each of the bins:
import pandas as pd
import numpy as np
data = pd.DataFrame(columns=['Score', 'Age'])
data.Score = [1, 1, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 2, 1, 1, 2, 1, 0, 1, 1, -1, 1, 0, 1, 1, 0, 1, 0, -2, 1]
data.Age = [29, 59, 44, 52, 60, 53, 45, 47, 57, 54, 35, 32, 48, 31, 49, 43, 67, 32, 31, 42, 37, 45, 52, 59, 56, 57, 48, 45, 56, 31]
_, bins = np.histogram(data.Age, 10)
labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
labels[0] = '{}-{}'.format(bins[0], bins[1])
binned = pd.cut(data.Age, bins=bins, labels=labels, include_lowest=True, precision=0)
df = data.groupby(binned)['Score'].mean().reset_index()
df
There are two issues with this binning:
- There is a gap of 1 between the upper bound of the (n-1)th bin and the lower bound of the nth bin (which means the binning is not continuous, and data points that lie in this gap are skipped).
- The last few bin limits have a lot of digits after the decimal point. I have used the precision=0 flag in the cut call, but it seems to be of no use: no matter what x I use in precision=x, the last few bins still have a lot of digits after the decimal point.
The second point causes a problem when, for instance, I try to plot df, where it ruins the look of the x-axis:
import matplotlib.pyplot as plt
plt.plot([str(i) for i in df.Age], df.Score, 'o-')
Why is this occurring in spite of the precision=0 flag, which I set to indicate that I want only integers as the bin limits, not floats? And how do I fix it?
I'm temporarily solving this issue by converting the bin values to ints manually:
_, bins = np.histogram(data.Age, 10)
for i in range(len(bins)):  # my fix
    bins[i] = int(bins[i])
labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
labels[0] = '{}-{}'.format(bins[0], bins[1])
binned = pd.cut(data.Age, bins=bins, labels=labels, include_lowest=True, precision=0)
df = data.groupby(binned)['Score'].mean().reset_index()
df
But this feels like a hack, and I think it should have a "proper" solution instead of a hacky fix. And although it fixed the second issue, I'm not sure if this fixes the first issue.
Answer 1:
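Both issues you mention come from a single line in your code:
labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
The gap comes from the i + 1, and the extra digits come from the floating-point approximation of the bin edges, which '{}' prints in full. The gap exists only in the label text: pd.cut still bins on the contiguous edges in bins, so no data points are skipped. Also, precision only affects the labels that pd.cut generates itself, so it has no effect when labels is passed explicitly.
Therefore, change the line to
labels = [f'{i:.1f}-{j:.1f}' for i, j in zip(bins[:-1], bins[1:])]
which rounds each edge to one decimal place in the label. The separate line labels[0] = '{}-{}'.format(bins[0], bins[1]) is then no longer needed.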
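For reference, here is a minimal end-to-end sketch of the question's code with the one-decimal labels applied. It reuses the data from the question; building the DataFrame from a dict and passing observed=False to groupby are my own choices for this sketch, not part of the original code.

```python
import numpy as np
import pandas as pd

# Same data as in the question, built from a dict for brevity (my choice).
data = pd.DataFrame({
    'Score': [1, 1, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 2, 1, 1,
              2, 1, 0, 1, 1, -1, 1, 0, 1, 1, 0, 1, 0, -2, 1],
    'Age': [29, 59, 44, 52, 60, 53, 45, 47, 57, 54, 35, 32, 48, 31, 49,
            43, 67, 32, 31, 42, 37, 45, 52, 59, 56, 57, 48, 45, 56, 31],
})

# np.histogram returns 11 float edges for 10 equal-width bins.
_, bins = np.histogram(data['Age'], 10)

# Format both edges of each bin to one decimal place. There is no "+ 1" on
# the lower edge, so each label starts where the previous one ends.
labels = [f'{lo:.1f}-{hi:.1f}' for lo, hi in zip(bins[:-1], bins[1:])]

# The binning itself is driven by `bins`; `labels` only names the intervals.
binned = pd.cut(data['Age'], bins=bins, labels=labels, include_lowest=True)

# observed=False (an assumption of this sketch) keeps every bin in the
# result, even one that happens to contain no rows.
df = data.groupby(binned, observed=False)['Score'].mean().reset_index()
print(df)
```

Because every age falls inside one of the contiguous intervals, binned contains no missing values, which is a quick way to confirm that nothing was skipped.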
Source: https://stackoverflow.com/questions/51777825/when-using-cut-in-a-pandas-dataframe-to-bin-it-why-is-the-binning-not-properly