I have an array of values and want to plot a histogram of it. I am mainly interested in the low-end values, so I want to collect every value above 300 in a single bin. This bin should have the same width as all the other (equally wide) bins. How can I do this?
Note: this question is related to another one: Defining bin width/x-axis scale in Matplotlib histogram
This is what I tried so far:
import matplotlib.pyplot as plt
import numpy as np

def plot_histogram_01():
    np.random.seed(1)
    values_A = np.random.choice(np.arange(600), size=200, replace=True).tolist()
    values_B = np.random.choice(np.arange(600), size=200, replace=True).tolist()

    # The last bin spans 300-600, so it is drawn 12 times wider than the others.
    bins = [0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 600]

    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist([values_A, values_B],
                                density=True,  # replaces the deprecated normed=1
                                bins=bins,
                                color=['#3782CC', '#AFD5FA'],
                                label=['A', 'B'])

    # Label each bin by its upper edge; str labels avoid b'...' ticks on Python 3.
    xlabels = bins[1:].astype(str)
    xlabels[-1] = '300+'

    N_labels = len(xlabels)
    plt.xlim([0, 600])
    plt.xticks(25 * np.arange(N_labels) + 12.5)
    ax.set_xticklabels(xlabels)

    plt.yticks([])
    plt.title('')
    plt.setp(patches, linewidth=0)
    plt.legend()

    fig.tight_layout()
    plt.savefig('my_plot_01.png')
    plt.close()

plot_histogram_01()
This is the result, which does not look nice:
I then changed the line with xlim in it:
plt.xlim([0, 325])
With the following result:
It looks more or less as I want it, but the last bin is not visible now. Which trick am I missing to visualize this last bin with a width of 25?
Numpy has a handy function for dealing with this: np.clip. Despite what the name may sound like, it doesn't remove values, it just limits them to the range you specify. Basically, it does Artem's "dirty hack" inline. You can leave the values as they are, but in the hist call, just wrap the array in an np.clip call, like so:
plt.hist(np.clip(values_A, bins[0], bins[-1]), bins=bins)
This is nicer for a number of reasons:
It's way faster, at least for large numbers of elements. NumPy does its work at the C level, whereas operating on Python lists (as in Artem's list comprehension) has a lot of overhead for each element. Basically, if you ever have the option to use NumPy, you should.
You do it right where it's needed, which reduces the chance of making mistakes in your code.
You don't need to keep a second copy of the array hanging around, which reduces memory usage (except within this one line) and further reduces the chances of making mistakes.
Using bins[0], bins[-1] instead of hard-coding the values reduces the chances of making mistakes again, because you can change the bins just where bins was defined; you don't need to remember to change them in the call to clip or anywhere else.
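As a quick check of the point that np.clip limits values rather than dropping them, here is a minimal sketch. The sample values are made up for illustration and are not taken from the question. It relies on np.histogram treating the last bin as closed on the right, so a value equal to bins[-1] is still counted.

```python
import numpy as np

# Hypothetical sample data, chosen only to make the counts easy to verify.
values = np.array([10, 30, 290, 310, 450, 599])
bins = np.arange(0, 350, 25)  # edges 0, 25, ..., 325

# Without clipping, values beyond the last edge are silently ignored.
raw_counts, _ = np.histogram(values, bins=bins)

# With clipping, they are moved onto the last edge and land in the final bin.
clipped = np.clip(values, bins[0], bins[-1])
counts, _ = np.histogram(clipped, bins=bins)

print(raw_counts.sum())  # -> 4 (450 and 599 were dropped)
print(counts.sum())      # -> 6 (nothing lost)
print(counts[-1])        # -> 3 (310, 450 and 599 share the 300-325 bin)
```

Because the clipped array only exists inside the call, the original values stay untouched for any later analysis.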
So to put it all together as in the OP:
import matplotlib.pyplot as plt
import numpy as np

def plot_histogram_01():
    np.random.seed(1)
    values_A = np.random.choice(np.arange(600), size=200, replace=True)
    values_B = np.random.choice(np.arange(600), size=200, replace=True)

    bins = np.arange(0,350,25)

    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist([np.clip(values_A, bins[0], bins[-1]),
                                 np.clip(values_B, bins[0], bins[-1])],
                                # normed=1,  # normed is deprecated; replace with density
                                density=True,
                                bins=bins, color=['#3782CC', '#AFD5FA'], label=['A', 'B'])

    xlabels = bins[1:].astype(str)
    xlabels[-1] += '+'

    N_labels = len(xlabels)
    plt.xlim([0, 325])
    plt.xticks(25 * np.arange(N_labels) + 12.5)
    ax.set_xticklabels(xlabels)

    plt.yticks([])
    plt.title('')
    plt.setp(patches, linewidth=0)
    plt.legend(loc='upper left')

    fig.tight_layout()

plot_histogram_01()
Sorry, I am not familiar with matplotlib, so I have a dirty hack for you: I put all values greater than 300 into one bin and changed the bin size.
The root of the problem is that matplotlib tries to put all bins on the plot. In R I would convert my bins to a factor variable, so that they are not treated as real numbers.
import matplotlib.pyplot as plt
import numpy as np

def plot_histogram_01():
    np.random.seed(1)
    values_A = np.random.choice(np.arange(600), size=200, replace=True).tolist()
    values_B = np.random.choice(np.arange(600), size=200, replace=True).tolist()

    # Move every value above 300 to 301 so that it falls into the 300-325 bin.
    values_A_to_plot = [301 if i > 300 else i for i in values_A]
    values_B_to_plot = [301 if i > 300 else i for i in values_B]

    bins = [0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325]

    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist([values_A_to_plot, values_B_to_plot],
                                density=True,  # replaces the deprecated normed=1
                                bins=bins,
                                color=['#3782CC', '#AFD5FA'],
                                label=['A', 'B'])

    # Label each bin by its upper edge; str labels avoid b'...' ticks on Python 3.
    xlabels = bins[1:].astype(str)
    xlabels[-1] = '300+'

    N_labels = len(xlabels)
    plt.xticks(25 * np.arange(N_labels) + 12.5)
    ax.set_xticklabels(xlabels)

    plt.yticks([])
    plt.title('')
    plt.setp(patches, linewidth=0)
    plt.legend()

    fig.tight_layout()
    plt.savefig('my_plot_01.png')
    plt.close()

plot_histogram_01()
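The R factor idea mentioned above can be approximated in matplotlib by counting first and then drawing plain bars at integer positions, so that the axis is categorical and every bin gets the same width. This is only a sketch of that alternative, not what either answer does: the helper names bin_labels, overflow_counts and plot_categorical are hypothetical, and labelling each bin by its lower edge (with '+' on the last one) is an assumption that differs from the upper-edge labels used in the question.

```python
import numpy as np


def bin_labels(edges):
    """Label each bin by its lower edge and mark the last bin as open-ended."""
    labels = [str(e) for e in np.asarray(edges)[:-1]]
    labels[-1] += '+'
    return labels


def overflow_counts(values, edges):
    """Count values per bin, folding everything past the last edge into the final bin."""
    edges = np.asarray(edges)
    clipped = np.clip(values, edges[0], edges[-1])
    counts, _ = np.histogram(clipped, bins=edges)
    return counts


def plot_categorical(values_by_label, edges, filename=None):
    """Draw grouped bars at integer positions so that all bins look equally wide."""
    # Imported here so that the counting helpers above need only NumPy.
    import matplotlib.pyplot as plt

    labels = bin_labels(edges)
    n_groups = max(len(values_by_label), 1)
    width = 0.8 / n_groups  # bar width; each group of bars spans 0.8 in total

    fig, ax = plt.subplots(figsize=(9, 5))
    for k, (name, values) in enumerate(values_by_label.items()):
        counts = overflow_counts(values, edges)
        ax.bar(np.arange(len(counts)) + k * width, counts, width=width, label=name)

    # Centre each tick under its group of bars.
    ax.set_xticks(np.arange(len(labels)) + 0.4 - width / 2)
    ax.set_xticklabels(labels)
    if values_by_label:
        ax.legend()
    if filename is not None:
        fig.savefig(filename)
    return fig, ax
```

With the arrays from the question, a call such as plot_categorical({'A': values_A, 'B': values_B}, np.arange(0, 350, 25)) would draw 13 equally wide groups, the last one labelled '300+'. Note that this plots raw counts rather than densities.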
Source: https://stackoverflow.com/questions/26218704/matplotlib-histogram-with-collection-bin-for-high-values