Matplotlib histogram with collection bin for high values

Submitted by 痞子三分冷 on 2019-12-02 18:08:44

NumPy has a handy function for dealing with this: np.clip. Despite what the name may suggest, it doesn't remove values; it just limits them to the range you specify. In effect, it does Artem's "dirty hack" inline. You can leave the values as they are and simply wrap the array in an np.clip call inside the hist call, like so:

plt.hist(np.clip(values_A, bins[0], bins[-1]), bins=bins)
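To make the behaviour of np.clip concrete, here is a minimal sketch on a small made-up array (the sample values and the 0 to 300 range are assumptions chosen only for illustration). It shows that out-of-range values are pulled to the nearest bound and that no elements are dropped:

```python
import numpy as np

# Made-up sample values; some fall outside the 0..300 range
values = np.array([-10, 5, 120, 299, 300, 450, 599])

clipped = np.clip(values, 0, 300)

print(clipped.tolist())             # [0, 5, 120, 299, 300, 300, 300]
print(clipped.size == values.size)  # True: nothing was removed
```

Because np.clip returns a new array, values itself is left untouched, which is why it is safe to call it directly inside the hist call.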

This is nicer for a number of reasons:

  1. It's way faster, at least for large numbers of elements. NumPy does its work at the C level, whereas operating on Python lists (as in Artem's list comprehension) has a lot of overhead for each element. If you have the option to use NumPy, it is usually worth taking.

  2. You do it right where it's needed, which reduces the chance of making mistakes in your code.

  3. You don't need to keep a second copy of the array hanging around, which reduces memory usage (except within this one line) and further reduces the chances of making mistakes.

  4. Using bins[0], bins[-1] instead of hard-coding the values reduces the chances of making mistakes again, because you can change the bins just where bins was defined; you don't need to remember to change them in the call to clip or anywhere else.
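The effect of clipping to bins[0] and bins[-1] can be checked without plotting anything. The following is a small sketch that uses np.histogram on deterministic, made-up data (the data and the printed checks are assumptions for illustration, not part of the original answer):

```python
import numpy as np

values = np.arange(0, 600, 3)   # illustrative data: 200 values from 0 to 597
bins = np.arange(0, 350, 25)    # edges 0, 25, ..., 325

# Clip to the outermost edges, exactly as in the hist call above
clipped = np.clip(values, bins[0], bins[-1])
counts, _ = np.histogram(clipped, bins=bins)

print(counts.sum() == values.size)          # True: every value is counted
print(counts[-1] == (values >= 300).sum())  # True: the last bin collects everything from 300 up
print(values.max())                         # 597: the original array is unchanged
```

If the bin edges change later, only the definition of bins needs editing; the clip bounds follow automatically.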

So to put it all together as in the OP:

import matplotlib.pyplot as plt
import numpy as np

def plot_histogram_01():
    np.random.seed(1)
    values_A = np.random.choice(np.arange(600), size=200, replace=True)
    values_B = np.random.choice(np.arange(600), size=200, replace=True)

    bins = np.arange(0,350,25)

    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist([np.clip(values_A, bins[0], bins[-1]),
                                 np.clip(values_B, bins[0], bins[-1])],
                                # normed=1,  # normed is deprecated; replace with density
                                density=True,
                                bins=bins, color=['#3782CC', '#AFD5FA'], label=['A', 'B'])

    xlabels = bins[1:].astype(int).astype(str)  # plt.hist returns float edges; cast so labels read '25', not '25.0'
    xlabels[-1] += '+'

    N_labels = len(xlabels)
    plt.xlim([0, 325])
    plt.xticks(25 * np.arange(N_labels) + 12.5)
    ax.set_xticklabels(xlabels)

    plt.yticks([])
    plt.title('')
    plt.setp(patches, linewidth=0)
    plt.legend(loc='upper left')

    fig.tight_layout()
    plt.show()

plot_histogram_01()

Sorry, I am not familiar with matplotlib, so I have a dirty hack for you: I put all values greater than 300 into one bin and changed the bin size.

The root of the problem is that matplotlib tries to put all bins on the plot. In R, I would convert my bins to a factor variable so that they are not treated as real numbers.

import matplotlib.pyplot as plt
import numpy as np

def plot_histogram_01():
    np.random.seed(1)
    values_A = np.random.choice(np.arange(600), size=200, replace=True).tolist()
    values_B = np.random.choice(np.arange(600), size=200, replace=True).tolist()
    values_A_to_plot = [301 if i > 300 else i for i in values_A]
    values_B_to_plot = [301 if i > 300 else i for i in values_B]

    bins = [0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325]

    fig, ax = plt.subplots(figsize=(9, 5))
    _, bins, patches = plt.hist([values_A_to_plot, values_B_to_plot],
                                density=True,  # replaces normed, which was deprecated and later removed
                                bins=bins,
                                color=['#3782CC', '#AFD5FA'],
                                label=['A', 'B'])

    xlabels = np.array(bins[1:], dtype=int).astype(str)  # str rather than bytes ('|S4'), so labels render cleanly on Python 3
    xlabels[-1] = '300+'

    N_labels = len(xlabels)

    plt.xticks(25 * np.arange(N_labels) + 12.5)
    ax.set_xticklabels(xlabels)

    plt.yticks([])
    plt.title('')
    plt.setp(patches, linewidth=0)
    plt.legend()

    fig.tight_layout()
    plt.savefig('my_plot_01.png')
    plt.close()

plot_histogram_01()
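The factor-variable idea mentioned above can be approximated in Matplotlib by counting first and then drawing the counts as a bar chart at evenly spaced positions, so the bins are treated as categories rather than as numbers. This is only a sketch under assumptions: the data is made up, the "300+" bin is created by appending np.inf as the last edge, and the label format is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np

values = np.arange(0, 600, 3)   # illustrative data: 200 values from 0 to 597
edges = np.arange(0, 325, 25)   # 0, 25, ..., 300

# An infinite last edge gives one open-ended bin for everything from 300 up
counts, _ = np.histogram(values, bins=np.append(edges, np.inf))
labels = [f"{lo}-{lo + 25}" for lo in edges[:-1]] + ["300+"]

# Draw the bins as categories at positions 0, 1, 2, ...
positions = np.arange(len(counts))
fig, ax = plt.subplots(figsize=(9, 5))
ax.bar(positions, counts, width=0.9, color="#3782CC")
ax.set_xticks(positions)
ax.set_xticklabels(labels, rotation=45)
fig.tight_layout()
fig.savefig("my_plot_categorical.png")  # hypothetical output file name
plt.close(fig)
```

Since the bar positions are plain integers, the overflow bin is drawn with the same width as the others, whatever range of values it actually covers.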
