Question
Is there a general way to join SciPy (or NumPy) probability distributions to create a mixture probability distribution which can then be sampled from?
I have such a distribution for display purposes, created with something like:
mixture_gaussian = (norm.pdf(x_axis, -3, 1) + norm.pdf(x_axis, 3, 1)) / 2
which, when plotted, shows the expected bimodal curve.
However, I can't sample from this generated model, since it's just a list of points that will plot as the curve.
Note, this specific distribution is just a simple example. I'd like to be able to generate several kinds of distributions (including "sub"-distributions which are not just normal distributions). Ideally, there would be some way for the function to be automatically normalized (i.e., not having to do the / 2 explicitly, as in the code above).
Does SciPy/NumPy provide some way of easily accomplishing this?
This answer provides a way that such sampling from multiple distributions could be done, but it certainly requires a bit of handcrafting for a given mixture distribution, especially when wanting to weight different "sub"-distributions differently. This is usable, but I would hope for a method that's a bit cleaner and more straightforward if possible. Thanks!
Answer 1:
Sampling from a mixture of distributions (where PDFs are added with some coefficients c_1, c_2, ..., c_n) is equivalent to sampling from each component independently and then, for each index, picking the value from the k-th sample with probability c_k.
The latter (mixing) step can be done efficiently with numpy.random.choice. Below is an example in which three distributions are mixed: the distributions are listed in distributions, and their coefficients in coefficients. There is a broad normal distribution, a uniform distribution, and a narrow normal distribution, with coefficients 0.5, 0.2, and 0.3. The mixing happens at data[np.arange(sample_size), random_idx], after random_idx has been generated according to the given coefficients.
import numpy as np
import matplotlib.pyplot as plt

distributions = [
    {"type": np.random.normal, "kwargs": {"loc": -3, "scale": 2}},
    {"type": np.random.uniform, "kwargs": {"low": 4, "high": 6}},
    {"type": np.random.normal, "kwargs": {"loc": 2, "scale": 1}},
]
coefficients = np.array([0.5, 0.2, 0.3])
coefficients /= coefficients.sum()  # in case these did not add up to 1
sample_size = 100000

num_distr = len(distributions)
data = np.zeros((sample_size, num_distr))
# Draw sample_size values from each component distribution.
for idx, distr in enumerate(distributions):
    data[:, idx] = distr["type"](size=(sample_size,), **distr["kwargs"])
# For each row, pick one component according to the coefficients.
random_idx = np.random.choice(np.arange(num_distr), size=(sample_size,), p=coefficients)
sample = data[np.arange(sample_size), random_idx]
plt.hist(sample, bins=100, density=True)
plt.show()
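As a sanity check (not part of the original answer, just a sketch continuing from the snippet above), one can overlay the analytic mixture PDF, 0.5·N(-3, 2) + 0.2·U(4, 6) + 0.3·N(2, 1), on the histogram using scipy.stats:

from scipy import stats

x = np.linspace(-10, 10, 1000)
# Weighted sum of the component PDFs; uniform.pdf(x, loc=4, scale=2) is uniform on [4, 6].
true_pdf = (0.5 * stats.norm.pdf(x, loc=-3, scale=2)
            + 0.2 * stats.uniform.pdf(x, loc=4, scale=2)
            + 0.3 * stats.norm.pdf(x, loc=2, scale=1))

plt.hist(sample, bins=100, density=True, alpha=0.5)
plt.plot(x, true_pdf)
plt.show()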
Answer 2:
Following @PaulPanzer's pointer in the comments, I created the following subclass for easily creating mixture models from the SciPy distributions. Note, the pdf is not required for my question, but it was nice for me to have.
import numpy as np
from scipy.stats import rv_continuous, norm

class MixtureModel(rv_continuous):
    def __init__(self, submodels, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.submodels = submodels

    def _pdf(self, x):
        # Equal-weight average of the component PDFs.
        pdf = self.submodels[0].pdf(x)
        for submodel in self.submodels[1:]:
            pdf += submodel.pdf(x)
        pdf /= len(self.submodels)
        return pdf

    def rvs(self, size):
        # Pick a submodel uniformly at random for each draw.
        submodel_choices = np.random.randint(len(self.submodels), size=size)
        submodel_samples = [submodel.rvs(size=size) for submodel in self.submodels]
        rvs = np.choose(submodel_choices, submodel_samples)
        return rvs
mixture_gaussian_model = MixtureModel([norm(-3, 1), norm(3, 1)])
x_axis = np.arange(-6, 6, 0.001)
mixture_pdf = mixture_gaussian_model.pdf(x_axis)
mixture_rvs = mixture_gaussian_model.rvs(10)
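The class above gives every submodel equal weight. A possible extension (a sketch, not from the original answer; the WeightedMixtureModel name and weights argument are illustrative) accepts per-component weights and normalizes them, which also removes the need for the explicit / 2 mentioned in the question:

import numpy as np
from scipy.stats import rv_continuous, norm

class WeightedMixtureModel(rv_continuous):
    def __init__(self, submodels, weights, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.submodels = submodels
        # Normalize so the weights sum to 1.
        self.weights = np.asarray(weights, dtype=float)
        self.weights /= self.weights.sum()

    def _pdf(self, x):
        # Weighted sum of the component PDFs.
        return sum(w * m.pdf(x) for w, m in zip(self.weights, self.submodels))

    def rvs(self, size):
        # Pick a component for each draw according to the weights,
        # then take that component's value for that draw.
        choices = np.random.choice(len(self.submodels), size=size, p=self.weights)
        samples = np.array([m.rvs(size=size) for m in self.submodels])
        return samples[choices, np.arange(size)]

weighted_model = WeightedMixtureModel([norm(-3, 1), norm(3, 1)], weights=[0.7, 0.3])
weighted_sample = weighted_model.rvs(1000)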
Answer 3:
The code below stores 1000 samples from N(0, 1) and 500 samples from N(7, 2) in an array that can then be sampled from.
import numpy as np
from scipy import stats
d = np.concatenate((stats.norm.rvs(0.0, 1.0, 1000), stats.norm.rvs(7.0, 2.0, 500)))
np.random.choice(d, 3) # sample 3 observations
Mixture components other than the normal can be used (e.g., stats.poisson), and there can be an arbitrary number of them.
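For instance, here is a sketch of the same idea with unequal weights and a non-normal component (the specific parameters are illustrative, not from the original answer):

import numpy as np
from scipy import stats

n_total = 10000
weights = np.array([0.7, 0.3])            # mixture coefficients
counts = (weights * n_total).astype(int)  # samples contributed by each component

d = np.concatenate((
    stats.poisson.rvs(mu=3, size=counts[0]),              # 70% Poisson(3)
    stats.norm.rvs(loc=7.0, scale=2.0, size=counts[1]),   # 30% N(7, 2)
))
np.random.choice(d, 1000)  # draw 1000 observations from the mixture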
Source: https://stackoverflow.com/questions/47759577/creating-a-mixture-of-probability-distributions-for-sampling