How to estimate density function and calculate its peaks?

天涯浪人 2021-02-04 11:33

I have started to use Python for analysis. I would like to do the following:

  1. Get the distribution of a dataset
  2. Find the peaks of this distribution

1 Answer
  • 2021-02-04 12:14

    I think you need to distinguish non-parametric density estimation (the one implemented in scipy.stats.gaussian_kde) from parametric density estimation (the one in the StackOverflow question you mention). To illustrate the difference between the two, try the following code.

    import pandas as pd
    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt
    
    np.random.seed(0)
    gaussian1 = -6 + 3 * np.random.randn(1700)
    gaussian2 = 4 + 1.5 * np.random.randn(300)
    gaussian_mixture = np.hstack([gaussian1, gaussian2])
    
    df = pd.DataFrame(gaussian_mixture, columns=['data'])
    
    # non-parametric pdf
    nparam_density = stats.gaussian_kde(df['data'].values)
    x = np.linspace(-20, 10, 200)
    nparam_density = nparam_density(x)
    
    # parametric fit: assume normal distribution
    loc_param, scale_param = stats.norm.fit(df['data'].values)
    param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)
    
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.hist(df['data'].values, bins=30, density=True)
    ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
    ax.plot(x, param_density, 'k--', label='parametric density')
    ax.set_ylim([0, 0.15])
    ax.legend(loc='best')
    plt.show()
    

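    [Figure: histogram of the data with the non-parametric density (red solid line) and the parametric normal fit (black dashed line) overlaid]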

    From the graph, we can see that the non-parametric density is essentially a smoothed version of the histogram. A histogram represents an observation x = x0 with a bar (all of its probability mass goes into the bin containing x0 and none elsewhere), whereas non-parametric density estimation represents that point with a bell-shaped curve (the Gaussian kernel) that spreads its mass over the neighbourhood of x0. Adding up these curves gives a smooth density. The Gaussian kernel has nothing to do with any distributional assumption about the underlying data; its sole purpose is smoothing.
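
    The "one bell curve per observation" idea can be checked by hand. The following is a minimal sketch on a small made-up sample (the sample, the grid and the variable names are illustrative assumptions, not part of the example above): it averages one Gaussian bump per data point and compares the result with what gaussian_kde returns, using the bandwidth that gaussian_kde chose.

```python
import numpy as np
from scipy import stats

# Small illustrative sample (hypothetical values, not the answer's dataset).
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=50)
grid = np.linspace(-4, 4, 101)

kde = stats.gaussian_kde(sample)
# In 1-D the kernel's standard deviation (the bandwidth) is the square
# root of the estimator's 1x1 covariance matrix.
bandwidth = np.sqrt(kde.covariance[0, 0])

# One Gaussian bump per observation, each centred on its data point.
# Rows index grid points, columns index observations.
bumps = stats.norm.pdf(grid[:, None], loc=sample[None, :], scale=bandwidth)

# The density estimate is the average of the bumps.
manual_density = bumps.mean(axis=1)

print(np.allclose(manual_density, kde(grid)))
```

    A wider bandwidth gives flatter bumps and a smoother curve; a narrower one follows the data more closely.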

    To get the mode of the non-parametric density, we need an exhaustive search, because the density is not guaranteed to be unimodal. In the example above, if a quasi-Newton optimizer starts in [5, 10], it will very likely converge to the local maximum (near x = 4) rather than the global one.
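
    The dependence on the starting point can be seen with a small experiment. This is only a sketch: it regenerates the same mixture, and it uses Nelder-Mead from scipy.optimize.minimize as a stand-in local optimizer rather than a quasi-Newton method; the two starting points are arbitrary assumptions. Any purely local method is subject to the same effect.

```python
import numpy as np
from scipy import optimize, stats

# Same mixture as in the example above, regenerated so the sketch is self-contained.
np.random.seed(0)
data = np.hstack([-6 + 3 * np.random.randn(1700),
                  4 + 1.5 * np.random.randn(300)])
kde = stats.gaussian_kde(data)

def negative_density(v):
    # minimize() passes a length-1 array; return a scalar to minimise.
    return -kde(v)[0]

# Hypothetical starting points, one on each side of the valley between the peaks.
modes = {}
for start in (-10.0, 7.0):
    result = optimize.minimize(negative_density, x0=[start], method="Nelder-Mead")
    modes[start] = result.x[0]
    print(f"start = {start:5.1f} -> stops near x = {modes[start]:.2f}, "
          f"density = {kde(result.x)[0]:.4f}")
```

    The run started at 7 should stop on the smaller peak, which is why a search over the whole range is needed to find the global mode.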

    # get mode: exhaustive search over the grid
    x[np.argmax(nparam_density)]
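
    The grid search above returns only the highest peak. If all peaks of the distribution are wanted, one option is scipy.signal.find_peaks applied to the density evaluated on the grid. The sketch below regenerates the same data; the prominence threshold of 0.005 is an assumed value chosen to ignore tiny bumps in the tails and would need tuning for other data.

```python
import numpy as np
from scipy import signal, stats

# Same mixture and grid as in the example above.
np.random.seed(0)
data = np.hstack([-6 + 3 * np.random.randn(1700),
                  4 + 1.5 * np.random.randn(300)])
x = np.linspace(-20, 10, 200)
density = stats.gaussian_kde(data)(x)

# Indices of local maxima whose prominence exceeds the (assumed) threshold.
peak_idx, _ = signal.find_peaks(density, prominence=0.005)
peak_x = x[peak_idx]
peak_height = density[peak_idx]

for px, ph in zip(peak_x, peak_height):
    print(f"peak at x = {px:.2f}, density = {ph:.4f}")

# The global mode is the tallest of the detected peaks.
global_mode = peak_x[np.argmax(peak_height)]
print(f"global mode near x = {global_mode:.2f}")
```

    The peak locations are only as precise as the grid spacing, so a finer np.linspace gives a finer answer.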
    