Question
I am trying to find the best-fitting distribution curve for my data, which consists of
y-axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44,
0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05,
0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]
The y-axis values are probabilities of an event occurring in the x-axis time bins:
x-axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0,
12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0,
22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0,
32.0, 33.0, 34.0]
I am doing this in Python, following the example given in "Fitting empirical distribution to theoretical ones with Scipy (Python)?"
Specifically, I am attempting to recreate the part called 'Distribution Fitting with Sum of Square Error (SSE)', where you run through the different distributions to find the best fit to the data.
How can I modify that example to make it work on my data inputs?
Updated version based on Bill's response. I am now trying to plot the fitted curve against the data, and something looks off:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import scipy
import scipy.stats
import numpy as np
from scipy.stats import gamma, lognorm, loglaplace
from scipy.optimize import curve_fit
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]
matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')
def f(x, a, loc, scale):
    return gamma.pdf(x, a, loc, scale)
result, pcov = curve_fit(f, x_axis, y_axis)
# get curve shape, location, scale
shape = result[:-2]
loc = result[-2]
scale = result[-1]
# construct the curve
x = np.linspace(0, 36, 100)
y = f(x, *result)
width = 0.8  # assumed bar width; the original snippet never defines it
plt.bar(x_axis, y_axis, width, alpha=0.75)
plt.plot(x, y, c='g')
Answer 1:
Your situation is not the same as the one treated in the question you cited. You have both the ordinates and the abscissae of the data points, rather than the usual i.i.d. sample. I would suggest that you use scipy's curve_fit. Here's a sample.
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]
## y_axis values must be normalised
sum_ys = sum(y_axis)
y_axis = [_/sum_ys for _ in y_axis]
print (sum(y_axis))
from scipy.stats import gamma, norm
from scipy.optimize import curve_fit
def gamma_f(x, a, loc, scale):
    return gamma.pdf(x, a, loc, scale)
def norm_f(x, loc, scale):
    return norm.pdf(x, loc, scale)
fitting = norm_f
result = curve_fit(fitting, x_axis, y_axis)
print (result)
import matplotlib.pyplot as plt
plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, [fitting(_, *result[0]) for _ in x_axis], 'b-')
plt.axis([0,35,0,.5])
plt.show()
This version produces one plot, the normal fit to the data. (The gamma provides a poor fit.) The normal needs only two parameters, location and scale. In general you need only the first element of the result, which holds the parameter estimates (shape, location and scale); the second element is their covariance matrix. A three-parameter fit prints something like this:
(array([ 2.3352639 , -3.08105104, 10.15024823]), array([[ 5954.86532869, -27818.92220973, -19675.22421994],
[ -27818.92220973, 133161.76500251, 90741.43608615],
[ -19675.22421994, 90741.43608615, 66054.79087992]]))
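As a rough check on how well determined those estimates are, the diagonal of the covariance matrix can be turned into one-standard-deviation errors, which is the recipe given in the curve_fit documentation. The sketch below simply re-enters the numbers printed above; the names popt, pcov and perr are my own labels, not part of the original code.

```python
import numpy as np

# Parameter estimates and covariance matrix copied from the printed result above.
popt = np.array([2.3352639, -3.08105104, 10.15024823])
pcov = np.array([
    [5954.86532869, -27818.92220973, -19675.22421994],
    [-27818.92220973, 133161.76500251, 90741.43608615],
    [-19675.22421994, 90741.43608615, 66054.79087992],
])

# One-standard-deviation errors: square roots of the diagonal of the covariance.
perr = np.sqrt(np.diag(pcov))

for name, estimate, error in zip(("shape", "loc", "scale"), popt, perr):
    print(f"{name}: {estimate:.3f} +/- {error:.1f}")
```

Here every standard error is far larger than its estimate, which is consistent with the remark that this three-parameter fit is poor.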
Note that the pdf of the gamma distribution is available in scipy, as, I think, are the others you need, which saves you the work of coding them.
The most important thing I omitted from the first code was the need to normalise the y-values, that is, to make them sum to one, since they should approximate histogram heights.
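To get back to the sum-of-squared-errors comparison the question asks about, the same curve_fit approach can be looped over several candidate pdfs and the SSE computed for each. This is only a sketch under my own assumptions: the candidates dictionary, the starting values p0 and the bounds (which keep shape and scale positive) are illustrative choices, not something from the code above.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import gamma, norm

x = np.arange(1.0, 35.0)  # time bins 1.0 .. 34.0
y = np.array([0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44,
              0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05,
              0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0])
y = y / y.sum()  # normalise so the heights sum to one (bins have unit width)

def norm_f(x, loc, scale):
    return norm.pdf(x, loc, scale)

def gamma_f(x, a, loc, scale):
    return gamma.pdf(x, a, loc, scale)

# name -> (pdf wrapper, assumed starting values, assumed (lower, upper) bounds)
candidates = {
    "norm": (norm_f, [12.0, 5.0],
             ([-np.inf, 1e-6], [np.inf, np.inf])),
    "gamma": (gamma_f, [2.0, 0.0, 5.0],
              ([1e-3, -np.inf, 1e-6], [np.inf, np.inf, np.inf])),
}

sse = {}
for name, (func, p0, bounds) in candidates.items():
    try:
        params, _ = curve_fit(func, x, y, p0=p0, bounds=bounds)
    except (RuntimeError, ValueError):
        continue  # skip candidates whose fit does not converge
    residuals = y - func(x, *params)
    sse[name] = float(np.sum(residuals ** 2))

best = min(sse, key=sse.get)  # candidate with the smallest SSE
print(sse, best)
```

Other scipy.stats distributions can be added to the dictionary in the same way, provided each gets sensible starting values; a poor p0 is the usual reason a candidate fails to converge.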
Answer 2:
I tried your example using the OpenTURNS platform. Here is what I got.
I started with the same data as you, after importing openturns and openturns.viewer.View for plotting:
import openturns as ot
from openturns.viewer import View
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0,
12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0,
22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0,
32.0, 33.0, 34.0]
y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44,
0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05,
0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]
First step: we can define the corresponding distribution
distribution = ot.UserDefined(ot.Sample([[s] for s in x_axis]), y_axis)
graph = distribution.drawPDF()
graph.setColors(["black"])
graph.setLegends(["your input"])
At this stage, calling View(graph) displays the PDF of your input.
Second step: we can derive a sample from the obtained distribution
sample = distribution.getSample(10000)
This sample can then be used to fit any kind of distribution. I tried the WeibullMin and Gamma distributions:
# WeibullMin Factory
distribution2 = ot.WeibullMinFactory().build(sample)
print(distribution2)
>>> WeibullMin(beta = 8.83969, alpha = 1.48142, gamma = 4.76832)
graph2 = distribution2.drawPDF() ; graph2.setLegends(["Best WeibullMin"])
# Gamma Factory
distribution3 = ot.GammaFactory().build(sample)
print(distribution3)
>>> Gamma(k = 2.08142, lambda = 0.25157, gamma = 4.9995)
graph3 = distribution3.drawPDF() ; graph3.setLegends(["Best Gamma"]) ;
graph3.setColors(["blue"])
# plotting all the results
graph.add(graph2) ; graph.add(graph3)
View(graph)
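The same two-step idea (treat the bins as a discrete distribution, draw a sample, then fit by maximum likelihood) can also be sketched with NumPy and scipy.stats. This is a swapped-in illustration, not the OpenTURNS method: the seed, the sample size, and fixing loc to 0 (done here for numerical stability, whereas the OpenTURNS Gamma above has a free location) are my own assumptions, so the numbers will not match those printed above.

```python
import numpy as np
from scipy.stats import gamma

x = np.arange(1.0, 35.0)  # time bins 1.0 .. 34.0
w = np.array([0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44,
              0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05,
              0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0])
p = w / w.sum()  # bin weights as probabilities

# Draw a sample of bin positions according to those probabilities.
rng = np.random.default_rng(0)  # assumed seed, for reproducibility
sample = rng.choice(x, size=10000, p=p)

# Maximum-likelihood gamma fit with the location fixed at 0 (assumption).
a, loc, scale = gamma.fit(sample, floc=0)
print(a, loc, scale)
```

Once a sample exists, any scipy.stats distribution with a fit method can be tried in the same way, which is the setting the linked question on fitting empirical distributions assumes.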
Source: https://stackoverflow.com/questions/43151324/python-distribution-fitting-with-sum-of-square-error-sse