问题
In reading and experimenting with numpy.random, I can't seem to find or create what I need; a 10 parameter Python pseudo-random value generator including count, min, max, mean, sd, 25th%ile, 50th%ile (median), 75th%ile, skew, and kurtosis.
From https://docs.python.org/3/library/random.html I see these distributions uniform, normal (Gaussian), lognormal, negative exponential, gamma, and beta distributions, though I need to generate values directly to a distribution defined only by my 10 parameters, with no reference to a distribution family.
Is there documentation, or an author(s), of a numpy.random.xxxxxx(n, min, max, mean, sd, 25%, 50%, 75%, skew, kurtosis), or what please is the closest existing source code that I might modify to achieve this goal?
This would be the reverse of describe() including skew and kurtosis in a way. I could do a loop or optimize until a criteria is met with random generated numbers, though that could take an infinite amount of time to meet my 10 parameters.
I have found optim in R which generates a data set, but have so far been able to increase the parameters in the R optim source code or duplicate it with Python scipy.optimize or similar, though these still depend on methods instead of directly psudo-randomly creating a data set according to my 10 parameters as I need to;
m0 <- 20
sd0 <- 5
min <- 1
max <- 45
n <- 15
set.seed(1)
mm <- min:max
x0 <- sample(mm, size=n, replace=TRUE)
objfun <- function(x) {(mean(x)-m0)^2+(sd(x)-sd0)^2}
candfun <- function(x) {x[sample(n, size=1)] <- sample(mm, size=1)
return(x)}
objfun(x0) ##INITIAL RESULT:83.93495
o1 <- optim(par=x0, fn=objfun, gr=candfun, method="SANN", control=list(maxit=1e6))
mean(o1$par) ##INITIAL RESULT:20
sd(o1$par) ##INITIAL RESULT:5
plot(table(o1$par))
回答1:
The most general way to generate a random number following a distribution is as follows:
- Generate a uniform random number bounded by 0 and 1 (e.g.,
numpy.random.random()
). - Take the inverse CDF (inverse cumulative distribution function) of that number.
The result is a number that follows the distribution.
In your case, the inverse CDF (ICDF(x)
) is determined already by five of your parameters -- the minimum, maximum, and three percentiles, as follows:
- ICDF(0) = minimum
- ICDF(0.25) = 25th percentile
- ICDF(0.5) = 50th percentile
- ICDF(0.75) = 75th percentile
- ICDF(1) = maximum
Thus, you already have some idea of how the inverse CDF looks like. And all you have to do now is somehow optimize the inverse CDF for the other parameters (mean, standard deviation, skewness, and kurtosis). For example, you can "fill in" the inverse CDF at the other percentiles and see how well they match the parameters you're going after. In this sense, a good starting guess is a linear interpolation of the percentiles just mentioned. Another thing to keep in mind is that the inverse CDF "can never go down".
The following code shows a solution. It does the following steps:
- It calculates an initial guess for the inverse CDF via a linear interpolation. The initial guess consists of the values of that function at 101 evenly spaced points, including the five percentiles mentioned above.
- It sets up the bounds of the optimization. The optimization is bounded by the minimum and maximum values everywhere except at the five percentiles.
- It sets up the other four parameters.
- It then passes the objective function (
_lossfunc
), the initial guess, the bounds and the other parameters to SciPy'sscipy.optimize.minimize
method for optimization. - Once the optimization finishes, the code checks for success and raises an error if unsuccessful.
- If the optimization succeeds, the code calculates an inverse CDF for the final result.
- It generates N uniform random values.
- It transforms those values with the inverse CDF and returns those values.
import scipy.stats.mstats as mst
from scipy.optimize import minimize
from scipy.interpolate import interp1d
import numpy
# Define the loss function, which compares the calculated
# and ideal parameters
def _lossfunc(x, *args):
mean, stdev, skew, kurt, chunks = args
st = (
(numpy.mean(x) - mean) ** 2
+ (numpy.sqrt(numpy.var(x)) - stdev) ** 2
+ ((mst.skew(x) - skew)) ** 2
+ ((mst.kurtosis(x) - kurt)) ** 2
)
return st
def adjust(rx, percentiles):
eps = (max(rx) - min(rx)) / (3.0 * len(rx))
# Make result monotonic
for i in range(1, len(rx)):
if (
i - 2 >= 0
and rx[i - 2] < rx[i - 1]
and rx[i - 1] >= rx[i]
and rx[i - 2] < rx[i]
):
rx[i - 1] = (rx[i - 2] + rx[i]) / 2.0
elif rx[i - 1] >= rx[i]:
rx[i] = rx[i - 1] + eps
# Constrain to percentiles
for pi in range(1, len(percentiles)):
previ = percentiles[pi - 1][0]
prev = rx[previ]
curr = rx[percentiles[pi][0]]
prevideal = percentiles[pi - 1][1]
currideal = percentiles[pi][1]
realrange = max(eps, curr - prev)
idealrange = max(eps, currideal - prevideal)
for i in range(previ + 1, percentiles[pi][0]):
if rx[i] >= currideal or rx[i] <= prevideal:
rx[i] = (
prevideal
+ max(eps * (i - previ + 1 + 1), rx[i] - prev) * idealrange / realrange
)
rx[percentiles[pi][0]] = currideal
# Make monotonic again
for pi in range(1, len(percentiles)):
previ = percentiles[pi - 1][0]
curri = percentiles[pi][0]
for i in range(previ+1, curri+1):
if (
i - 2 >= 0
and rx[i - 2] < rx[i - 1]
and rx[i - 1] >= rx[i]
and rx[i - 2] < rx[i]
and i-1!=previ and i-1!=curri
):
rx[i - 1] = (rx[i - 2] + rx[i]) / 2.0
elif rx[i - 1] >= rx[i] and i!=curri:
rx[i] = rx[i - 1] + eps
return rx
# Calculates an inverse CDF for the given nine parameters.
def _get_inverse_cdf(mn, p25, p50, p75, mx, mean, stdev, skew, kurt, chunks=100):
if chunks < 0:
raise ValueError
# Minimum of 16 chunks
chunks = max(16, chunks)
# Round chunks up to closest multiple of 4
if chunks % 4 != 0:
chunks += 4 - (chunks % 4)
# Calculate initial guess for the inverse CDF; an
# interpolation of the inverse CDF through the known
# percentiles
interp = interp1d([0, 0.25, 0.5, 0.75, 1.0], [mn, p25, p50, p75, mx], kind="cubic")
rnge = mx - mn
x = interp(numpy.linspace(0, 1, chunks + 1))
# Bounds, taking percentiles into account
bounds = [(mn, mx) for i in range(chunks + 1)]
percentiles = [
[0, mn],
[int(chunks * 1 / 4), p25],
[int(chunks * 2 / 4), p50],
[int(chunks * 3 / 4), p75],
[int(chunks), mx],
]
for p in percentiles:
bounds[p[0]] = (p[1], p[1])
# Other parameters
otherParams = (mean, stdev, skew, kurt, chunks)
# Optimize the result for the given parameters
# using the initial guess and the bounds
result = minimize(
_lossfunc, # Loss function
x, # Initial guess
otherParams, # Arguments
bounds=bounds,
)
rx = result.x
if result.success:
adjust(rx, percentiles)
# Minimize again
result = minimize(
_lossfunc, # Loss function
rx, # Initial guess
otherParams, # Arguments
bounds=bounds,
)
rx = result.x
adjust(rx, percentiles)
# Minimize again
result = minimize(
_lossfunc, # Loss function
rx, # Initial guess
otherParams, # Arguments
bounds=bounds,
)
rx = result.x
# Calculate interpolating function of result
ls = numpy.linspace(0, 1, chunks + 1)
success = result.success
icdf=interp1d(ls, rx, kind="linear")
# == To check the quality of the result
if False:
meandiff = numpy.mean(rx) - mean
stdevdiff = numpy.sqrt(numpy.var(rx)) - stdev
print(meandiff)
print(stdevdiff)
print(mst.skew(rx)-skew)
print(mst.kurtosis(rx)-kurt)
print(icdf(0)-percentiles[0][1])
print(icdf(0.25)-percentiles[1][1])
print(icdf(0.5)-percentiles[2][1])
print(icdf(0.75)-percentiles[3][1])
print(icdf(1)-percentiles[4][1])
return (icdf, success)
def random_10params(n, mn, p25, p50, p75, mx, mean, stdev, skew, kurt):
""" Note: Kurtosis as used here is Fisher's kurtosis,
or kurtosis excess. Stdev is square root of numpy.var(). """
# Calculate inverse CDF
icdf, success = (None, False)
tries = 0
# Try up to 10 times to get a converging inverse CDF, increasing the mesh each time
chunks = 500
while tries < 10:
icdf, success = _get_inverse_cdf(mn, p25, p50, p75, mx, mean, stdev, skew, kurt,chunks=chunks)
tries+=1
chunks+=100
if success: break
if not success:
print("Warning: Estimation failed and may be inaccurate")
# Generate uniform random variables
npr=numpy.random.random(size=n)
# Transform them with the inverse CDF
return icdf(npr)
Example:
print(random_10params(n=1000, mn=39, p25=116, p50=147, p75=186, mx=401, mean=154.1207, stdev=52.3257, skew=.7083, kurt=.5383))
One last note: If you have access to the underlying data points, rather than just their statistics, there are other methods you can use to sample from the distribution those data points form. Examples include kernel density estimations, histograms, or regression models (particularly for time series data). See also Generate random data based on existing data.
来源:https://stackoverflow.com/questions/61433438/how-can-python-use-n-min-max-mean-std-25-50-75-skew-kurtosis-to-defi