How can Python use n, min, max, mean, std, 25%, 50%, 75%, Skew, Kurtosis to define a pseudo-random Probability Density Estimate/Function?

Question

In reading and experimenting with numpy.random, I can't seem to find or create what I need: a 10-parameter Python pseudo-random value generator that takes count, min, max, mean, sd, 25th percentile, 50th percentile (median), 75th percentile, skew, and kurtosis.

From https://docs.python.org/3/library/random.html I see the uniform, normal (Gaussian), lognormal, negative exponential, gamma, and beta distributions, but I need to generate values directly from a distribution defined only by my 10 parameters, with no reference to a distribution family.

Is there documentation for, or an author of, something like numpy.random.xxxxxx(n, min, max, mean, sd, 25%, 50%, 75%, skew, kurtosis)? If not, what is the closest existing source code that I might modify to achieve this goal?

In a way, this would be the reverse of describe(), extended with skew and kurtosis. I could loop or optimize over randomly generated numbers until a criterion is met, though that could take an unbounded amount of time to meet all 10 parameters.

I have found optim in R, which generates a data set, but have so far been unable to increase the number of parameters in the R optim code or to duplicate it with Python's scipy.optimize or similar. These approaches also still depend on optimization methods instead of directly pseudo-randomly creating a data set according to my 10 parameters, as I need to:

m0 <- 20
sd0 <- 5
min <- 1
max <- 45
n <- 15
set.seed(1)
mm <- min:max
x0 <- sample(mm, size=n, replace=TRUE)
objfun <- function(x) {(mean(x)-m0)^2+(sd(x)-sd0)^2}
candfun <- function(x) {x[sample(n, size=1)] <- sample(mm, size=1)
    return(x)}
objfun(x0) ##INITIAL RESULT:83.93495
o1 <- optim(par=x0, fn=objfun, gr=candfun, method="SANN", control=list(maxit=1e6))
mean(o1$par) ##INITIAL RESULT:20
sd(o1$par) ##INITIAL RESULT:5
plot(table(o1$par))

Answer 1:

The most general way to generate a random number following a distribution is as follows:

Generate a uniform random number bounded by 0 and 1 (e.g., numpy.random.random()).
Take the inverse CDF (inverse cumulative distribution function) of that number.

The result is a number that follows the distribution.
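As a minimal sketch of these two steps, the following example applies them to a distribution whose inverse CDF has a closed form, the exponential distribution. The function name sample_exponential and the rate value are illustrative assumptions, not part of the solution further below.

```python
import numpy as np

def sample_exponential(rate, size, rng):
    """Inverse transform sampling for an exponential distribution."""
    # Step 1: uniform random numbers in [0, 1)
    u = rng.random(size)
    # Step 2: apply the inverse CDF, which here is -ln(1 - u) / rate
    return -np.log1p(-u) / rate

rng = np.random.default_rng(12345)
samples = sample_exponential(rate=2.0, size=100_000, rng=rng)
# The sample mean should be close to 1 / rate = 0.5
print(samples.mean())
```

The same pattern works for any distribution once its inverse CDF is available, whether as a formula or as an interpolated table.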

In your case, the inverse CDF (ICDF(x)) is determined already by five of your parameters -- the minimum, maximum, and three percentiles, as follows:

ICDF(0) = minimum
ICDF(0.25) = 25th percentile
ICDF(0.5) = 50th percentile
ICDF(0.75) = 75th percentile
ICDF(1) = maximum

Thus, you already have some idea of what the inverse CDF looks like. All you have to do now is optimize the inverse CDF for the other parameters (mean, standard deviation, skewness, and kurtosis). For example, you can "fill in" the inverse CDF at the other percentiles and see how well the result matches the parameters you're going after. In this sense, a good starting guess is a linear interpolation of the percentiles just mentioned. Another thing to keep in mind is that the inverse CDF "can never go down", that is, it must be non-decreasing.
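The starting guess described above can be sketched with numpy.interp. This is only the unoptimized guess: it honors the minimum, maximum, and three percentiles, but makes no attempt to match the mean, standard deviation, skewness, or kurtosis. The name icdf_linear is an illustrative assumption, and the five values are taken from the example call near the end of this answer.

```python
import numpy as np

# Known points of the inverse CDF: (probability, value) pairs
probs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
values = np.array([39.0, 116.0, 147.0, 186.0, 401.0])

def icdf_linear(u):
    """Piecewise-linear inverse CDF through the five known points."""
    return np.interp(u, probs, values)

rng = np.random.default_rng(0)
# Inverse transform sampling with the interpolated inverse CDF
sample = icdf_linear(rng.random(10_000))
# Sample quartiles should land near 116, 147, and 186
print(np.percentile(sample, [25, 50, 75]))
```

Because the values are increasing, the interpolated function is non-decreasing by construction, so it is a valid inverse CDF.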

The following code shows a solution. It does the following steps:

It calculates an initial guess for the inverse CDF by interpolating through the five known points (the code below uses a cubic interpolation). The initial guess consists of the values of that function at chunks + 1 evenly spaced points (501 on the first attempt), including the five percentiles mentioned above.
It sets up the bounds of the optimization. The optimization is bounded by the minimum and maximum values everywhere except at the five percentiles, where the value is pinned to the given percentile.
It sets up the other four parameters.
It then passes the objective function (_lossfunc), the initial guess, the bounds and the other parameters to SciPy's scipy.optimize.minimize method for optimization.
Once the optimization finishes, the code checks for success. If unsuccessful, it retries with a finer mesh, up to 10 attempts in total, and prints a warning if every attempt fails.
The code then builds an inverse CDF from the final result by linear interpolation.
It generates N uniform random values.
It transforms those values with the inverse CDF and returns those values.

import scipy.stats.mstats as mst
from scipy.optimize import minimize
from scipy.interpolate import interp1d
import numpy

# Define the loss function, which compares the calculated
# and ideal parameters
def _lossfunc(x, *args):
    mean, stdev, skew, kurt, chunks = args
    st = (
        (numpy.mean(x) - mean) ** 2
        + (numpy.sqrt(numpy.var(x)) - stdev) ** 2
        + ((mst.skew(x) - skew)) ** 2
        + ((mst.kurtosis(x) - kurt)) ** 2
    )
    return st

def adjust(rx, percentiles):
    eps = (max(rx) - min(rx)) / (3.0 * len(rx))
    # Make result monotonic
    for i in range(1, len(rx)):
        if (
            i - 2 >= 0
            and rx[i - 2] < rx[i - 1]
            and rx[i - 1] >= rx[i]
            and rx[i - 2] < rx[i]
        ):
            rx[i - 1] = (rx[i - 2] + rx[i]) / 2.0
        elif rx[i - 1] >= rx[i]:
            rx[i] = rx[i - 1] + eps
    # Constrain to percentiles
    for pi in range(1, len(percentiles)):
        previ = percentiles[pi - 1][0]
        prev = rx[previ]
        curr = rx[percentiles[pi][0]]
        prevideal = percentiles[pi - 1][1]
        currideal = percentiles[pi][1]
        realrange = max(eps, curr - prev)
        idealrange = max(eps, currideal - prevideal)
        for i in range(previ + 1, percentiles[pi][0]):
            if rx[i] >= currideal or rx[i] <= prevideal:
                rx[i] = (
                    prevideal
                    + max(eps * (i - previ + 1 + 1), rx[i] - prev) * idealrange / realrange
                )
        rx[percentiles[pi][0]] = currideal
    # Make monotonic again
    for pi in range(1, len(percentiles)):
        previ = percentiles[pi - 1][0]
        curri = percentiles[pi][0]
        for i in range(previ + 1, curri + 1):
            if (
                i - 2 >= 0
                and rx[i - 2] < rx[i - 1]
                and rx[i - 1] >= rx[i]
                and rx[i - 2] < rx[i]
                and i - 1 != previ
                and i - 1 != curri
            ):
                rx[i - 1] = (rx[i - 2] + rx[i]) / 2.0
            elif rx[i - 1] >= rx[i] and i != curri:
                rx[i] = rx[i - 1] + eps
    return rx

# Calculates an inverse CDF for the given nine parameters.
def _get_inverse_cdf(mn, p25, p50, p75, mx, mean, stdev, skew, kurt, chunks=100):
    if chunks < 0:
        raise ValueError
    # Minimum of 16 chunks
    chunks = max(16, chunks)
    # Round chunks up to closest multiple of 4
    if chunks % 4 != 0:
        chunks += 4 - (chunks % 4)
    # Calculate initial guess for the inverse CDF; an
    # interpolation of the inverse CDF through the known
    # percentiles
    interp = interp1d([0, 0.25, 0.5, 0.75, 1.0], [mn, p25, p50, p75, mx], kind="cubic")
    rnge = mx - mn
    x = interp(numpy.linspace(0, 1, chunks + 1))
    # Bounds, taking percentiles into account
    bounds = [(mn, mx) for i in range(chunks + 1)]
    percentiles = [
        [0, mn],
        [int(chunks * 1 / 4), p25],
        [int(chunks * 2 / 4), p50],
        [int(chunks * 3 / 4), p75],
        [int(chunks), mx],
    ]
    for p in percentiles:
        bounds[p[0]] = (p[1], p[1])
    # Other parameters
    otherParams = (mean, stdev, skew, kurt, chunks)
    # Optimize the result for the given parameters
    # using the initial guess and the bounds
    result = minimize(
        _lossfunc,  # Loss function
        x,  # Initial guess
        otherParams,  # Arguments
        bounds=bounds,
    )
    rx = result.x
    if result.success:
        adjust(rx, percentiles)
        # Minimize again
        result = minimize(
            _lossfunc,  # Loss function
            rx,  # Initial guess
            otherParams,  # Arguments
            bounds=bounds,
        )
        rx = result.x
        adjust(rx, percentiles)
        # Minimize again
        result = minimize(
            _lossfunc,  # Loss function
            rx,  # Initial guess
            otherParams,  # Arguments
            bounds=bounds,
        )
        rx = result.x
    # Calculate interpolating function of result
    ls = numpy.linspace(0, 1, chunks + 1)
    success = result.success
    icdf = interp1d(ls, rx, kind="linear")
    # == To check the quality of the result
    if False:
        meandiff = numpy.mean(rx) - mean
        stdevdiff = numpy.sqrt(numpy.var(rx)) - stdev
        print(meandiff)
        print(stdevdiff)
        print(mst.skew(rx) - skew)
        print(mst.kurtosis(rx) - kurt)
        print(icdf(0) - percentiles[0][1])
        print(icdf(0.25) - percentiles[1][1])
        print(icdf(0.5) - percentiles[2][1])
        print(icdf(0.75) - percentiles[3][1])
        print(icdf(1) - percentiles[4][1])
    return (icdf, success)

def random_10params(n, mn, p25, p50, p75, mx, mean, stdev, skew, kurt):
    """Note: Kurtosis as used here is Fisher's kurtosis,
    or kurtosis excess. Stdev is square root of numpy.var()."""
    # Calculate inverse CDF
    icdf, success = (None, False)
    tries = 0
    # Try up to 10 times to get a converging inverse CDF, increasing the mesh each time
    chunks = 500
    while tries < 10:
        icdf, success = _get_inverse_cdf(
            mn, p25, p50, p75, mx, mean, stdev, skew, kurt, chunks=chunks
        )
        tries += 1
        chunks += 100
        if success:
            break
    if not success:
        print("Warning: Estimation failed and may be inaccurate")
    # Generate uniform random variables
    npr = numpy.random.random(size=n)
    # Transform them with the inverse CDF
    return icdf(npr)

Example:

print(random_10params(n=1000, mn=39, p25=116, p50=147, p75=186, mx=401, mean=154.1207, stdev=52.3257, skew=.7083, kurt=.5383))
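To judge how close a generated sample comes to the targets, the ten statistics can be recomputed from the sample itself. The sketch below uses only NumPy; describe_sample is a hypothetical helper, not part of the code above. It assumes the same conventions as that code: standard deviation as the square root of numpy.var(), biased skewness, and Fisher (excess) kurtosis, which are the defaults of scipy.stats.mstats.skew and scipy.stats.mstats.kurtosis.

```python
import numpy as np

def describe_sample(x):
    """Recompute the ten target statistics from a sample (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    stdev = np.sqrt(np.var(x))          # population form, as in the code above
    z = (x - mean) / stdev              # standardized values
    p25, p50, p75 = np.percentile(x, [25, 50, 75])
    return {
        "n": x.size,
        "min": x.min(),
        "p25": p25,
        "p50": p50,
        "p75": p75,
        "max": x.max(),
        "mean": mean,
        "stdev": stdev,
        "skew": np.mean(z ** 3),        # biased sample skewness
        "kurt": np.mean(z ** 4) - 3.0,  # Fisher (excess) kurtosis
    }

# Stand-in data: a standard normal sample, so skew and kurt should be near 0
rng = np.random.default_rng(1)
for name, value in describe_sample(rng.normal(size=50_000)).items():
    print(f"{name}: {value:.4f}")
```

Passing the output of random_10params to this helper gives a direct comparison against the requested values. Expect some deviation, since the optimization is approximate and a finite sample only approaches the minimum and maximum.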

One last note: If you have access to the underlying data points, rather than just their statistics, there are other methods you can use to sample from the distribution those data points form. Examples include kernel density estimations, histograms, or regression models (particularly for time series data). See also "Generate random data based on existing data".
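As one illustration of the kernel density approach, a Gaussian kernel density estimate can be sampled by picking data points at random and adding Gaussian noise to each. The sketch below assumes a normal-reference rule-of-thumb bandwidth, which is only one of several possible choices; sample_from_kde is a hypothetical name, and the data array is synthetic stand-in data rather than real observations.

```python
import numpy as np

def sample_from_kde(data, size, rng):
    """Draw new values from a Gaussian KDE fitted to data (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # Rule-of-thumb bandwidth (an assumed choice, not the only option)
    bandwidth = 1.06 * data.std(ddof=1) * n ** (-1 / 5)
    # Pick kernel centers at random, then jitter each with Gaussian noise
    centers = rng.choice(data, size=size, replace=True)
    return centers + rng.normal(0.0, bandwidth, size=size)

rng = np.random.default_rng(42)
data = rng.normal(150.0, 50.0, size=200)   # synthetic stand-in for real data
new_values = sample_from_kde(data, size=5_000, rng=rng)
print(data.mean(), new_values.mean())
```

Unlike the inverse CDF approach above, samples drawn this way can fall slightly outside the observed minimum and maximum, because the Gaussian kernel has unbounded support.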

Source: https://stackoverflow.com/questions/61433438/how-can-python-use-n-min-max-mean-std-25-50-75-skew-kurtosis-to-defi

Tags

python-3.x

random

probability-density