Modeling distribution of performance measurements

名媛妹妹 2021-01-31 19:46

How would you mathematically model the distribution of repeated real-life performance measurements - "real life" meaning you are not just looping over the code in question, but …

6 Answers
  • 2021-01-31 20:21

    Try the gamma distribution: http://en.wikipedia.org/wiki/Gamma_distribution

    From Wikipedia:

    The gamma distribution is frequently a probability model for waiting times; for instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.

  • 2021-01-31 20:26

    The problem you describe is called "distribution fitting" and has nothing to do with performance measurements per se; it is the generic problem of fitting a suitable distribution to any gathered/measured data sample.

    The standard process is something like this:

    1. Guess the best distribution.
    2. Run hypothesis tests to check how well it describes the gathered data.
    3. Repeat steps 1-2 if the fit is not good enough.

    You can find an interesting article describing how this can be done with the open-source R software system here. The function fitdistr may be especially useful to you.
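
    As an illustration of that fit-and-test loop, here is a minimal Python sketch using scipy.stats (the answer itself points to R's fitdistr; scipy is only a stand-in, and the sample data, candidate distributions, and seed below are illustrative assumptions):

    ```python
    # Fit a few candidate distributions and test each with a Kolmogorov-Smirnov test.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    measurements = rng.lognormal(mean=3.0, sigma=0.4, size=500)  # stand-in for real timings

    candidates = {"gamma": stats.gamma, "lognorm": stats.lognorm, "norm": stats.norm}
    for name, dist in candidates.items():
        params = dist.fit(measurements)                                   # step 1: guess + fit
        ks_stat, p_value = stats.kstest(measurements, name, args=params)  # step 2: test the fit
        print(f"{name:8s} KS={ks_stat:.3f}  p={p_value:.3f}")
    # Step 3: if every candidate fits poorly (tiny p-values), try other families.
    ```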

  • 2021-01-31 20:31

    Often when you have a random value that can only be positive, a log-normal distribution is a good way to model it. That is, you take the log of each measurement and assume that it is normally distributed.

    If you want, you can consider it to have multiple humps, i.e. to be a mixture of two normals with different means. Those are a bit tricky to estimate the parameters of, because you may have to estimate, for each measurement, its probability of belonging to each hump. That may be more than you want to bother with.

    Log-normal distributions are very convenient and well-behaved. For example, you don't deal with its average, you deal with its geometric mean, which is the same as its median.
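
    A tiny illustration of that point on synthetic log-normal data (the parameters and seed are arbitrary assumptions): the geometric mean and the median agree, while the arithmetic mean is pulled up by the right tail.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.lognormal(mean=3.0, sigma=0.5, size=100_000)  # synthetic positive measurements

    geometric_mean = np.exp(np.mean(np.log(x)))
    print("arithmetic mean:", np.mean(x))      # inflated by the long right tail
    print("geometric mean: ", geometric_mean)  # ~exp(3)
    print("median:         ", np.median(x))    # ~exp(3), matches the geometric mean
    ```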

    BTW, in pharmacometric modeling, log-normal distributions are ubiquitous, modeling such things as blood volume, absorption and elimination rates, body mass, etc.

    ADDED: If you want what you call a floating distribution, that's called an empirical or non-parametric distribution. To model that, typically you save the measurements in a sorted array. Then it's easy to pick off the percentiles. For example the median is the "middle number". If you have too many measurements to save, you can go to some kind of binning after you have enough measurements to get the general shape.
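
    A minimal sketch of that empirical approach (synthetic data; the helper below is hypothetical, not from the answer): keep the measurements in a sorted array and read percentiles straight off it.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    samples = np.sort(rng.lognormal(mean=3.0, sigma=0.4, size=1000))  # sorted synthetic timings

    def percentile(sorted_samples, q):
        """Pick the q-th percentile (0-100) directly from the sorted array."""
        idx = int(round(q / 100.0 * (len(sorted_samples) - 1)))
        return sorted_samples[idx]

    print("median:", percentile(samples, 50))   # the "middle number"
    print("p95:   ", percentile(samples, 95))
    print("p99:   ", percentile(samples, 99))
    ```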

    ADDED: There's an easy way to tell if a distribution is normal (or log-normal). Take the logs of the measurements and put them in a sorted array. Then generate a QQ plot (quantile-quantile). To do that, generate as many normal random numbers as you have samples, and sort them. Then just plot the points, where X is the normal distribution point, and Y is the log-sample point. The results should be a straight line. (A really simple way to generate a normal random number is to just add together 12 uniform random numbers in the range +/- 0.5.)
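
    Sketching that normality check in Python (matplotlib is assumed to be available, and the data is synthetic): sort the log-measurements, generate an equal number of sorted approximately normal deviates via the sum-of-12-uniforms trick, and plot one against the other.

    ```python
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    measurements = rng.lognormal(mean=3.0, sigma=0.4, size=300)  # synthetic timings

    log_samples = np.sort(np.log(measurements))
    # Approximately normal deviates: sum of 12 uniforms in [-0.5, 0.5].
    normal_ref = np.sort(rng.uniform(-0.5, 0.5, size=(len(log_samples), 12)).sum(axis=1))

    plt.scatter(normal_ref, log_samples, s=10)
    plt.xlabel("normal quantiles")
    plt.ylabel("log-measurement quantiles")
    plt.title("QQ plot: a roughly straight line supports log-normality")
    plt.show()
    ```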

  • 2021-01-31 20:37

    The standard way to model randomized arrival times in performance modelling is an exponential distribution for the inter-arrival times; the number of arrivals in a fixed interval is then Poisson-distributed (and the sum of several exponential inter-arrival times follows an Erlang/gamma distribution).
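
    A minimal sketch of that arrival-time model (the rate and horizon below are arbitrary assumptions): draw exponential inter-arrival times and accumulate them; the count of arrivals in the window is then Poisson-distributed.

    ```python
    import numpy as np

    rng = np.random.default_rng(4)
    rate = 50.0      # assumed arrivals per second
    horizon = 10.0   # simulate a 10-second window

    # Exponential inter-arrival times with mean 1/rate; draw more than enough of them.
    inter_arrivals = rng.exponential(scale=1.0 / rate, size=int(rate * horizon * 2))
    arrival_times = np.cumsum(inter_arrivals)
    arrival_times = arrival_times[arrival_times < horizon]

    print(len(arrival_times), "arrivals; expected about", rate * horizon)
    ```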

  • 2021-01-31 20:39

    Not exactly answering your question, but still relevant: Mor Harchol-Balter did a very nice analysis of the size of jobs submitted to a scheduler, The effect of heavy-tailed job size distributions on computer systems design (1999). She found that the sizes of jobs submitted to her distributed task assignment system followed a power-law distribution, which meant that certain pieces of conventional wisdom she had assumed in the construction of her task assignment system, most importantly that jobs should be well load balanced, had awful consequences for submitters of jobs. She has done good follow-up work on this issue.

    The broader point is, you need to ask such questions as:

    1. What happens if reasonable-seeming assumptions about the distribution of performance measurements, such as the assumption that they follow a normal distribution, break down?
    2. Are the data sets I'm looking at really representative of the problem I'm trying to solve?
  • 2021-01-31 20:43

    In addition to the answers already given, consider empirical distributions. I have had good experience using empirical distributions for performance analysis of several distributed systems. The idea is very straightforward: build a histogram of the performance measurements, discretized to a chosen accuracy. Once you have the histogram you can do several useful things:

    • calculate the probability of any given value (you are bounded only by the chosen accuracy);
    • build PDF and CDF functions for the performance measurements;
    • generate a sequence of response times according to the distribution, which is very useful for performance modeling (see the sketch below).
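
    A minimal sketch of that histogram-based workflow (the bin width, synthetic data, and seed are assumptions for illustration): build the histogram, turn it into a PDF and CDF, and use inverse-transform sampling on the CDF to generate new response times.

    ```python
    import numpy as np

    rng = np.random.default_rng(5)
    measurements = rng.gamma(shape=2.0, scale=10.0, size=5000)  # synthetic response times (ms)

    # Discretize to a chosen accuracy (1 ms bins here).
    bin_width = 1.0
    edges = np.arange(0.0, measurements.max() + bin_width, bin_width)
    counts, edges = np.histogram(measurements, bins=edges)

    pdf = counts / counts.sum()   # probability of each bin, bounded by the chosen accuracy
    cdf = np.cumsum(pdf)          # empirical CDF over the bins

    # Generate new response times according to the empirical distribution.
    u = rng.uniform(size=10)
    idx = np.searchsorted(cdf, u)
    generated = edges[idx] + rng.uniform(0.0, bin_width, size=len(u))
    print(generated)
    ```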