Minimize the sum of errors of representative integers

前端 未结 8 2329
滥情空心
滥情空心 2021-02-07 20:15

Given n integers between [0,10000] as D1,D2...,Dn, where there may be duplicates, and n can be huge:

I want to find k distinct represent

8条回答
  •  醉酒成梦
    2021-02-07 20:47

    Although with help from the community you have managed to state your problem in a form which is understandable mathematically, you still do not supply enough information to enable me (or anyone else) to give you a definitive answer (I would have posted this as a comment, but for some reason I don't see the "add comment" option is available to me). In order to give a good answer to this problem, we need to know the relative sizes of n and k versus each other and 10000 (or the expected maximum of the Di if it isn't 10000), and whether it is critical that you attain the exact minimum (even if this requires an exorbitant amount of time for the calculation) or if a close approximation would be OK, also (and if so, how close do you need to get). In addition, in order to know what algorithm runs in minimum time, we need to have an understanding about what kind of hardware is going to run the algorithm (i.e., do we have m CPU cores to run on in parallel and what is the size of m relative to k).

    Another important piece of information is whether this problem will be solved only once, or it must be solved many times but there exists some connection between the distributions of the integers Di from one problem to the next (e.g., the integers Di are all random samples from a particular, unchanging probability distribution, or perhaps each successive problem has as its input an ever-increasing set which is the set from the previous problem plus an extra s integers).

    No reasonable algorithm for your problem should run in time which depends in a way greater than linear in n, since building a histogram of the n integers Di requires O(n) time, and the answer to the optimization problem itself is only dependent on the histogram of the integers and not on their ordering. No algorithm can run in time less than O(n), since that is the size of the input of the problem.

    A brute force search over all of the possibilities requires, (assuming that at least one of the Di is 0 and another is 10000), for small k, say k < 10, approximately O(10000k-2) time, so if log10(n) >> 4(k-2), this is the optimal algorithm (since in this case the time for the brute force search is insignificant compared to the time to read the input). It is also interesting to note that if k is very close to 10000, then a brute force search only requires O(1000010002-k) (because we can search instead over the integers which are not used as representative integers).

    If you update the definition of the problem with more information, I will attempt to edit my answer in turn.

提交回复
热议问题