Minimize the sum of errors of representative integers

走了就别回头了 2021-02-07 20:13

Given n integers in the range [0, 10000], D₁, D₂, ..., D_n, where there may be duplicates and n can be huge:

I want to find k distinct represent

8 answers

小鲜肉 (original poster)

2021-02-07 20:34
Now that the question is clarified, observe that the R_i divide the D_x into k-1 intervals: [R₁,R₂), [R₂,R₃), ..., [R_{k-1},R_k). Every D_x belongs to exactly one of those intervals. Let q_i be the number of D_x in the interval [R_i,R_{i+1}), and let s_i be the sum of those D_x. Then each error(R_i) expression is the sum of q_i terms and evaluates to s_i - q_i·R_i.
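Written out, the step looks like this. It assumes error(R_i) means the sum of D_x - R_i over the data in that interval, which is what the s_i - q_i·R_i form implies; the truncated question above does not show the definition.

```latex
\mathrm{error}(R_i)
  = \sum_{D_x \in [R_i,\, R_{i+1})} (D_x - R_i)
  = \Bigl(\sum_{D_x \in [R_i,\, R_{i+1})} D_x\Bigr) - q_i R_i
  = s_i - q_i R_i
```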

Summing that over all i, we get a total error of S - sum(q_i·R_i), where S is the sum of all the D_x. S is fixed by the data, so the problem is to choose the R_i to maximize sum(q_i·R_i). Remember that each q_i is the number of original data values that are at least as large as R_i but smaller than R_{i+1}.
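As a quick sanity check of that identity, here is a minimal Python sketch. The function names and the small dataset are made up for illustration, and it assumes the last entry of `reps` is an upper limit larger than every data value and the first entry is no larger than the smallest one.

```python
from bisect import bisect_right

def direct_error(data, reps):
    """Sum of (d - R_i), where R_i is the largest representative <= d."""
    return sum(d - reps[bisect_right(reps, d) - 1] for d in data)

def shortcut_error(data, reps):
    """Same total computed as S - sum(q_i * R_i)."""
    q = [0] * (len(reps) - 1)          # q[i] counts data in [reps[i], reps[i+1])
    for d in data:
        q[bisect_right(reps, d) - 1] += 1
    return sum(data) - sum(q_i * r for q_i, r in zip(q, reps))

# Hypothetical data; 12 is only the upper limit, not a real representative.
data = [2, 2, 5, 9, 9, 11]
reps = [2, 9, 12]
print(direct_error(data, reps), shortcut_error(data, reps))  # 5 5
```

Both functions agree here, which is all the sketch is meant to show; it is not an optimizer.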

Any global maximum must be a local maximum, so imagine increasing or decreasing one of the R_i. If R_i is not one of the original data values, then we can increase it up to the next data value without changing any of the q_i, which improves our target function. So an optimal solution has each R_i (except the limiting last one) equal to one of the data values. I got a bit bogged down in the math after that, but a sensible approach seems to be to pick the initial R_i as every (n/k)th data value (simple percentiles), then iteratively check whether moving an R_i to the previous or next data value improves the score and thus decreases the error. (The q_i·R_i form seems easier to work with, since you can read the data, count repetitions, and update q_i and R_i by looking at only a single data/count point. You only need to store an array of 10,001 data counts, one per possible value in [0, 10000], no matter how huge the data is.)
```
data:   1  3  7  8 14 30
count:  1  2  1  1  3  1     sum(data) = 94

initial R: 1  3  8  14  31
initial Q: 1  3  1   4        sum(QR)  = 74 (hence error = 20)
```
In this example, we could try changing the 3 or the 8 to a 7. For example, if we increase the 3 to 7, the two 3's in the data move into the first interval, so the first two Q's become 1+2 and 3-2; it turns out this leaves sum(QR) unchanged at 74 (1·1 + 3·3 = 3·1 + 1·7 = 10), so it is not an improvement. I'm sure there are smarter patterns to detect which changes in the QR table are viable, but this seems workable.
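The neighbour-move idea can be sketched in Python as follows. This is only one possible reading of the approach described above, not a tuned implementation: `score` and `local_search` are hypothetical names, the first representative and the final upper limit are held fixed, and every other representative is assumed to be a data value.

```python
def score(counts, reps):
    """sum(q_i * R_i); counts maps value -> multiplicity, reps[-1] is an upper limit."""
    total = 0
    for lo, hi in zip(reps, reps[1:]):
        q = sum(c for v, c in counts.items() if lo <= v < hi)
        total += q * lo
    return total

def local_search(counts, reps):
    """Move interior representatives to adjacent data values while the score improves."""
    values = sorted(counts)
    reps, best = list(reps), score(counts, reps)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(reps) - 1):        # keep R_1 and the limit fixed
            pos = values.index(reps[i])          # assumes reps[i] is a data value
            for nxt in (pos - 1, pos + 1):
                if not 0 <= nxt < len(values):
                    continue
                cand = reps[:i] + [values[nxt]] + reps[i + 1:]
                if not cand[i - 1] < cand[i] < cand[i + 1]:
                    continue                     # representatives must stay distinct
                s = score(counts, cand)
                if s > best:
                    reps, best, improved = cand, s, True
                    break
    return reps, best

counts = {1: 1, 3: 2, 7: 1, 8: 1, 14: 3, 30: 1}  # the example data above
print(score(counts, [1, 3, 8, 14, 31]))          # 74
print(local_search(counts, [1, 3, 8, 14, 31]))   # ([1, 3, 7, 14, 31], 77)
```

On the example data this moves the 8 down to 7, raising sum(QR) from 74 to 77 and lowering the error from 20 to 17. It stops at a local maximum only: for the same data, R = 1 3 14 30 31 scores 85, which single-step moves from this starting point do not reach.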