Minimize the sum of errors of representative integers

前端 未结 8 703
走了就别回头了
走了就别回头了 2021-02-07 20:13

Given n integers between [0,10000] as D1,D2...,Dn, where there may be duplicates, and n can be huge:

I want to find k distinct represent

8条回答
  •  小鲜肉
    小鲜肉 (楼主)
    2021-02-07 20:34

    Now the question is clarified, we observe the Ri divide the Dx into k-1 intervals, [R1,R2), [R2,R3), ... [Rk-1,Rk). Every Dx belongs to exactly one of those intervals. Let qi be the number of Dx in the interval [Ri,Ri+1), and let si be the sum of those Dx. Then each error(Ri) expression is the sum of qi terms and evaluates to si - qiRi.

    Summing that over all i, we get a total error of S - sum(qiRi), where S is the sum of all the Dx. So the problem is to choose the Ri to maximize sum(qiRi). Remember each qi is the number of original data at least as large as Ri, but smaller than the next one.

    Any global maximum must be a local maximum; so we imagine increasing or decreasing one of the Ri. If Ri is not one of the original data values, then we can increase it without changing any of the qi and improve our target function. So an optimal solution has each Ri (except the limiting last one) as one of the data values. I got a bit bogged down in math after that, but it seems a sensible approach is to pick the initial Ri as every (n/k)th data value (simple percentiles), then iteratively seeing if moving the R_i to the previous or next value improves the score and thus decreases the error. (The qiRi seems easier to work with, since you can read the data and count repetitions and update qi, Ri by only looking at a single data/count point. You only need to store an array of 10,000 data counts, no matter how huge the data).

    data:   1  3  7  8 14 30
    count:  1  2  1  1  3  1     sum(data) = 94
    
    initial R: 1  3  8  14  31
    initial Q: 1  3  1   4        sum(QR)  = 74 (hence error = 20)
    

    In this example, we could try changing the 3 or the 8 to a 7, For example if we increase the 3 to 7, then we see there are 2 3's in the initial data, so the first two Q's become 1+2, 3-2 - it turns out this decreases sum(QR)). I'm sure there are smarter patterns to detect what changes in the QR table are viable, but this seems workable.

提交回复
热议问题