Better (non-linear) binning

后端未结

关注

 2  909

The last question I asked concerned how to bin data by x co-ordinate. The solution was simple and elegant, and I\'m ashamed I didn\'t see it. This question may be harder (

相关标签:

2条回答

天涯浪人

2021-01-15 11:05
It sounds like you want to use bins that vary in size depending on the density of x values. I think you can still use the function HISTC like in the answer to your previous post, but you would just have to give it a different set of edges.

I don't know if this is exactly want you want, but here's one suggestion: instead of splitting the x axis into 70 equally spaced groups, split the sorted x data into 70 equal groups and determine the edge values. I think this code should work:
```
% Start by assuming x and y are vectors of data:

nBins = 70;
nValues = length(x);
[xsort,index] = sort(x);  % Sort x in ascending order
ysort = y(index);         % Sort y the same way as x
binEdges = [xsort(1:ceil(nValues/nBins):nValues) xsort(nValues)+1];

% Bin the data and get the averages as in previous post (using ysort instead of y):

[h,whichBin] = histc(xsort,binEdges);

for i = 1:nBins
    flagBinMembers = (whichBin == i);
    binMembers = ysort(flagBinMembers);
    binMean(i) = mean(binMembers);
end
```
This should give you bins that vary in size with the data density.

UPDATE: Another version...

Here's another idea I came up with after a few comments. With this code, you set a threshold (maxDelta) for the difference between neighboring data points in x. Any x values that differ from their larger neighbor by an amount greater than or equal to maxDelta are forced to be in their own bin (all by their lonesome). You still choose a value for nBins, but the final number of bins will be larger than this value when spread-out points are relegated to their own bins.
```
% Start by assuming x and y are vectors of data:

maxDelta = 10; % Or whatever suits your data set!
nBins = 70;
nValues = length(x);
[xsort,index] = sort(x);  % Sort x in ascending order
ysort = y(index);         % Sort y the same way as x

% Create bin edges:

edgeIndex = false(1,nValues);
edgeIndex(1:ceil(nValues/nBins):nValues) = true;
edgeIndex = edgeIndex | ([0 diff(xsort)] >= maxDelta);
nBins = sum(edgeIndex);
binEdges = [xsort(edgeIndex) xsort(nValues)+1];

% Bin the data and get the y averages:

[h,whichBin] = histc(xsort,binEdges);

for i = 1:nBins
    flagBinMembers = (whichBin == i);
    binMembers = ysort(flagBinMembers);
    binMean(i) = mean(binMembers);
end
```
I tested this on a few small sample sets of data and it seems to do what it's supposed to. Hopefully it will work for your data set too, whatever it contains! =)
0 讨论(0)
发布评论:

提交评论
- 加载中...

悲&欢浪女

2021-01-15 11:07

I've never used matlab but from looking at your previous question I suspect your looking for something along the lines of a Kdtree or a variation.

Clarification: Since there seems to be some confusion about this I think an pseudocode example is in order.

// Some of this shamelessly borrowed from the wikipedia article
function kdtree(points, lower_bound, upper_bound) {
    // lower_bound and upper_bound are the boundaries of your bucket
    if(points is empty) {
        return nil
    }
    // It's a trivial exercise to control the minimum size of a partition as well
    else {
        // Sort the points list and choose the median element
        select median from points.x

        node.location = median;

        node.left = kdtree(select from points where lower_bound < points.x <= median, lower_bound, median);
        node.right = kdtree(select from points where median < points.x <= upper_bound, median, upper_bound);

        return node
    }
}

kdtree(points, -inf, inf)

// or alternatively

kdtree(points, min(points.x), max(points.x))

0 讨论(0)