Better (non-linear) binning

后端 未结 2 909
予麋鹿
予麋鹿 2021-01-15 10:18

The last question I asked concerned how to bin data by x co-ordinate. The solution was simple and elegant, and I\'m ashamed I didn\'t see it. This question may be harder (

相关标签:
2条回答
  • 2021-01-15 11:05

    It sounds like you want to use bins that vary in size depending on the density of x values. I think you can still use the function HISTC like in the answer to your previous post, but you would just have to give it a different set of edges.

    I don't know if this is exactly want you want, but here's one suggestion: instead of splitting the x axis into 70 equally spaced groups, split the sorted x data into 70 equal groups and determine the edge values. I think this code should work:

    % Start by assuming x and y are vectors of data:
    
    nBins = 70;
    nValues = length(x);
    [xsort,index] = sort(x);  % Sort x in ascending order
    ysort = y(index);         % Sort y the same way as x
    binEdges = [xsort(1:ceil(nValues/nBins):nValues) xsort(nValues)+1];
    
    % Bin the data and get the averages as in previous post (using ysort instead of y):
    
    [h,whichBin] = histc(xsort,binEdges);
    
    for i = 1:nBins
        flagBinMembers = (whichBin == i);
        binMembers = ysort(flagBinMembers);
        binMean(i) = mean(binMembers);
    end
    

    This should give you bins that vary in size with the data density.


    UPDATE: Another version...

    Here's another idea I came up with after a few comments. With this code, you set a threshold (maxDelta) for the difference between neighboring data points in x. Any x values that differ from their larger neighbor by an amount greater than or equal to maxDelta are forced to be in their own bin (all by their lonesome). You still choose a value for nBins, but the final number of bins will be larger than this value when spread-out points are relegated to their own bins.

    % Start by assuming x and y are vectors of data:
    
    maxDelta = 10; % Or whatever suits your data set!
    nBins = 70;
    nValues = length(x);
    [xsort,index] = sort(x);  % Sort x in ascending order
    ysort = y(index);         % Sort y the same way as x
    
    % Create bin edges:
    
    edgeIndex = false(1,nValues);
    edgeIndex(1:ceil(nValues/nBins):nValues) = true;
    edgeIndex = edgeIndex | ([0 diff(xsort)] >= maxDelta);
    nBins = sum(edgeIndex);
    binEdges = [xsort(edgeIndex) xsort(nValues)+1];
    
    % Bin the data and get the y averages:
    
    [h,whichBin] = histc(xsort,binEdges);
    
    for i = 1:nBins
        flagBinMembers = (whichBin == i);
        binMembers = ysort(flagBinMembers);
        binMean(i) = mean(binMembers);
    end
    

    I tested this on a few small sample sets of data and it seems to do what it's supposed to. Hopefully it will work for your data set too, whatever it contains! =)

    0 讨论(0)
  • 2021-01-15 11:07

    I've never used matlab but from looking at your previous question I suspect your looking for something along the lines of a Kdtree or a variation.

    Clarification: Since there seems to be some confusion about this I think an pseudocode example is in order.

    // Some of this shamelessly borrowed from the wikipedia article
    function kdtree(points, lower_bound, upper_bound) {
        // lower_bound and upper_bound are the boundaries of your bucket
        if(points is empty) {
            return nil
        }
        // It's a trivial exercise to control the minimum size of a partition as well
        else {
            // Sort the points list and choose the median element
            select median from points.x
    
            node.location = median;
    
            node.left = kdtree(select from points where lower_bound < points.x <= median, lower_bound, median);
            node.right = kdtree(select from points where median < points.x <= upper_bound, median, upper_bound);
    
            return node
        }
    }
    
    kdtree(points, -inf, inf)
    
    // or alternatively
    
    kdtree(points, min(points.x), max(points.x))
    
    0 讨论(0)
提交回复
热议问题