How would you group/cluster these three areas in arrays in python?

挽巷 2021-02-01 07:42

So you have an array

1
2
3
60
70
80
100
220
230
250


5 answers
  •  天涯浪人
    2021-02-01 07:47

    You can solve this in various ways. An obvious one, once the keyword "clustering" comes up, is k-means (see other replies).

    However, you might first want to understand more closely what you are actually trying to do, instead of just throwing a random function at your data.

    As far as I can tell from your question, you have a number of 1-dimensional values, and you want to separate them into an unknown number of groups, right? Well, k-means might do the trick, but in fact you could just look for the k largest differences in your sorted data set. That is, for every index i > 0, compute a[i] - a[i-1], and choose the k indexes where this difference is largest; cutting there gives k + 1 groups. Most likely, the result will be better, and computed faster, than with k-means.

    In python code:

    k = 2  # number of split points; yields k + 1 groups
    a = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
    a.sort()
    # Collect (gap, index) pairs; a *heap* would be faster than a full sort
    b = []
    for i in range(1, len(a)):
        b.append((a[i] - a[i-1], i))
    b.sort()
    # b now is [... (20, 6), (20, 9), (57, 3), (120, 7)]
    # and the last ones are the best split points.
    b = sorted(p[1] for p in b[-k:])
    # b now is: [3, 7]
    b.insert(0, 0)
    b.append(len(a))
    for i in range(1, len(b)):
        print(a[b[i-1]:b[i]], end=" ")
    # Prints [1, 2, 3] [60, 70, 80, 100] [220, 230, 250]
    

    (This can btw. be seen as a simple single-link clustering!)
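
The heap mentioned in the code comment could look roughly like the sketch below. It uses `heapq.nlargest` from the standard library to pick the k largest gaps without sorting all of them. The function name `split_at_largest_gaps` is made up for this illustration and is not part of the original answer.

```python
import heapq


def split_at_largest_gaps(values, k):
    """Cut the sorted values at the k largest gaps, giving k + 1 groups.

    Hypothetical helper for illustration only.
    """
    data = sorted(values)
    # (gap, index) pairs; index i marks a cut between data[i-1] and data[i]
    gaps = ((data[i] - data[i - 1], i) for i in range(1, len(data)))
    # nlargest keeps only k candidates at a time instead of sorting every gap
    largest = heapq.nlargest(k, gaps)
    cuts = sorted(i for _, i in largest)
    bounds = [0] + cuts + [len(data)]
    return [data[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


a = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
print(split_at_largest_gaps(a, 2))
```

For an input this small the difference is negligible; the heap only starts to matter when the array is long and k is small.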

    A more advanced method, which actually gets rid of the parameter k, computes the mean and standard deviation of the differences a[i] - a[i-1], and splits wherever a difference is larger than, say, mean+2*stddev. Still, this is a rather crude heuristic. Another option would be to actually assume a value distribution such as k normal distributions, and then use e.g. Levenberg-Marquardt to fit the distributions to your data.
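
A minimal sketch of the mean-plus-stddev heuristic, assuming the population standard deviation (`statistics.pstdev`) and a tunable multiplier. The names `split_by_gap_threshold` and `factor` are hypothetical and chosen only for this example.

```python
from statistics import mean, pstdev


def split_by_gap_threshold(values, factor=2.0):
    """Split sorted values wherever a gap exceeds mean + factor * stddev of all gaps.

    Hypothetical helper; pstdev (population stddev) is an assumption,
    the answer does not say which variant to use.
    """
    data = sorted(values)
    if len(data) < 2:
        return [data] if data else []
    gaps = [data[i] - data[i - 1] for i in range(1, len(data))]
    threshold = mean(gaps) + factor * pstdev(gaps)
    groups = [[data[0]]]
    for gap, value in zip(gaps, data[1:]):
        if gap > threshold:
            groups.append([value])      # gap is unusually large: start a new group
        else:
            groups[-1].append(value)    # otherwise extend the current group
    return groups


a = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
print(split_by_gap_threshold(a))              # factor 2
print(split_by_gap_threshold(a, factor=0.5))  # a more permissive threshold
```

On the sample array the factor of 2 only cuts at the gap of 120 and so yields two groups; the gap of 57 is only cut with a smaller factor such as 0.5. That illustrates how crude the heuristic is, and why it matters to define first what should count as a cluster.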

    But is that really what you want to do?

    First try to define what should be a cluster, and what not. The second part is much more important.
