I have an array of floats like this:
[1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200]
Now, I want to partition the array into groups/clusters, so that values close to each other end up in the same group.
Clustering usually assumes multidimensional data.
If you have one-dimensional data, sort it, and then use either kernel density estimation, or just scan for the largest gaps.
In 1 dimension, the problem gets substantially easier, because the data can be sorted. If you use a clustering algorithm, it will unfortunately not exploit this, so use a 1-dimensional method instead!
Consider finding the largest gap in 1-dimensional data. It's trivial: sort (n log n, but in practice about as fast as it gets), then scan adjacent pairs for the largest difference (sketched below).
Now try defining "largest gap" in 2 dimensions, and an efficient algorithm to locate it...
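To make that concrete, here is a minimal sketch of the largest-gap scan (my own illustration, not code from this answer); it only finds the single biggest gap, and a real split rule would likely cut at every gap above some threshold:

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> data{1.91, 2.87, 3.61, 10.91, 11.91, 12.82,
                             100.73, 100.71, 101.89, 200};
    // sorting puts neighbouring values next to each other
    std::sort(data.begin(), data.end());

    // scan adjacent pairs for the largest difference
    std::size_t split = 1;
    double largest_gap = 0.0;
    for (std::size_t i = 0; i + 1 < data.size(); ++i) {
        double gap = data[i + 1] - data[i];
        if (gap > largest_gap) {
            largest_gap = gap;
            split = i + 1;   // first element of the second group
        }
    }

    std::cout << "largest gap " << largest_gap
              << " splits the data before " << data[split] << "\n";
}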
I think I'd sort the data (if it's not already), then take adjacent differences. Divide each difference by one of the two values it separates (the demo below uses the current, larger one) to turn it into a relative change. Set a threshold, and when the change exceeds it, start a new "cluster".
Edit: Quick demo code in C++:
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <numeric>
#include <functional>

int main() {
    std::vector<double> data{
        1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200
    };

    // sort the input data
    std::sort(data.begin(), data.end());

    // find the difference between each number and its predecessor
    // (std::adjacent_difference copies the first element unchanged)
    std::vector<double> diffs;
    std::adjacent_difference(data.begin(), data.end(), std::back_inserter(diffs));

    // convert each difference to a relative change by dividing it by the
    // current value
    std::transform(diffs.begin(), diffs.end(), data.begin(), diffs.begin(),
                   std::divides<double>());

    // print out the results
    for (std::size_t i = 0; i < data.size(); i++) {
        // if a relative change exceeds 40%, start a new group
        // (skip i == 0: diffs[0] is just data[0] / data[0])
        if (i > 0 && diffs[i] > 0.4)
            std::cout << "\n";
        // print out an item:
        std::cout << data[i] << "\t";
    }
    return 0;
}
Result:
1.91 2.87 3.61
10.91 11.91 12.82
100.71 100.73 101.89
200
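If you want the groups in a container rather than printed, the same threshold test can build them directly (a variation of my own on the demo above, using the same assumed 40% cut-off):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> data{1.91, 2.87, 3.61, 10.91, 11.91, 12.82,
                             100.73, 100.71, 101.89, 200};
    std::sort(data.begin(), data.end());

    std::vector<std::vector<double>> groups;
    for (std::size_t i = 0; i < data.size(); ++i) {
        // start a new group at the first element, or when the relative
        // change from the previous element exceeds 40%
        if (i == 0 || (data[i] - data[i - 1]) / data[i] > 0.4)
            groups.emplace_back();
        groups.back().push_back(data[i]);
    }

    std::cout << "found " << groups.size() << " groups\n";   // 4 for this data
}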