Dumping clustering results with vector names

I have created my Vectors as described in this question and have run mahout kmeans on the data.

Since I'm using Mahout 0.7, the clusterdump command didn't work as described in Mahout in Action, but I got it to work like this:

export HADOOP_CLASSPATH=/path/to/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar:/path/to/mahout-distribution-0.7/integration/target/mahout-integration-0.7.jar
hadoop jar core/target/mahout-core-0.7-job.jar org.apache.mahout.utils.clustering.ClusterDumper -i /clustering/out/clusters-20-final -o textout -of TEXT

and I am getting lines like this one:

VL-1383471{n=192 c=[0.180, -0.087, 0.281, 0.512, 0.678, 1.833, 2.613, 0.313, 0.226, 1.023, 0.229, -0.104, -0.461, -0.553, -0.318, 0.315, 0.658, 0.245, 0.635, 0.220, 0.660, 0.193, 0.277, -0.182, 0.497, 0.346, 0.658, 0.660, 0.191, 0.660, 0.636, 0.018, 0.519, 0.335, 0.535, 0.008, -0.028, 0.461, 0.229, 0.287, 0.619, 0.509, 0.566, 0.389, -0.075, -0.180, -0.461, 0.381, -0.108, 0.126, -0.728] r=[0.983, 0.890, 0.384, 0.823, 0.702, 0.000, 0.000, 1.132, 0.605, 0.979, 0.897, 0.862, 0.438, 0.546, 0.390, 0.171, 0.257, 0.234, 0.251, 0.106, 0.257, 0.093, 0.929, 0.077, 0.204, 0.218, 0.257, 0.257, 0.258, 0.257, 0.249, 0.112, 0.217, 0.157, 0.284, 0.197, 0.228, 0.229, 0.323, 0.401, 0.248, 0.217, 0.269, 1.002, 0.819, 0.706, 0.412, 0.964, 0.787, 0.872, 0.172]}

which is not yet useful to me, since I need the names of my vectors in each cluster. I saw that for text documents a dictionary file is created. How would I create a dictionary for my data?

Also, using -of CSV gives me an empty file. Am I doing something wrong?

I also tried reading the clusters-20-final/part-m-00000 file directly, as is done in listing 7.2 of Mahout in Action. It turns out that the file contains ClusterWritable values rather than WeightedVectorWritable, so I can get the Cluster instance from it but none of the Vectors that belong to the cluster.

A bit late, but this might help someone somewhere, sometime.

When running

KMeansDriver.run(input, clustersIn, outputPath, measure, convergenceDelta, maxIterations, true, 0.0, false);

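one of the outputs was a directory called clusteredPoints. (As far as I can tell, the seventh argument, true, is the runClustering flag, and the directory is only written when it is set.) The part file in that directory lists every clustered vector together with the cluster it was assigned to. This means that something like this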

    IntWritable key = new IntWritable();                          // cluster id
    WeightedVectorWritable value = new WeightedVectorWritable();  // one clustered vector

    Path clusteredPoints = new Path(output + "/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000");

    FileSystem fs = FileSystem.get(clusteredPoints.toUri(), new Configuration());

    // try-with-resources closes the reader; any exception simply propagates
    try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, clusteredPoints, fs.getConf())) {
        while (reader.next(key, value)) {
            // The cast only succeeds if the input vectors were written as NamedVector
            String name = ((NamedVector) value.getVector()).getName();
            // Do something useful here with key.get() and name
        }
    }

seems to do the trick. Using something like this, I was able to get a good sense of what was clustered where when running my tests with k-means clustering and Mahout.
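As one possible way to fill in the "do something useful" step, the sketch below collects vector names per cluster id. It is plain Java with no Mahout dependency; the class name ClusterMembers and the sample ids and names are made up for illustration, and in the loop above you would call add(key.get(), name) once per record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: groups vector names by cluster id.
final class ClusterMembers {
    // TreeMap keeps the cluster ids in ascending order when iterating.
    private final Map<Integer, List<String>> byCluster = new TreeMap<>();

    // Record that the vector called vectorName was assigned to clusterId.
    public void add(int clusterId, String vectorName) {
        byCluster.computeIfAbsent(clusterId, id -> new ArrayList<>()).add(vectorName);
    }

    public Map<Integer, List<String>> asMap() {
        return byCluster;
    }

    public static void main(String[] args) {
        ClusterMembers members = new ClusterMembers();
        // Made-up (cluster id, vector name) pairs standing in for the
        // records read from clusteredPoints.
        members.add(7, "doc-a");
        members.add(3, "doc-b");
        members.add(7, "doc-c");
        members.asMap().forEach((id, names) -> System.out.println(id + ": " + names));
    }
}
```

Keeping the grouping in memory is fine for test-sized data; for large outputs it would probably be better to write each (cluster id, name) pair out as it is read.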

I was using Mahout 0.8 when I did this.

(a really late answer, but since I just spent a day figuring this out thought I would share it)

What you are missing is a dictionary that maps each vector dimension's name to its index. clusterdump uses this dictionary to print the names of the dimensions instead of bare positions.

When you run clusterdump, you can specify two additional flags:

-d: dictionary file
-dt: type of the dictionary file (text|sequencefile)

Here is a sample invocation:

mahout clusterdump -i clusteringExperiment/exp1/initialCentroids/clusters-0-final -d clusteringExperiment/dictionary/vectorDimensions -dt sequencefile

and your output will look something like:

VL-0{n=185 c=[A:0.006, G:0.550, M:0.011, O:0.026, S:0.000, T:0.072, U:0.096, V:0.010] r=[A:0.029, G:0.176, M:0.043, O:0.054, S:0.001, T:0.098, U:0.113, V:0.035]}

Note that the dictionary is a simple key-value file, where the key is the category name (a string, i.e. the name of the dimension) and the value is its numerical index.
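To make concrete what the dictionary contributes, here is a small stand-alone sketch of the labelling step: it inverts a name-to-index map and prefixes each centroid component with its dimension name, in the style of the output shown earlier. It does not use Mahout or read a real dictionary file; the class DimensionLabels, its methods, and the three sample entries are assumptions for illustration only, and it assumes the indices run from 0 to size-1 without gaps.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration, not Mahout code.
final class DimensionLabels {
    // Inverts a name -> index dictionary so that position i holds the name
    // of dimension i. Assumes indices are exactly 0..size-1.
    public static String[] namesByIndex(Map<String, Integer> dictionary) {
        String[] names = new String[dictionary.size()];
        for (Map.Entry<String, Integer> entry : dictionary.entrySet()) {
            names[entry.getValue()] = entry.getKey();
        }
        return names;
    }

    // Formats a vector as [name:value, ...] with three decimal places.
    public static String label(double[] values, String[] names) {
        StringJoiner out = new StringJoiner(", ", "[", "]");
        for (int i = 0; i < values.length; i++) {
            out.add(names[i] + ":" + String.format(Locale.ROOT, "%.3f", values[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up dictionary: key is the dimension name, value is its index.
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        dictionary.put("A", 0);
        dictionary.put("G", 1);
        dictionary.put("M", 2);
        double[] centroid = {0.006, 0.550, 0.011};
        System.out.println(label(centroid, namesByIndex(dictionary)));
        // prints [A:0.006, G:0.550, M:0.011]
    }
}
```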

Source: https://stackoverflow.com/questions/14476706/dumping-clustering-result-with-vectors-names
