Question
I am trying to obtain document clustering using Weka. The process is part of a larger pipeline, and I really can't afford to write out ARFF files. I have all the documents and the bag of words in each document as a `Map<String, Multiset<String>>` structure, where the keys are document names and the `Multiset<String>` values are the bags of words in the documents. I have two questions, really:
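For concreteness, the map is populated along these lines (illustrative document names and terms; `HashMultiset` is Guava's):

```java
// Uses Guava's HashMultiset/Multiset (com.google.common.collect),
// plus java.util.Arrays and java.util.TreeMap.
TreeMap<String, Multiset<String>> docToTermsMap = new TreeMap<String, Multiset<String>>();
docToTermsMap.put("doc1.txt", HashMultiset.create(Arrays.asList("apple", "banana", "apple")));
docToTermsMap.put("doc2.txt", HashMultiset.create(Arrays.asList("banana", "cherry")));
```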
(1) Current approach ends up clustering terms, not documents:
```java
public final Instances buildDocumentInstances(TreeMap<String, Multiset<String>> docToTermsMap, String encoding) throws IOException {
    // TermToDocumentFrequencyMap and TermToIndexMap are fields of the enclosing class:
    // term -> document frequency, and term -> attribute index, respectively.
    int dimension = TermToDocumentFrequencyMap.navigableKeySet().size();

    // One numeric attribute per term in the vocabulary.
    FastVector attributes = new FastVector(dimension);
    for (String s : TermToDocumentFrequencyMap.navigableKeySet()) {
        attributes.addElement(new Attribute(s));
    }

    // One instance per document; new Instance(n) starts with all n values missing.
    List<Instance> instances = Lists.newArrayList();
    for (Map.Entry<String, Multiset<String>> entry : docToTermsMap.entrySet()) {
        Instance instance = new Instance(dimension);
        for (Multiset.Entry<String> ms_entry : entry.getValue().entrySet()) {
            Integer index = TermToIndexMap.get(ms_entry.getElement());
            if (index != null) {
                switch (encoding) {
                    case "tf":
                        instance.setValue(index, ms_entry.getCount());
                        break;
                    case "binary":
                        instance.setValue(index, ms_entry.getCount() > 0 ? 1 : 0);
                        break;
                    case "tfidf":
                        double tf = ms_entry.getCount();
                        double df = TermToDocumentFrequencyMap.get(ms_entry.getElement());
                        double idf = Math.log(docToTermsMap.size() / df); // idf = log(N_docs / df)
                        instance.setValue(index, tf * idf);
                        break;
                }
            }
        }
        instances.add(instance);
    }

    Instances dataset = new Instances("My Dataset Name", attributes, instances.size());
    for (Instance instance : instances) {
        dataset.add(instance);
    }
    return dataset;
}
```
I am trying to create individual `Instance` objects, and then create a dataset by adding them to an `Instances` object. Each instance is a document vector (with 0/1, tf, or tf-idf encoding), and each word is a separate attribute. But when I run `SimpleKMeans#buildClusterer`, the output shows that it's clustering the words, not the documents. I am clearly doing something horribly wrong, but I can't figure out what that mistake is.
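For reference, this is roughly how I drive the clusterer on the returned dataset (a minimal sketch; the cluster count is an arbitrary illustrative value):

```java
// Uses weka.clusterers.SimpleKMeans.
SimpleKMeans kmeans = new SimpleKMeans();
kmeans.setNumClusters(5); // arbitrary illustrative value
kmeans.buildClusterer(dataset);
// My expectation is one cluster assignment per row, i.e. per document:
for (int i = 0; i < dataset.numInstances(); i++) {
    System.out.println("doc " + i + " -> cluster " + kmeans.clusterInstance(dataset.instance(i)));
}
```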
(2) How to use StringToWordVector in this scenario?
Everywhere I have looked, people suggest using `weka.filters.unsupervised.attribute.StringToWordVector` to cluster documents. But I can't find any examples of using it in a way that allows me to take the words from my document → bag-of-words structure. [Note: in my case it is a `Map<String, Multiset<String>>`, but that is not a rigid requirement. I can transform it into some other data structure if `StringToWordVector` requires it.]
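The closest I have come on my own is to flatten each bag back into a whitespace-separated string, hold it in a single string attribute, and run the filter in memory. Here is a rough sketch (untested; it assumes the same Weka 3.6 API as above plus Guava's `Joiner`, and `vectorize` is just an illustrative name) — is something along these lines the intended usage?

```java
import com.google.common.base.Joiner;
import com.google.common.collect.Multiset;
import java.util.Map;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public static Instances vectorize(Map<String, Multiset<String>> docToTermsMap) throws Exception {
    // A single string attribute that holds each document's text.
    FastVector atts = new FastVector(1);
    atts.addElement(new Attribute("text", (FastVector) null)); // null values => string attribute
    Instances raw = new Instances("docs", atts, docToTermsMap.size());

    for (Multiset<String> bag : docToTermsMap.values()) {
        Instance inst = new Instance(1);
        inst.setDataset(raw); // dataset must be set before a string value can be stored
        // A Multiset iterator repeats each term according to its count, so joining
        // on spaces reproduces the original term frequencies.
        inst.setValue(0, Joiner.on(' ').join(bag));
        raw.add(inst);
    }

    StringToWordVector filter = new StringToWordVector();
    filter.setOutputWordCounts(true); // word counts (tf) instead of binary presence
    filter.setInputFormat(raw);
    return Filter.useFilter(raw, filter);
}
```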
Source: https://stackoverflow.com/questions/20768607/using-stringtowordvector-in-weka-with-internal-data-structures