I am trying to cluster documents with Weka. The clustering is part of a larger pipeline, and I really can't afford to write out ARFF files. I have all the documents and the bag of words in each document as a Map<String, Multiset<String>> structure, where the keys are document names and the Multiset<String> values are the bags of words in those documents.
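For concreteness, the structure is built along these lines (Maps.newTreeMap() and HashMultiset are from Guava; the file names and words are just placeholders):

// Guava's Maps.newTreeMap() and HashMultiset.create(); contents are made up.
TreeMap<String, Multiset<String>> docToTermsMap = Maps.newTreeMap();
docToTermsMap.put("doc1.txt", HashMultiset.create(Arrays.asList("apple", "banana", "apple")));
docToTermsMap.put("doc2.txt", HashMultiset.create(Arrays.asList("banana", "cherry")));

I have two questions, really: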
(1) My current approach ends up clustering terms, not documents:
public final Instances buildDocumentInstances(TreeMap<String, Multiset<String>> docToTermsMap, String encoding) throws IOException {
    // TermToDocumentFrequencyMap (term -> number of documents containing it) and
    // TermToIndexMap (term -> attribute index) are fields of the enclosing class.
    int dimension = TermToDocumentFrequencyMap.navigableKeySet().size();
    FastVector attributes = new FastVector(dimension);
    for (String s : TermToDocumentFrequencyMap.navigableKeySet()) {
        attributes.addElement(new Attribute(s));
    }
    List<Instance> instances = Lists.newArrayList();
    for (Map.Entry<String, Multiset<String>> entry : docToTermsMap.entrySet()) {
        // One instance per document; attributes never set stay "missing" in Weka 3.6.
        Instance instance = new Instance(dimension);
        for (Multiset.Entry<String> msEntry : entry.getValue().entrySet()) {
            Integer index = TermToIndexMap.get(msEntry.getElement());
            if (index != null) {
                switch (encoding) {
                    case "tf":
                        instance.setValue(index, msEntry.getCount());
                        break;
                    case "binary":
                        instance.setValue(index, msEntry.getCount() > 0 ? 1 : 0);
                        break;
                    case "tfidf":
                        double tf = msEntry.getCount();
                        double df = TermToDocumentFrequencyMap.get(msEntry.getElement());
                        double idf = Math.log(docToTermsMap.size() / df); // idf = log(#docs / df)
                        instance.setValue(index, tf * idf);
                        break;
                }
            }
        }
        instances.add(instance);
    }
    Instances dataset = new Instances("My Dataset Name", attributes, instances.size());
    for (Instance instance : instances) {
        dataset.add(instance);
    }
    return dataset;
}
I am trying to create individual Instance objects and then build a dataset by adding them to an Instances object. Each instance is a document vector (with binary, tf, or tf-idf encoding), and each word is a separate attribute. But when I run SimpleKMeans#buildClusterer, the output shows that it is clustering the words, not the documents. I am clearly doing something horribly wrong, but I can't figure out what that mistake is.
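For reference, this is roughly how I run the clusterer (the choice of k=5 is arbitrary, just for illustration):

// weka.clusterers.SimpleKMeans; 5 clusters is an arbitrary choice here.
Instances dataset = buildDocumentInstances(docToTermsMap, "tfidf");
SimpleKMeans clusterer = new SimpleKMeans();
clusterer.setNumClusters(5);
clusterer.buildClusterer(dataset);
// Since each row should be one document, I expect one cluster assignment per document:
for (int i = 0; i < dataset.numInstances(); i++) {
    System.out.println(i + " -> " + clusterer.clusterInstance(dataset.instance(i)));
}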
(2) How to use StringToWordVector in this scenario?
Everywhere I have looked, people suggest using weka.filters.unsupervised.attribute.StringToWordVector to cluster documents. But I can't find any example that shows how to feed it the words from my document --> bag-of-words structure. [Note: in my case it is a Map<String, Multiset<String>>, but that is not a rigid requirement. I can transform it into some other data structure if StringToWordVector requires it.]
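My best guess, from reading the Weka 3.6 Javadoc, is a sketch like the one below: join each bag of words back into a whitespace-separated string, store it in a single string attribute, and let the filter re-tokenize it. I don't know whether this round trip is the intended usage, which is exactly what I'm asking; Joiner and Multiset are from Guava, and toWordVectors is just a name I made up:

import com.google.common.base.Joiner;
import com.google.common.collect.Multiset;
import java.util.Map;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public Instances toWordVectors(Map<String, Multiset<String>> docToTermsMap) throws Exception {
    // One string attribute holding the full document text; passing null as the
    // second argument is how the 3.6 API declares a string (not nominal) attribute.
    FastVector attributes = new FastVector(1);
    attributes.addElement(new Attribute("text", (FastVector) null));
    Instances docs = new Instances("documents", attributes, docToTermsMap.size());

    for (Map.Entry<String, Multiset<String>> entry : docToTermsMap.entrySet()) {
        // Repeated words in the Multiset are repeated in the joined string,
        // so term frequencies survive the round trip.
        double[] values = new double[1];
        values[0] = docs.attribute(0).addStringValue(Joiner.on(' ').join(entry.getValue()));
        docs.add(new Instance(1.0, values));
    }

    StringToWordVector filter = new StringToWordVector();
    filter.setOutputWordCounts(true); // raw term counts instead of 0/1 presence
    // filter.setIDFTransform(true); // presumably this would give tf-idf instead
    filter.setInputFormat(docs);
    return Filter.useFilter(docs, filter); // one word-vector row per document
}

If this is on the right track, I assume I would keep a parallel list of document names (in the map's iteration order) to map cluster assignments back to documents, since storing the name as a second string attribute would presumably trip up SimpleKMeans.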
Source: https://stackoverflow.com/questions/20768607/using-stringtowordvector-in-weka-with-internal-data-structures