Question
I'm using the Weka machine learning library's Java API and have the following code:
String html = "repeat repeat repeat";
Attribute input = new Attribute("html",(FastVector) null);
FastVector inputVec = new FastVector();
inputVec.addElement(input);
Instances htmlInst = new Instances("html",inputVec,1);
htmlInst.add(new Instance(1));
htmlInst.instance(0).setValue(0, html);
StringToWordVector filter = new StringToWordVector();
filter.setUseStoplist(true);
filter.setInputFormat(htmlInst);
Instances dataFiltered = Filter.useFilter(htmlInst, filter);
Instance last = dataFiltered.lastInstance();
System.out.println(last);
StringToWordVector is supposed to count word occurrences within the string, but instead of the word 'repeat' being counted 3 times, the count comes out as 1.
What am I doing wrong?
Answer 1:
Gee... all those lines of code. How about these few lines instead?
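import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public static Map<String, Integer> countWords(String input) {
    Map<String, Integer> map = new HashMap<String, Integer>();
    // Each match is one run of word characters between word boundaries
    Matcher matcher = Pattern.compile("\\b\\w+\\b").matcher(input);
    while (matcher.find()) {
        String word = matcher.group();
        map.put(word, map.containsKey(word) ? map.get(word) + 1 : 1);
    }
    return map;
}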
Here's the code in action:
public static void main(String[] args) {
    System.out.println(countWords("sample, repeat sample, of text"));
}
Output:
{of=1, text=1, repeat=1, sample=2}
Answer 2:
By default, StringToWordVector only reports the presence or absence of each word as 0/1. You must enable counting explicitly. Add:
filter.setOutputWordCounts(true);
and re-run.
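To make the difference between the two output modes concrete, here is a minimal plain-Java sketch. It does not use Weka and is not how StringToWordVector is implemented; the class WordVectorSketch and the method vectorize are hypothetical names chosen for illustration, and the whitespace tokenization is an assumption. The boolean parameter only mimics the effect of the outputWordCounts option.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordVectorSketch {
    // Hypothetical helper, not part of Weka.
    // outputWordCounts == false: every word that occurs gets the value 1 (presence).
    // outputWordCounts == true:  every word gets the number of times it occurs.
    public static Map<String, Integer> vectorize(String text, boolean outputWordCounts) {
        Map<String, Integer> vector = new LinkedHashMap<>();
        // Assumption: tokens are separated by whitespace
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token; skip it
            }
            if (outputWordCounts) {
                vector.merge(token, 1, Integer::sum); // accumulate occurrences
            } else {
                vector.put(token, 1); // record presence only
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String html = "repeat repeat repeat";
        System.out.println(vectorize(html, false)); // {repeat=1}
        System.out.println(vectorize(html, true));  // {repeat=3}
    }
}
```

The first call corresponds to what the question observed (a value of 1), the second to the behaviour expected once counting is enabled.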
Weka has a dedicated mailing list; posting such questions there might get you faster responses.
Source: https://stackoverflow.com/questions/6811418/java-weka-stringtowordvector-is-not-counting-word-occurences-properly