Question
In my project I have to count the most frequent words in a Wikipedia article. I found Jsoup for parsing HTML, but that still leaves the problem of word frequency. Is there a function in Jsoup that counts the frequency of words, or any way to find which words are the most frequent on a webpage using Jsoup?
Thanks.
Answer 1:
Yes, you could use Jsoup to get the text from the webpage, like this:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
String text = doc.body().text();
Then you need to count the words and find out which ones are the most frequent. An existing word-count example looks promising. We need to modify it to use our String output from Jsoup, something like this:
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupWordCount {

    public static void main(String[] args) throws IOException {
        long time = System.currentTimeMillis();
        Map<String, Word> countMap = new HashMap<String, Word>();

        // Connect to Wikipedia and get the HTML
        System.out.println("Downloading page...");
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

        // Get the actual text from the page, excluding the HTML
        String text = doc.body().text();

        System.out.println("Analyzing text...");
        // Create a BufferedReader so the words can be counted line by line.
        // The bytes are decoded with the same charset they were encoded with,
        // instead of the platform default.
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            // Split on anything that is not one of the listed letters
            String[] words = line.split("[^A-ZÅÄÖa-zåäö]+");
            for (String word : words) {
                if ("".equals(word)) {
                    continue;
                }
                Word wordObj = countMap.get(word);
                if (wordObj == null) {
                    wordObj = new Word();
                    wordObj.word = word;
                    wordObj.count = 0;
                    countMap.put(word, wordObj);
                }
                wordObj.count++;
            }
        }
        reader.close();

        // Sort by descending count (see Word.compareTo)
        SortedSet<Word> sortedWords = new TreeSet<Word>(countMap.values());
        int i = 0;
        int maxWordsToDisplay = 10; // number of words you want to show frequency for
        List<String> wordsToIgnore = Arrays.asList("the", "and", "a");
        for (Word word : sortedWords) {
            if (i >= maxWordsToDisplay) {
                break;
            }
            if (wordsToIgnore.contains(word.word)) {
                continue; // ignored words do not count toward the limit
            }
            System.out.println(word.count + "\t" + word.word);
            i++;
        }

        time = System.currentTimeMillis() - time;
        System.out.println("Finished in " + time + " ms");
    }

    public static class Word implements Comparable<Word> {
        String word;
        int count;

        @Override
        public int hashCode() { return word.hashCode(); }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof Word && word.equals(((Word) obj).word);
        }

        @Override
        public int compareTo(Word b) {
            // Highest count first; ties are broken by the word itself so that
            // the TreeSet does not drop words that share the same count
            int byCount = Integer.compare(b.count, count);
            return byCount != 0 ? byCount : word.compareTo(b.word);
        }
    }
}
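One thing to watch with the TreeSet approach: a TreeSet treats two elements as duplicates whenever the comparison returns 0, so an ordering based on the count alone keeps only one word per distinct count. The following is a minimal sketch of that behavior using only the standard library. The class name TreeSetTieDemo, the two comparators, and the sample counts are made up for illustration and are not part of the original answer.

```java
import java.util.*;

public class TreeSetTieDemo {
    // Hypothetical ordering: descending count only, ties compare as equal
    static final Comparator<Map.Entry<String, Integer>> BY_COUNT_ONLY =
            (a, b) -> Integer.compare(b.getValue(), a.getValue());

    // Hypothetical ordering: descending count, then the word as a tie-breaker
    static final Comparator<Map.Entry<String, Integer>> BY_COUNT_THEN_WORD =
            (a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            };

    // Puts all entries into a TreeSet with the given ordering and reports
    // how many of them survive
    static int sizeAfterSorting(Map<String, Integer> counts,
                                Comparator<Map.Entry<String, Integer>> order) {
        SortedSet<Map.Entry<String, Integer>> sorted = new TreeSet<>(order);
        sorted.addAll(counts.entrySet());
        return sorted.size();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("of", 3);
        counts.put("in", 3); // same count as "of"
        counts.put("to", 1);

        System.out.println(sizeAfterSorting(counts, BY_COUNT_ONLY));      // prints 2
        System.out.println(sizeAfterSorting(counts, BY_COUNT_THEN_WORD)); // prints 3
    }
}
```

With the count-only ordering, one of "of" and "in" is silently discarded. Adding the word as a tie-breaker keeps both.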
Output from one run (the counts depend on the page content at the time of the request):
Downloading page...
Analyzing text...
42 of
24 in
20 Wikipedia
19 to
16 is
11 that
10 The
9 was
8 articles
7 featured
Finished in 3300 ms
Some notes:
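This code can ignore some words, like "the", "and", "a" etc. You will have to customize the list. The comparison is case-sensitive, which is why "The" still appears in the output above.
It seems to have problems with Unicode characters sometimes. I did not experience this myself, but someone in the comments did. Note that the split pattern only treats A-Z, a-z and Å, Ä, Ö as letters, so any other character acts as a word separator.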
This could be done better and with less code.
Not well tested.
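As a rough illustration of the notes above, here is a shorter sketch that uses Java 8 streams. It is not the original answer's method: it splits on any non-letter character (`\P{L}`) instead of a fixed alphabet, lowercases words before counting so that "The" and "the" are merged, and filters a stop-word set. The class name WordFrequency, the method topWords, and the STOP_WORDS list are hypothetical names chosen for this example. The input string stands in for the text you would get from `doc.body().text()`.

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordFrequency {
    // Hypothetical stop-word list; extend it to suit your text
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "and", "a"));

    /** Returns up to limit entries (word and its count), most frequent first. */
    static List<Map.Entry<String, Long>> topWords(String text, int limit) {
        // Split on runs of non-letter characters, normalize case, drop stop words, count
        Map<String, Long> counts = Arrays.stream(text.split("\\P{L}+"))
                .filter(w -> !w.isEmpty())
                .map(w -> w.toLowerCase(Locale.ROOT))
                .filter(w -> !STOP_WORDS.contains(w))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Sort by descending count; ties are ordered alphabetically
        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for: String text = doc.body().text();
        String text = "The cat and the hat. A cat sat; the Cat ran.";
        for (Map.Entry<String, Long> e : topWords(text, 3)) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

Whether lowercasing is appropriate depends on the use case. For proper nouns such as "Wikipedia" you may prefer to keep the original case and only lowercase when checking the stop-word list.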
Enjoy!
Source: https://stackoverflow.com/questions/29447566/find-most-frequent-words-on-a-webpage-using-jsoup