jsoup - extract text from a Wikipedia article

猫巷女王i 2021-01-06 12:58

I'm writing some Java code to perform NLP tasks on text from Wikipedia. How can I use jsoup to extract all the text of a Wikipedia article (for example all the

3 answers
  •  轻奢々 (original poster)
    2021-01-06 13:32

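    // Requires jsoup; imports: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element
    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").get(); // fetch and parse the page
    Element contentDiv = doc.select("div[id=content]").first(); // null if no such div exists
    String html = contentDiv.toString(); // The result: the element's HTML markup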
    

    This gives you the content as formatted HTML, of course. If you want the "raw" text instead, you can filter the result with Jsoup.clean or call contentDiv.text(), which returns the element's text with the markup removed.
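
To make the difference between the two "raw" options concrete, here is a minimal offline sketch. It parses an inline HTML string instead of fetching Wikipedia, so it runs without network access. The class name ContentTextDemo and the helpers plainText and cleaned are hypothetical names made up for illustration. It assumes jsoup 1.14.2 or later on the classpath (older versions call the Safelist class Whitelist) and that the page exposes an element with id "content", as in the answer above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class ContentTextDemo {

    // Hypothetical helper: text of the element with id "content", markup removed.
    static String plainText(String html) {
        Document doc = Jsoup.parse(html);
        // getElementById matches any tag name, not only <div>
        Element content = doc.getElementById("content");
        return content == null ? "" : content.text();
    }

    // Hypothetical helper: same element, but stripped with Jsoup.clean.
    // Safelist.none() allows no tags, so only text nodes survive.
    static String cleaned(String html) {
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById("content");
        return content == null ? "" : Jsoup.clean(content.html(), Safelist.none());
    }

    public static void main(String[] args) {
        // Stand-in for a fetched page; a real article is far larger.
        String html = "<html><body><div id=\"content\">"
                + "<p>Hello <b>world</b></p>"
                + "</div></body></html>";
        System.out.println(plainText(html));
        System.out.println(cleaned(html));
    }
}
```

For NLP input, text() is usually the simpler choice, since it also normalizes whitespace. Jsoup.clean is more useful when you want to keep a restricted set of tags, for example by passing Safelist.basic() instead of Safelist.none().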
