Question
I want to extract only the users' views and replies and the thread title from a forum page. When I supply a URL to the code below, it returns all the text on the page. I only want the thread heading, which is defined in the title tag, and the user replies, which sit inside the div content tags. How can I extract just these, and how can I write the result to a txt file?
package extract;

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;

public class TestJsoup
{
    public void SimpleParse()
    {
        try
        {
            Document doc = Jsoup.connect("url").get();
            doc.body().wrap("<div></div>");
            doc.body().wrap("<pre></pre>");
            String text = doc.text();
            // Converting nbsp entities
            text = text.replaceAll("\u00A0", " ");
            System.out.print(text);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String args[])
    {
        TestJsoup tjs = new TestJsoup();
        tjs.SimpleParse();
    }
}
Answer 1:
Why do you wrap the body element in a div and a pre tag?
The title element can be selected like this:
Document doc = Jsoup.connect("url").get();
Element titleElement = doc.select("title").first();
String titleText = titleElement.text();
// Or shorter, as an alternative to the two lines above
// (renamed here so both variants compile side by side)
String shortTitleText = doc.select("title").first().text();
The div tags can be selected like this:
// Document 'doc' as above
Elements divTags = doc.select("div");
for( Element element : divTags )
{
    // Do something here, e.g. print each element
    System.out.println(element);
    // Or get the text of it
    String text = element.text();
}
The Jsoup Selector API documentation gives an overview of the whole selector syntax; it will help you find any kind of element you need.
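For the txt-file part of the question, here is a minimal sketch that uses only the standard library. It assumes the title text and the reply texts have already been extracted with Jsoup as shown above and are passed in as plain strings. The class name ThreadWriter, the method writeThread and the file name thread.txt are hypothetical names chosen for illustration, not part of Jsoup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThreadWriter {

    // Hypothetical helper: writes the title first, then each reply,
    // with an empty line in front of every reply.
    public static void writeThread(Path target, String title, List<String> replies)
            throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(title);
        for (String reply : replies) {
            lines.add("");    // empty separator line
            lines.add(reply);
        }
        // Creates the file, or overwrites it if it already exists
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder data; in practice pass titleText and the element.text() values
        Path out = Paths.get("thread.txt");
        writeThread(out, "Example thread title",
                Arrays.asList("First reply", "Second reply"));
        System.out.println("Wrote " + out.toAbsolutePath());
    }
}
```

In the loop over divTags you would add each element.text() to a List<String> instead of printing it, then pass that list to writeThread once the loop has finished.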
Answer 2:
In the end I used different code and collected the data from these specific tags:
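// Collect the blockquote elements that hold the post bodies
Elements content = doc.getElementsByTag("blockquote");
// Assumption: the original selector "[postcontent restore]" was meant to match class="postcontent restore"
Elements k = doc.select(".postcontent.restore");
// Remove the blockquote, br, div, a and b elements that these selectors match
content.select("blockquote").remove();
content.select("br").remove();
content.select("div").remove();
content.select("a").remove();
content.select("b").remove();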
Source: https://stackoverflow.com/questions/13005872/extract-the-thread-head-and-thread-reply-from-a-forum