tags
I want to extract the text placed in p and li tags from HTML page(s), so I can start tokenizing the page to construct inverted index(es) for ea
This can do the job:
Elements e = doc.select("p");
Here is a list of all selectors you can use.
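Since the question asks for both p and li tags, the two selectors can be combined with a comma. The following is a minimal sketch, not a definitive implementation: the class name, method name, and sample HTML are made up for illustration, and it assumes a reasonably recent jsoup release in which Elements.eachText() is available.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class name, used only for this illustration.
public class ParagraphAndListText {
    // Returns the text of every <p> and <li> element, in document order.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        // A comma in a CSS selector means "either": this matches p and li elements.
        return doc.select("p, li").eachText();
    }

    public static void main(String[] args) {
        // Sample HTML invented for the example.
        String html = "<p>intro</p><ul><li>first item</li><li>second item</li></ul>";
        for (String text : extract(html)) {
            System.out.println(text);
        }
    }
}
```

Note that if a p is nested inside an li (or the other way round), both elements match, so the inner text will show up twice in the list.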
Suppose you have this html:
String html="<p>some <strong>bold</strong> text</p>";
To get "some bold text" as the result, you should use:
Document doc = Jsoup.parse(html);
Element p = doc.select("p").first();
String text = doc.body().text(); // some bold text
or
String text = p.text(); // some bold text
Suppose now you have the following, more complex html:
String html = "<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";
To get the values from the two p tags you have to do something like this:
Document doc = Jsoup.parse(html);
Element content = doc.getElementById("someid");
Elements p = content.getElementsByTag("p");
String pConcatenated = "";
for (Element x : p) {
    // text() returns trimmed text, and no separator is added between paragraphs
    pConcatenated += x.text();
}
System.out.println(pConcatenated); // some textanother p tag
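Because the loop above concatenates without a separator, the last word of one paragraph runs into the first word of the next, which matters if the result is going to be tokenized. One possible way around that, sketched here under the assumption that Elements.eachText() is available in your jsoup version, is to join the per-paragraph texts with an explicit separator. The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical class name, used only for this illustration.
public class JoinParagraphText {
    // Joins the text of every <p> inside the element with id "someid".
    static String joinParagraphs(String html, String separator) {
        Document doc = Jsoup.parse(html);
        // "#someid p" selects p elements that are descendants of #someid.
        Elements paragraphs = doc.select("#someid p");
        return String.join(separator, paragraphs.eachText());
    }

    public static void main(String[] args) {
        String html = "<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";
        System.out.println(joinParagraphs(html, " "));
    }
}
```

Calling text() directly on the Elements object should give a similar space-separated result; the explicit join just makes the separator visible and configurable.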
You can find more info here also
Hope this helped
Try this:
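File input = new File("/home/s5/Downloads/PDFCopy/PDs.html");
// The third argument is the base URI used to resolve relative links.
Document doc = Jsoup.parse(input, "UTF-8", "http://www.cisco.com/c/en/us/products/collateral/wireless/aironet-1815-series-access-points/datasheet-c78-738481.pdf");
Elements paragraphs = doc.select("p");
// Combined text of all matched p elements.
String paragraphText = paragraphs.text();
// Split on runs of non-word characters so that punctuation followed by a space does not produce empty tokens.
String[] words = paragraphText.split("\\W+");
for (String str : words) {
    System.out.println(str);
}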
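Once the text is split into words, the inverted index mentioned in the question can be built with plain Java collections. This is only a rough sketch under stated assumptions: the class name, method names, and sample documents are invented here, document ids are simply list positions, and the tokenizer is the same simple non-word split (it treats only ASCII letters, digits, and underscore as word characters).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class name, used only for this illustration.
public class InvertedIndexSketch {
    // Lowercases the text and splits it on runs of non-word characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            // A leading separator yields one empty string; skip it.
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Maps each token to the sorted set of document ids (list positions) containing it.
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : tokenize(docs.get(id))) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Each string stands in for the text extracted from one page.
        List<String> docs = List.of("some bold text", "another p tag", "some other text");
        for (Map.Entry<String, SortedSet<Integer>> e : build(docs).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In practice each entry of docs would be the string returned by the jsoup text() call for one page.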
// d is the parsed Document
String testText1 = d.select("body").text();
System.out.println(testText1);
or
String testText2 = d.select("body p").text();
System.out.println(testText2);
You can use this for getting the text from tags.