How to extract texts between tags

走了就别回头了 2021-01-11 18:18

I want to extract the text inside p and li tags from HTML page(s), so that I can tokenize the pages and construct inverted index(es) for ea

3 Answers
  • 2021-01-11 18:40

    This should do the job:

    Elements e=doc.select("p"); 
    

    Here is a list of all selectors you can use.

    Suppose you have this html:

    String html="<p>some <strong>bold</strong> text</p>";
    

    To get "some bold text" as the result, use:

    Document doc = Jsoup.parse(html);
    Element p= doc.select("p").first();
    String text = doc.body().text(); //some bold text
    

    or

    String text = p.text(); //some bold text
    

    Now suppose you have the following, more complex HTML:

    String html="<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";
    

    To get the text of the two p tags, you can do something like this:

    Document doc = Jsoup.parse(html);
    Element content = doc.getElementById("someid");
    Elements p= content.getElementsByTag("p");
    
    String pConcatenated="";
    for (Element x: p) {
      pConcatenated+= x.text();
    }
    
    // text() trims each paragraph and no separator is added between them
    System.out.println(pConcatenated);//some textanother p tag
    

    You can find more info here also

    Hope this helped
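
Since the question asks about both p and li tags, the same approach can be extended with a selector group. The following is a minimal sketch, assuming jsoup is on the classpath; the class name `ParagraphAndListText`, the method name `extractBlocks`, and the sample HTML are hypothetical and only serve as illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class ParagraphAndListText {

    // Returns the text of every <p> and <li> element, in document order.
    public static List<String> extractBlocks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> blocks = new ArrayList<>();
        // "p, li" is a selector group: it matches elements that are either <p> or <li>.
        for (Element element : doc.select("p, li")) {
            // text() returns whitespace-normalized text, including text of child elements.
            String text = element.text();
            if (!text.isEmpty()) {
                blocks.add(text);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Sample HTML, made up for this sketch.
        String html = "<div><p>some <strong>bold</strong> text</p>"
                + "<ul><li>first item</li><li>second item</li></ul>"
                + "<p> another p tag</p></div>";
        for (String block : extractBlocks(html)) {
            System.out.println(block);
        }
    }
}
```

Keeping one string per element, rather than concatenating everything, preserves the boundaries between paragraphs and list items. Note that if a p is nested inside an li, its text would appear twice in the result, once for each matching element.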

  • 2021-01-11 18:46

    Try this:

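    import java.io.File;
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    public class ExtractParagraphWords { // class name is arbitrary
        public static void main(String[] args) throws IOException {
            File input = new File("/home/s5/Downloads/PDFCopy/PDs.html");
            // The third argument is the base URI used to resolve relative links.
            Document doc = Jsoup.parse(input, "UTF-8","http://www.cisco.com/c/en/us/products/collateral/wireless/aironet-1815-series-access-points/datasheet-c78-738481.pdf");
            Elements paragraphs = doc.select("p");
            // text() on Elements joins the text of all matched elements.
            String paragraphText = paragraphs.text();
            // Split on runs of non-word characters so no empty tokens appear between words.
            String[] words = paragraphText.split("\\W+");
            for (String str : words) {
                System.out.println(str);
            }
        }
    }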
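
The words produced by a split like the one above are what an inverted index, the stated goal of the question, is built from. The following is a rough sketch using only the standard library, assuming the text of each page has already been extracted into a string; the class name `InvertedIndexSketch`, the method names, the lowercasing step, and the use of list positions as document ids are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of a tiny in-memory inverted index.
class InvertedIndexSketch {

    // Splits text on runs of non-word characters and lowercases each token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("\\W+")) {
            // split() can yield an empty first element when the text starts with a separator.
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Maps each token to the sorted set of document ids (list positions) that contain it.
    public static Map<String, Set<Integer>> buildIndex(List<String> documents) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            for (String token : tokenize(documents.get(docId))) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Sample texts, standing in for text extracted from three pages.
        List<String> documents = Arrays.asList("some bold text", "some other text", "another p tag");
        buildIndex(documents).forEach((token, ids) -> System.out.println(token + " -> " + ids));
    }
}
```

By default `\\W` in Java treats only ASCII letters, digits, and underscore as word characters, so pages with non-ASCII text would likely need a different tokenizer.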
    
  • 2021-01-11 18:52
    // d is an already parsed org.jsoup.nodes.Document
    String testText1 = d.select("body").text();
    System.out.println(testText1);
    

    or

    String testText2 = d.select("body p").text();
    System.out.println(testText2);
    

    You can use either of these to get the text from tags.
