How do you parse a web page and extract all the href links?

情话喂你 2021-01-01 19:18

I want to parse a web page in Groovy and extract every href link along with the text associated with it.

If the page contained these links:



        
7 Answers
  • 2021-01-01 19:26

    I don't know Java, but I think XPath is far better than classic regular expressions for selecting one or more HTML elements.

    It is also easier to write and to read.

    <html>
       <body>
          <a href="1.html">1</a>
          <a href="2.html">2</a>
          <a href="3.html">3</a>
       </body>
    </html>
    

    With the HTML above, the expression "/html/body/a" selects all of the a elements, and "/html/body/a/@href" selects their href attributes directly.

    Here's a good step-by-step tutorial: http://www.zvon.org/xxl/XPathTutorial/General/examples.html
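
    As a rough sketch of how that expression could be evaluated from Groovy, the snippet below uses the JDK's standard javax.xml.xpath API on the sample page above. The variable names (html, links) are illustrative, and the approach assumes the input is well-formed XML; real-world HTML often is not.

    ```groovy
    import javax.xml.parsers.DocumentBuilderFactory
    import javax.xml.xpath.XPathConstants
    import javax.xml.xpath.XPathFactory

    // Sample page from the answer above; it must be well-formed for this parser.
    def html = '''<html>
      <body>
        <a href="1.html">1</a>
        <a href="2.html">2</a>
        <a href="3.html">3</a>
      </body>
    </html>'''

    // Parse the string into a DOM document.
    def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    def doc = builder.parse(new ByteArrayInputStream(html.getBytes('UTF-8')))

    // Evaluate the XPath expression and ask for a node set.
    def xpath = XPathFactory.newInstance().newXPath()
    def nodes = xpath.evaluate('/html/body/a', doc, XPathConstants.NODESET)

    // Collect the href attribute and the link text of each matched <a> element.
    def links = (0..<nodes.length).collect { i ->
        def a = nodes.item(i)
        [href: a.getAttribute('href'), text: a.textContent]
    }

    links.each { println "${it.text}, ${it.href}" }
    ```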

  • 2021-01-01 19:28

    A quick Google search turned up a nice-looking possibility: TagSoup, a SAX-compliant parser designed to cope with malformed HTML.

  • 2021-01-01 19:28

    Use XmlSlurper to parse the HTML as an XML document, then call findAll on a depth-first traversal of the resulting GPathResult, with a closure that selects the a tags (find would return only the first match). Each node in the resulting list gives you the href attribute and the link text.
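
    A minimal sketch of the XmlSlurper approach, assuming Groovy 3 or later (where the class lives in groovy.xml; on Groovy 2.x it is groovy.util.XmlSlurper and needs no import) and well-formed input. The sample markup and the links variable are illustrative only.

    ```groovy
    import groovy.xml.XmlSlurper

    // Illustrative, well-formed sample page.
    def html = '''<html>
      <body>
        <a href="1.html">1</a>
        <a href="2.html">2</a>
        <a href="3.html">3</a>
      </body>
    </html>'''

    def page = new XmlSlurper().parseText(html)

    // Walk the whole tree, keep only the <a> elements,
    // then read the href attribute and the text of each one.
    def links = page.depthFirst()
        .findAll { it.name() == 'a' }
        .collect { [href: it.@href.text(), text: it.text()] }

    links.each { println "${it.text}, ${it.href}" }
    ```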

  • 2021-01-01 19:28

    HTML parser + regular expressions. Any language would do, though I'd say Perl is the fastest solution.

  • 2021-01-01 19:42

    Parsing with XmlSlurper only works if the HTML is well-formed.

    If your HTML page has tags that are not well-formed, use a regex to parse the page instead.

    Example: <a href="www.google.com">

    Here, the 'a' tag is never closed, so the markup is not well-formed.

    new URL(url).eachLine { line ->
        // (?i) matches <a href="..."> in any letter case
        (line =~ /(?i)<a\s[^>]*href="(.*?)"/).each { match, href ->
            // process href
        }
    }
    
  • 2021-01-01 19:44

    Try a regular expression. Something like this should work:

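    // Each match is [whole match, url, text], so the closure needs three parameters.
    (html =~ /(?is)<a\s[^>]*href=["'](.*?)["'][^>]*>(.*?)<\/a>/).each { match, url, text ->
        // do something with url and text
    }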
    

    Take a look at "Groovy - Tutorial 4 - Regular expressions basics" and "Anchor Tag Regular Expression Breaking".
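
    For reference, a self-contained sketch of the regex approach applied to an in-memory string. The sample markup (including the example.com URL) is made up for illustration, and the pattern is only one possible choice: it accepts single or double quotes and any letter case, but like any regex it can be fooled by unusual HTML.

    ```groovy
    // Made-up sample input: mixed case, mixed quote styles, extra attributes.
    def html = '''<p><a href="http://example.com/a">First</a>
    <A HREF='b.html' class="x">Second</a></p>'''

    // (?i) ignores case, (?s) lets the link text span lines;
    // the backreference \1 requires the closing quote to match the opening one.
    def pattern = /(?is)<a\s[^>]*?href\s*=\s*(["'])(.*?)\1[^>]*>(.*?)<\/a>/

    // Each match is a list: [whole match, quote, url, text].
    def links = (html =~ pattern).collect { m ->
        [href: m[2], text: m[3]]
    }

    links.each { println "${it.text}, ${it.href}" }
    ```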
