How do you parse a web page and extract all the href links?

情话喂你 2021-01-01 19:18

I want to parse a web page in Groovy and extract every href link together with its associated text.

If the page contained these links:



        
7 Answers
  • 2021-01-01 19:50

    Assuming the page is well-formed XHTML, parse it with XmlSlurper, walk all of the nodes depth-first, keep the 'a' tags, and print each link's text and href.

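    // Sample input: XmlSlurper needs well-formed XML, so this must be valid XHTML.
    // (On Groovy 4+, also add `import groovy.xml.XmlSlurper` at the top.)
    def input = """<html><body>
    <a href = "http://www.hjsoft.com/">John</a>
    <a href = "http://www.google.com/">Google</a>
    <a href = "http://www.stackoverflow.com/">StackOverflow</a>
    </body></html>"""
    
    def doc = new XmlSlurper().parseText(input)
    // Walk every node depth-first, keep only the <a> elements,
    // and print each link's text followed by its href.
    doc.depthFirst().collect { it }.findAll { it.name() == "a" }.each {
        println "${it.text()}, ${it.@href.text()}"
    }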
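
    As a variation on the snippet above, here is a minimal sketch that returns the links as a list of maps instead of printing them, which may be handier if you want to process them further. Note the assumptions: extractLinks is a hypothetical helper name, the '**' GPath shorthand is used in place of depthFirst(), and the import assumes Groovy 3 or later. Like the original, it only handles well-formed XHTML.

    ```groovy
    import groovy.xml.XmlSlurper   // assumption: Groovy 3+; older versions find XmlSlurper without an import

    // Hypothetical helper (not part of the original answer): returns every
    // <a> element as a [text: ..., href: ...] map.
    def extractLinks(String xhtml) {
        def doc = new XmlSlurper().parseText(xhtml)
        // '**' is the GPath shorthand for a depth-first walk over all nodes.
        doc.'**'.findAll { it.name() == 'a' }.collect {
            [text: it.text(), href: it.@href.text()]
        }
    }

    def sample = '''<html><body>
    <a href="http://www.hjsoft.com/">John</a>
    <a href="http://www.google.com/">Google</a>
    </body></html>'''

    def links = extractLinks(sample)
    links.each { println "${it.text}, ${it.href}" }
    ```

    An <a> tag without an href attribute would come back with an empty string for href, so filter those out if they matter to you.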
    