Parse broken HTML with golang

北荒 2021-02-01 10:33

I need to find elements in an HTML string. Unfortunately the HTML is pretty much broken (e.g. closing tags without an opening pair).

I tried to use XPath with launch

1 answer
  • 2021-02-01 10:52

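    It seems net/html (golang.org/x/net/html) does the job: it parses the broken markup into a node tree, which can then be rendered back out as well-formed HTML.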

    So that's what I am doing now:

    package main

    import (
        "bytes"
        "log"
        "strings"

        "golang.org/x/net/html"
        "gopkg.in/xmlpath.v2"
    )

    func main() {
        // The <p> element is never closed, so this input is not well-formed.
        brokenHTML := `<!DOCTYPE html><html><body><h1 id="someid">My First Heading</h1><p>paragraph</body></html>`

        // html.Parse tolerates malformed markup and still returns a full node tree.
        root, err := html.Parse(strings.NewReader(brokenHTML))
        if err != nil {
            log.Fatal(err)
        }

        // Render the tree back to a string; the output has balanced tags.
        var b bytes.Buffer
        if err := html.Render(&b, root); err != nil {
            log.Fatal(err)
        }
        fixedHTML := b.String()

        // Parse the repaired HTML with xmlpath so it can be queried with XPath.
        xmlroot, err := xmlpath.ParseHTML(strings.NewReader(fixedHTML))
        if err != nil {
            log.Fatal(err)
        }

        // Select the <h1> by its id attribute and print its text.
        path := xmlpath.MustCompile(`//h1[@id='someid']`)
        if value, ok := path.String(xmlroot); ok {
            log.Println("Found:", value)
        }
    }
    