Regex to match anchor tag and its href

Submitted by 风格不统一 on 2019-12-13 09:56:26

Question


I want to run a regex over an HTML string that contains multiple anchor tags and build a dictionary mapping each link's text to its href URL.

<p>This is a simple text with some embedded <a href="http://example.com/link/to/some/page?param1=77&param2=22">links</a>. This is a <a href="https://exmp.le/sample-page/?uu=1">different link</a>.

How do I extract each <a> tag's text and href in one go?

Edit:

func extractLinks(html: String) -> Dictionary<String, String>? {

    do {
        let regex = try NSRegularExpression(pattern: "/<([a-z]*)\b[^>]*>(.*?)</\1>/i", options: [])
        let nsString = html as NSString
        let results = regex.matchesInString(html, options: [], range: NSMakeRange(0, nsString.length))
        return results.map { nsString.substringWithRange($0.range)}
    } catch let error as NSError {
        print("invalid regex: \(error.localizedDescription)")
        return nil
    }
}

Answer 1:


First of all, you need to learn the basic pattern syntax of NSRegularExpression:

  • The pattern does not contain delimiters such as the leading and trailing /.
  • The pattern does not contain modifiers such as a trailing i; you pass that information as options instead.
  • When you want to use the meta-character \, you need to escape it as \\ in a Swift String literal.

So, the line creating an instance of NSRegularExpression should be something like this:

let regex = try NSRegularExpression(pattern: "<([a-z]*)\\b[^>]*>(.*?)</\\1>", options: .caseInsensitive)
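The three rules above can be checked with a small experiment. This is only a sketch: `matchCount` is a hypothetical helper introduced for this illustration, and the sample text is made up.

```swift
import Foundation

// Hypothetical helper for this illustration: counts matches of `pattern` in `text`.
// Returns -1 if the pattern fails to compile.
func matchCount(_ pattern: String, in text: String,
                options: NSRegularExpression.Options = []) -> Int {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return -1
    }
    let range = NSRange(text.startIndex..., in: text)
    return regex.numberOfMatches(in: text, options: [], range: range)
}

let text = "<B>bold</B>"

// Delimiters and modifiers are not special: "/" and "i" are matched literally,
// so this pattern finds nothing in `text`.
let withDelimiters = matchCount("/<b>/i", in: text)

// Case-insensitivity is requested through `options` instead.
let withOptions = matchCount("<b>", in: text, options: .caseInsensitive)

// "\\b" in the Swift literal reaches the regex engine as the word boundary \b.
let withBoundary = matchCount("<b\\b", in: text, options: .caseInsensitive)

print(withDelimiters, withOptions, withBoundary)
```

Only the opening tag is counted in the last two cases, because `</B>` has a `/` between `<` and `B`.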

But, as you may already know, your pattern does not contain any code to match href or capture its value.

Something like this would work with your example html:

import Foundation

let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
let regex = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)
let html = "<p>This is a simple text with some embedded <a\n" +
    "href=\"http://example.com/link/to/some/page?param1=77&param2=22\">links</a>.\n" +
    "This is a <a href=\"https://exmp.le/sample-page/?uu=1\">different link</a>."
let matches = regex.matches(in: html, options: [], range: NSRange(0..<html.utf16.count))
var resultDict: [String: String] = [:]
for match in matches {
    // Group 1 is the quoted href value; skip the quote character at each end.
    // (range(at:) is the Swift 4+ name of the former rangeAt(_:).)
    let hrefRange = NSRange(location: match.range(at: 1).location+1, length: match.range(at: 1).length-2)
    // Group 2 is the text between <a ...> and </a>.
    let innerTextRange = match.range(at: 2)
    let href = (html as NSString).substring(with: hrefRange)
    let innerText = (html as NSString).substring(with: innerTextRange)
    resultDict[innerText] = href
}
print(resultDict)
//->["different link": "https://exmp.le/sample-page/?uu=1", "links": "http://example.com/link/to/some/page?param1=77&param2=22"]
// (dictionary order is not guaranteed)

Remember, my pattern above may mistakenly match ill-formed a-tags or miss some nested structures. It also does not decode HTML character entities, so an entity such as &amp; inside an href or the link text is returned as-is.
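To tie this back to the question, the same pattern could be wrapped into the `extractLinks(html:)` function the asker was after. This is a minimal sketch, not a definitive implementation: returning an empty dictionary on failure and using `Range(_:in:)` to slice the Swift String are choices made here for illustration, and all the caveats about the pattern still apply.

```swift
import Foundation

// Sketch of the asker's extractLinks(html:), built on the pattern from this answer.
// Returns a dictionary of link text -> href; if two links share the same text,
// the later one overwrites the earlier one.
func extractLinks(html: String) -> [String: String] {
    let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
    // Assumption for this sketch: an invalid pattern yields an empty result.
    guard let regex = try? NSRegularExpression(pattern: pattern, options: .caseInsensitive) else {
        return [:]
    }

    var links: [String: String] = [:]
    let fullRange = NSRange(html.startIndex..., in: html)
    for match in regex.matches(in: html, options: [], range: fullRange) {
        // Convert the NSRange of each capture group into a Range<String.Index>.
        guard let quotedRange = Range(match.range(at: 1), in: html),
              let textRange = Range(match.range(at: 2), in: html) else { continue }
        // Group 1 still includes the surrounding quotes, so drop one character from each end.
        let href = String(html[quotedRange].dropFirst().dropLast())
        links[String(html[textRange])] = href
    }
    return links
}

let sample = "<p>This is a simple text with some embedded " +
    "<a href=\"http://example.com/link/to/some/page?param1=77&param2=22\">links</a>. " +
    "This is a <a href='https://exmp.le/sample-page/?uu=1'>different link</a>."
let links = extractLinks(html: sample)
// Sort by link text so the printed order is stable.
for (text, href) in links.sorted(by: { $0.key < $1.key }) {
    print("\(text) -> \(href)")
}
```

Returning a non-optional dictionary keeps the call site simple; if the caller needs to tell "no links" apart from "bad pattern", a throwing function would be the more natural shape.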

If you want to make your code more robust and generic, you should consider using an HTML parser, as suggested by ColGraff and Rob.



Source: https://stackoverflow.com/questions/43814906/regex-to-match-anchor-tag-and-its-href
