Extract links from string optimization

予麋鹿 2021-01-03 05:21

I get data (an HTML string) from a website and want to extract all the links from it. I wrote a function that works, but it is very slow.

Can you help me to optimize it? What standard

7 Answers
  • 2021-01-03 05:31

    Details

    • Swift 5.2, Xcode 11.4 (11E146)

    Solution

    import Foundation
    
    // MARK: DataDetector
    
    class DataDetector {
    
        private class func _find(all type: NSTextCheckingResult.CheckingType,
                                 in string: String, iterationClosure: (String) -> Bool) {
            guard let detector = try? NSDataDetector(types: type.rawValue) else { return }
            let range = NSRange(string.startIndex ..< string.endIndex, in: string)
            let matches = detector.matches(in: string, options: [], range: range)
            loop: for match in matches {
                for i in 0 ..< match.numberOfRanges {
                    // NSRange offsets are UTF-16 code units, not Character offsets,
                    // so convert with Range(_:in:) instead of index(_:offsetBy:).
                    guard let range = Range(match.range(at: i), in: string) else { continue }
                    guard iterationClosure(String(string[range])) else { break loop }
                }
            }
        }
    
        class func find(all type: NSTextCheckingResult.CheckingType, in string: String) -> [String] {
            var results = [String]()
            _find(all: type, in: string) {
                results.append($0)
                return true
            }
            return results
        }
    
        class func first(type: NSTextCheckingResult.CheckingType, in string: String) -> String? {
            var result: String?
            _find(all: type, in: string) {
                result = $0
                return false
            }
            return result
        }
    }
    
    // MARK: String extension
    
    extension String {
        var detectedLinks: [String] { DataDetector.find(all: .link, in: self) }
        var detectedFirstLink: String? { DataDetector.first(type: .link, in: self) }
        var detectedURLs: [URL] { detectedLinks.compactMap { URL(string: $0) } }
        var detectedFirstURL: URL? {
            guard let urlString = detectedFirstLink else { return nil }
            return URL(string: urlString)
        }
    }
    

    Usage

    let text = """
    Lorem Ipsum is simply dummy text of the printing and typesetting industry. apple.com/ Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. http://gooogle.com. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. yahoo.com It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
    """
    
    print(text.detectedLinks)
    print(text.detectedFirstLink)
    print(text.detectedURLs)
    print(text.detectedFirstURL)
    

    Console output

    ["apple.com/", "http://gooogle.com", "yahoo.com"]
    Optional("apple.com/")
    [apple.com/, http://gooogle.com, yahoo.com]
    Optional(apple.com/)
    
  • 2021-01-03 05:33

    As AdamPro13 said, you can easily get all the URLs with NSDataDetector. See the following code:

    let text = "http://www.google.com. http://www.bla.com"
    let types: NSTextCheckingType = .Link
    var error : NSError?
    
    let detector = NSDataDetector(types: types.rawValue, error: &error)        
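    // NSRange is measured in UTF-16 code units, so use the NSString length rather than count(text)
    let matches = detector!.matchesInString(text, options: nil, range: NSMakeRange(0, (text as NSString).length))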
    
    for match in matches {
       println(match.URL!)
    }
    

    It outputs:

    http://www.google.com
    http://www.bla.com
    

    Updated to Swift 2.0

    let text = "http://www.google.com. http://www.bla.com"
    let types: NSTextCheckingType = .Link
    
    let detector = try? NSDataDetector(types: types.rawValue)
    
    guard let detect = detector else {
       return
    }
    
    // NSRange is measured in UTF-16 code units, so use utf16.count rather than characters.count
    let matches = detect.matchesInString(text, options: .ReportCompletion, range: NSMakeRange(0, text.utf16.count))
    
    for match in matches {
        print(match.URL!)
    }
    

    Note that the guard statement above exits with return, so this code must be inside a function (inside a loop, use break or continue instead).

    I hope this helps.

  • 2021-01-03 05:33

    Very helpful thread! Here's an example that worked in Swift 2 (it relies on do/try/catch), based on Victor Sigler's answer.

        // extract first link (if available) and open it!
        let text = "How technology is changing our relationships to each other: http://t.ted.com/mzRtRfX"
        let types: NSTextCheckingType = .Link
    
        do {
            let detector = try NSDataDetector(types: types.rawValue)
            // NSRange is measured in UTF-16 code units, so use utf16.count rather than characters.count
            let matches = detector.matchesInString(text, options: .ReportCompletion, range: NSMakeRange(0, text.utf16.count))
            if let url = matches.first?.URL {
                print("Opening URL: \(url)")
                UIApplication.sharedApplication().openURL(url)
            }
    
        } catch {
            // reached only if the detector could not be created; finding no links is not an error
            print("error in findAndOpenURL detector")
        }
    
  • 2021-01-03 05:33

    There's actually a class called NSDataDetector that will detect the link for you.

    You can find an example of it on NSHipster here: http://nshipster.com/nsdatadetector/

  • 2021-01-03 05:35

    I wonder if you realise that every single time you call countElements, a complex function runs that has to scan all the Unicode characters in your string, extract the extended grapheme clusters from them and count them. Even if you don't know what an extended grapheme cluster is, you should be able to imagine that this isn't cheap, and it is major overkill here.

    Just convert it to an NSString*, call rangeOfString and be done with it.

    Obviously what you do is totally unsafe, because http:// doesn't mean there is a link. You can't just look for strings in html and hope it works; it doesn't. And then there is https, Http, hTtp, htTp, httP and so on and so on and so on. But that's all easy, for the real horror follow the link in Uttam Sinha's comment.
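
    For illustration, here is a rough sketch of that rangeOfString idea in current Swift, where the call is spelled range(of:options:range:). The helper name quotedLinks and the rule "a link runs up to the next double quote" are assumptions made for this example, not something taken from the question's code. The search is case-insensitive, so Http:// and HTTPS:// are picked up as well, but it is still only a heuristic and not an HTML parser.

    ```swift
    import Foundation

    // Hypothetical helper: collects every substring that starts with "http://" or
    // "https://" (in any letter case) and ends just before the next double quote.
    func quotedLinks(in html: String) -> [String] {
        var links: [String] = []
        var searchStart = html.startIndex

        while searchStart < html.endIndex,
              let scheme = html.range(of: "http", options: .caseInsensitive,
                                      range: searchStart..<html.endIndex) {
            // Look only at the few characters after "http" to decide whether
            // this is really "http://" or "https://".
            let tail = html[scheme.upperBound...].prefix(4).lowercased()
            let isLink = tail.hasPrefix("://") || tail == "s://"

            if isLink,
               let quote = html.range(of: "\"", range: scheme.upperBound..<html.endIndex) {
                links.append(String(html[scheme.lowerBound..<quote.lowerBound]))
                searchStart = quote.upperBound      // resume after the closing quote
            } else {
                searchStart = scheme.upperBound     // not a link, keep searching
            }
        }
        return links
    }

    let sample = "<a href=\"http://a.com/x\">A</a> <a href=\"HTTPS://b.org\">B</a> plain http text"
    print(quotedLinks(in: sample))
    ```

    Each search resumes from where the previous one stopped, so the string is never counted or re-scanned from the beginning.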

  • 2021-01-03 05:35

    As others have pointed out, you are better off using regexes, data detectors or a parsing library. However, as specific feedback on your string processing:

    The key with Swift strings is to embrace their forward-only nature. More often than not, integer indexing and random access are not necessary. As @gnasher729 pointed out, every time you call count you are iterating over the whole string. Similarly, the integer indexing extensions are linear, so if you use them in a loop, you can easily create a quadratic or cubic-complexity algorithm by accident.
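
    To make the cost concrete, here is a small sketch in current Swift (both function names are made up for this example). Each function counts the spaces in a string. The first rebuilds an index from an integer offset on every iteration, which is a linear walk each time and makes the whole loop quadratic. The second walks the string's own indices and stays linear.

    ```swift
    // Quadratic: index(_:offsetBy:) walks from startIndex on every iteration.
    func countSpacesByOffset(_ text: String) -> Int {
        var total = 0
        for offset in 0..<text.count {              // text.count is itself a full scan
            let idx = text.index(text.startIndex, offsetBy: offset)
            if text[idx] == " " { total += 1 }
        }
        return total
    }

    // Linear: visit the string's native indices front to back.
    func countSpacesByIndex(_ text: String) -> Int {
        var total = 0
        for idx in text.indices where text[idx] == " " {
            total += 1
        }
        return total
    }
    ```

    Both return the same count; only the amount of work differs.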

    But in this case, there's no need to do all that work to convert string indices to random-access integers. Here is a version that I think is performing similar logic (look for a prefix, then look from there for a " character - ignoring that this doesn't cater for https, upper/lower case etc) using only native string indices:

    func extractAllLinks(text: String) -> [String] {
        var links: [String] = []
        let prefix = "http://"
    
        for var idx = text.startIndex; idx != text.endIndex; ++idx {
            let candidate = text[idx..<text.endIndex]
            if candidate.hasPrefix(prefix),
               let closingQuote = find(candidate, "\"") {
                let link = candidate[candidate.startIndex..<closingQuote]
                links.append(link)
                idx = advance(idx, count(link))
            }
        }
        return links
    }
    
    let text = "This contains the link \"http://www.whatever.com/\" and"
             + " the link \"http://google.com\""
    
    extractAllLinks(text)
    

    Even this could be further optimized (the advance(idx, count()) is a little inefficient) if there were other helpers such as findFromIndex etc. or a willingness to do without string slices and hand-roll the search for the end character.
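
    For what it's worth, current Swift has the missing helpers, so the same logic can be sketched without advance or count. This is an illustrative rewrite under that assumption, not the original code, and the name extractQuotedLinks is invented for the example. It relies on Substring sharing its indices with the parent string, so the position of the closing quote can be used directly to jump ahead.

    ```swift
    // Sketch: same logic as above (find "http://", then read up to the next
    // double quote), using only native string indices.
    func extractQuotedLinks(from text: String) -> [String] {
        var links: [String] = []
        let prefix = "http://"
        var idx = text.startIndex

        while idx < text.endIndex {
            let candidate = text[idx...]            // Substring, shares storage with text
            if candidate.hasPrefix(prefix),
               let closingQuote = candidate.firstIndex(of: "\"") {
                links.append(String(candidate[..<closingQuote]))
                idx = closingQuote                  // Substring indices are valid in text
            }
            idx = text.index(after: idx)            // step past the quote or current character
        }
        return links
    }

    let linkText = "This contains the link \"http://www.whatever.com/\" and"
                 + " the link \"http://google.com\""
    print(extractQuotedLinks(from: linkText))
    ```

    As before, this ignores https and letter case, and it is not a substitute for a data detector or a real parser.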
