How do I decode HTML entities in Swift?

一生所求 2020-11-22 01:47

I am pulling a JSON file from a site and one of the strings received is:

The Weeknd &#8216;King Of The Fall&#8217;
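
To reproduce the situation, here is a minimal sketch. It assumes a payload shaped like the `a[0]["title"]` access mentioned in one of the answers; the `Item` type and the JSON literal are made up for illustration. It shows that JSON decoding by itself leaves HTML entities untouched, so decoding them has to be a separate step.

```swift
import Foundation

// Hypothetical model: only the `title` field matters here.
struct Item: Decodable {
    let title: String
}

// Hypothetical response body containing the string from the question.
let json = #"[{"title": "The Weeknd &#8216;King Of The Fall&#8217;"}]"#

// JSON parsing resolves JSON escapes (such as \u2018) but knows nothing
// about HTML entities, so they survive as literal text.
let items = (try? JSONDecoder().decode([Item].self, from: Data(json.utf8))) ?? []
let title = items.first?.title ?? ""

print(title)
// The Weeknd &#8216;King Of The Fall&#8217;
```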


        
23 Answers
  • 2020-11-22 02:44

    Use:

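    // Replace `receivedData` with the Data (NSData) value you received.
    let dataRes: Data = receivedData

    // Interpret the bytes as UTF-8 text. Note that this alone leaves
    // entities such as &amp; or &#8216; undecoded.
    let resString = String(data: dataRes, encoding: .utf8)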
    
  • 2020-11-22 02:45

    Have a look at HTMLString, a library written in Swift that lets your program add and remove HTML entities in strings.

    For completeness, I copied the main features from the site:

    • Adds entities for ASCII and UTF-8/UTF-16 encodings
    • Removes more than 2100 named entities (like &amp;)
    • Supports removing decimal and hexadecimal entities
    • Designed to support Swift Extended Grapheme Clusters (→ 100% emoji-proof)
    • Fully unit tested
    • Fast
    • Documented
    • Compatible with Objective-C
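
    To make the "adds entities" direction concrete, here is a rough standalone sketch of ASCII-style escaping. It is not HTMLString's implementation or API; the function name `escapeToASCIIEntities` is hypothetical, and it works per Unicode scalar, so a multi-scalar emoji simply becomes several numeric entities.

```swift
/// Hypothetical helper, not part of HTMLString: escapes the markup-significant
/// characters by name and every non-ASCII scalar as a decimal entity.
func escapeToASCIIEntities(_ input: String) -> String {
    var result = ""
    for scalar in input.unicodeScalars {
        switch scalar {
        case "&":  result += "&amp;"
        case "<":  result += "&lt;"
        case ">":  result += "&gt;"
        case "\"": result += "&quot;"
        case "'":  result += "&#39;"
        default:
            if scalar.isASCII {
                // Plain ASCII passes through unchanged.
                result.unicodeScalars.append(scalar)
            } else {
                // Everything else becomes &#<decimal code point>;
                result += "&#\(scalar.value);"
            }
        }
    }
    return result
}

print(escapeToASCIIEntities("4 < 5 & 12 €"))
// 4 &lt; 5 &amp; 12 &#8364;
```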
  • 2020-11-22 02:46

    @akashivskyy's answer is great and demonstrates how to utilize NSAttributedString to decode HTML entities. One possible disadvantage (as he stated) is that all HTML markup is removed as well, so

    <strong> 4 &lt; 5 &amp; 3 &gt; 2</strong>
    

    becomes

    4 < 5 & 3 > 2
    

    On OS X there is CFXMLCreateStringByUnescapingEntities() which does the job:

    let encoded = "<strong> 4 &lt; 5 &amp; 3 &gt; 2 .</strong> Price: 12 &#x20ac;.  &#64; "
    let decoded = CFXMLCreateStringByUnescapingEntities(nil, encoded as CFString, nil) as String
    print(decoded)
    // <strong> 4 < 5 & 3 > 2 .</strong> Price: 12 €.  @ 
    

    but this is not available on iOS.

    Here is a pure Swift implementation. It decodes character entity references like &lt; using a dictionary, and all numeric character entities like &#64; or &#x20ac;. (Note that I did not list all 252 HTML entities explicitly.)

    Swift 4:

    // Mapping from XML/HTML character entity reference to character
    // From http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
    private let characterEntities : [ Substring : Character ] = [
        // XML predefined entities:
        "&quot;"    : "\"",
        "&amp;"     : "&",
        "&apos;"    : "'",
        "&lt;"      : "<",
        "&gt;"      : ">",
    
        // HTML character entity references:
        "&nbsp;"    : "\u{00a0}",
        // ...
        "&diams;"   : "♦",
    ]
    
    extension String {
    
        /// Returns a new string made by replacing in the `String`
        /// all HTML character entity references with the corresponding
        /// character.
        var stringByDecodingHTMLEntities : String {
    
            // ===== Utility functions =====
    
            // Convert the number in the string to the corresponding
            // Unicode character, e.g.
            //    decodeNumeric("64", 10)   --> "@"
            //    decodeNumeric("20ac", 16) --> "€"
            func decodeNumeric(_ string : Substring, base : Int) -> Character? {
                guard let code = UInt32(string, radix: base),
                    let uniScalar = UnicodeScalar(code) else { return nil }
                return Character(uniScalar)
            }
    
            // Decode the HTML character entity to the corresponding
            // Unicode character, return `nil` for invalid input.
            //     decode("&#64;")    --> "@"
            //     decode("&#x20ac;") --> "€"
            //     decode("&lt;")     --> "<"
            //     decode("&foo;")    --> nil
            func decode(_ entity : Substring) -> Character? {
    
                if entity.hasPrefix("&#x") || entity.hasPrefix("&#X") {
                    return decodeNumeric(entity.dropFirst(3).dropLast(), base: 16)
                } else if entity.hasPrefix("&#") {
                    return decodeNumeric(entity.dropFirst(2).dropLast(), base: 10)
                } else {
                    return characterEntities[entity]
                }
            }
    
            // ===== Method starts here =====
    
            var result = ""
            var position = startIndex
    
            // Find the next '&' and copy the characters preceding it to `result`:
            while let ampRange = self[position...].range(of: "&") {
                result.append(contentsOf: self[position ..< ampRange.lowerBound])
                position = ampRange.lowerBound
    
                // Find the next ';' and copy everything from '&' to ';' into `entity`
                guard let semiRange = self[position...].range(of: ";") else {
                    // No matching ';'.
                    break
                }
                let entity = self[position ..< semiRange.upperBound]
                position = semiRange.upperBound
    
                if let decoded = decode(entity) {
                    // Replace by decoded character:
                    result.append(decoded)
                } else {
                    // Invalid entity, copy verbatim:
                    result.append(contentsOf: entity)
                }
            }
            // Copy remaining characters to `result`:
            result.append(contentsOf: self[position...])
            return result
        }
    }
    

    Example:

    let encoded = "<strong> 4 &lt; 5 &amp; 3 &gt; 2 .</strong> Price: 12 &#x20ac;.  &#64; "
    let decoded = encoded.stringByDecodingHTMLEntities
    print(decoded)
    // <strong> 4 < 5 & 3 > 2 .</strong> Price: 12 €.  @
    

    Swift 3:

    // Mapping from XML/HTML character entity reference to character
    // From http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
    private let characterEntities : [ String : Character ] = [
        // XML predefined entities:
        "&quot;"    : "\"",
        "&amp;"     : "&",
        "&apos;"    : "'",
        "&lt;"      : "<",
        "&gt;"      : ">",
    
        // HTML character entity references:
        "&nbsp;"    : "\u{00a0}",
        // ...
        "&diams;"   : "♦",
    ]
    
    extension String {
    
        /// Returns a new string made by replacing in the `String`
        /// all HTML character entity references with the corresponding
        /// character.
        var stringByDecodingHTMLEntities : String {
    
            // ===== Utility functions =====
    
            // Convert the number in the string to the corresponding
            // Unicode character, e.g.
            //    decodeNumeric("64", 10)   --> "@"
            //    decodeNumeric("20ac", 16) --> "€"
            func decodeNumeric(_ string : String, base : Int) -> Character? {
                guard let code = UInt32(string, radix: base),
                    let uniScalar = UnicodeScalar(code) else { return nil }
                return Character(uniScalar)
            }
    
            // Decode the HTML character entity to the corresponding
            // Unicode character, return `nil` for invalid input.
            //     decode("&#64;")    --> "@"
            //     decode("&#x20ac;") --> "€"
            //     decode("&lt;")     --> "<"
            //     decode("&foo;")    --> nil
            func decode(_ entity : String) -> Character? {
    
                if entity.hasPrefix("&#x") || entity.hasPrefix("&#X"){
                    return decodeNumeric(entity.substring(with: entity.index(entity.startIndex, offsetBy: 3) ..< entity.index(entity.endIndex, offsetBy: -1)), base: 16)
                } else if entity.hasPrefix("&#") {
                    return decodeNumeric(entity.substring(with: entity.index(entity.startIndex, offsetBy: 2) ..< entity.index(entity.endIndex, offsetBy: -1)), base: 10)
                } else {
                    return characterEntities[entity]
                }
            }
    
            // ===== Method starts here =====
    
            var result = ""
            var position = startIndex
    
            // Find the next '&' and copy the characters preceding it to `result`:
            while let ampRange = self.range(of: "&", range: position ..< endIndex) {
                result.append(self[position ..< ampRange.lowerBound])
                position = ampRange.lowerBound
    
                // Find the next ';' and copy everything from '&' to ';' into `entity`
                if let semiRange = self.range(of: ";", range: position ..< endIndex) {
                    let entity = self[position ..< semiRange.upperBound]
                    position = semiRange.upperBound
    
                    if let decoded = decode(entity) {
                        // Replace by decoded character:
                        result.append(decoded)
                    } else {
                        // Invalid entity, copy verbatim:
                        result.append(entity)
                    }
                } else {
                    // No matching ';'.
                    break
                }
            }
            // Copy remaining characters to `result`:
            result.append(self[position ..< endIndex])
            return result
        }
    }
    

    Swift 2:

    // Mapping from XML/HTML character entity reference to character
    // From http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
    private let characterEntities : [ String : Character ] = [
        // XML predefined entities:
        "&quot;"    : "\"",
        "&amp;"     : "&",
        "&apos;"    : "'",
        "&lt;"      : "<",
        "&gt;"      : ">",
    
        // HTML character entity references:
        "&nbsp;"    : "\u{00a0}",
        // ...
        "&diams;"   : "♦",
    ]
    
    extension String {
    
        /// Returns a new string made by replacing in the `String`
        /// all HTML character entity references with the corresponding
        /// character.
        var stringByDecodingHTMLEntities : String {
    
            // ===== Utility functions =====
    
            // Convert the number in the string to the corresponding
            // Unicode character, e.g.
            //    decodeNumeric("64", 10)   --> "@"
            //    decodeNumeric("20ac", 16) --> "€"
            func decodeNumeric(string : String, base : Int32) -> Character? {
                let code = UInt32(strtoul(string, nil, base))
                return Character(UnicodeScalar(code))
            }
    
            // Decode the HTML character entity to the corresponding
            // Unicode character, return `nil` for invalid input.
            //     decode("&#64;")    --> "@"
            //     decode("&#x20ac;") --> "€"
            //     decode("&lt;")     --> "<"
            //     decode("&foo;")    --> nil
            func decode(entity : String) -> Character? {
    
                if entity.hasPrefix("&#x") || entity.hasPrefix("&#X"){
                    return decodeNumeric(entity.substringFromIndex(entity.startIndex.advancedBy(3)), base: 16)
                } else if entity.hasPrefix("&#") {
                    return decodeNumeric(entity.substringFromIndex(entity.startIndex.advancedBy(2)), base: 10)
                } else {
                    return characterEntities[entity]
                }
            }
    
            // ===== Method starts here =====
    
            var result = ""
            var position = startIndex
    
            // Find the next '&' and copy the characters preceding it to `result`:
            while let ampRange = self.rangeOfString("&", range: position ..< endIndex) {
                result.appendContentsOf(self[position ..< ampRange.startIndex])
                position = ampRange.startIndex
    
                // Find the next ';' and copy everything from '&' to ';' into `entity`
                if let semiRange = self.rangeOfString(";", range: position ..< endIndex) {
                    let entity = self[position ..< semiRange.endIndex]
                    position = semiRange.endIndex
    
                    if let decoded = decode(entity) {
                        // Replace by decoded character:
                        result.append(decoded)
                    } else {
                        // Invalid entity, copy verbatim:
                        result.appendContentsOf(entity)
                    }
                } else {
                    // No matching ';'.
                    break
                }
            }
            // Copy remaining characters to `result`:
            result.appendContentsOf(self[position ..< endIndex])
            return result
        }
    }
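
    One limitation of the scanner above: it pairs each '&' with the next ';' wherever that occurs, so in "AT&T &amp; Co" the candidate entity becomes "&T &amp;" and the real &amp; is copied verbatim. A possible variant, sketched here for Swift 5 with NSRegularExpression, only considers well-formed references. The names `namedEntities`, `entityRegex`, `decodeEntityBody` and `decodeWellFormedEntities` are made up for this sketch, and the entity table is deliberately tiny.

```swift
import Foundation

// Illustrative subset only; a real table would list many more names.
let namedEntities: [String: Character] = [
    "quot": "\"", "amp": "&", "apos": "'", "lt": "<", "gt": ">", "nbsp": "\u{00a0}",
]

// Matches &name;  &#123;  &#x1F;  and captures the part between '&' and ';'.
let entityRegex = try! NSRegularExpression(
    pattern: "&(#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")

// Turns "lt", "#64" or "#x20ac" into a character, or nil if unknown or invalid.
func decodeEntityBody(_ body: Substring) -> Character? {
    let digits: Substring
    let radix: Int
    if body.hasPrefix("#x") || body.hasPrefix("#X") {
        digits = body.dropFirst(2)
        radix = 16
    } else if body.hasPrefix("#") {
        digits = body.dropFirst()
        radix = 10
    } else {
        return namedEntities[String(body)]
    }
    guard let code = UInt32(digits, radix: radix),
          let scalar = Unicode.Scalar(code) else { return nil }
    return Character(scalar)
}

func decodeWellFormedEntities(in input: String) -> String {
    var result = ""
    var position = input.startIndex
    let fullRange = NSRange(input.startIndex..., in: input)

    for match in entityRegex.matches(in: input, range: fullRange) {
        guard let whole = Range(match.range, in: input),
              let body = Range(match.range(at: 1), in: input) else { continue }

        // Copy the text between the previous match and this one.
        result.append(contentsOf: input[position ..< whole.lowerBound])

        if let character = decodeEntityBody(input[body]) {
            result.append(character)
        } else {
            // Unknown name or invalid code point: keep the text as-is.
            result.append(contentsOf: input[whole])
        }
        position = whole.upperBound
    }
    // Copy whatever follows the last match.
    result.append(contentsOf: input[position...])
    return result
}

print(decodeWellFormedEntities(in: "AT&T &amp; Co &#64; &#x20ac; &foo;"))
// AT&T & Co @ € &foo;
```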
    
  • 2020-11-22 02:47

    Swift 4

    extension String {
        var replacingHTMLEntities: String? {
            do {
                return try NSAttributedString(data: Data(utf8), options: [
                    .documentType: NSAttributedString.DocumentType.html,
                    .characterEncoding: String.Encoding.utf8.rawValue
                ], documentAttributes: nil).string
            } catch {
                return nil
            }
        }
    }
    

    Simple Usage

    let clean = "Weeknd &#8216;King Of The Fall&#8217;".replacingHTMLEntities ?? "default value"
    
  • 2020-11-22 02:48

    This answer was last revised for Swift 5.2 and iOS 13.4 SDK.


    There's no straightforward way to do that, but you can use NSAttributedString magic to make this process as painless as possible (be warned that this method will strip all HTML tags as well).

    Remember to initialize NSAttributedString from the main thread only. It uses WebKit to parse HTML underneath, hence the requirement.

    // This is a[0]["title"] in your case
    let htmlEncodedString = "The Weeknd <em>&#8216;King Of The Fall&#8217;</em>"
    
    guard let data = htmlEncodedString.data(using: .utf8) else {
        return
    }
    
    let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
        .documentType: NSAttributedString.DocumentType.html,
        .characterEncoding: String.Encoding.utf8.rawValue
    ]
    
    guard let attributedString = try? NSAttributedString(data: data, options: options, documentAttributes: nil) else {
        return
    }
    
    // The Weeknd ‘King Of The Fall’
    let decodedString = attributedString.string
    
    extension String {
    
        init?(htmlEncodedString: String) {
    
            guard let data = htmlEncodedString.data(using: .utf8) else {
                return nil
            }
    
            let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
                .documentType: NSAttributedString.DocumentType.html,
                .characterEncoding: String.Encoding.utf8.rawValue
            ]
    
            guard let attributedString = try? NSAttributedString(data: data, options: options, documentAttributes: nil) else {
                return nil
            }
    
            self.init(attributedString.string)
    
        }
    
    }
    
    let encodedString = "The Weeknd <em>&#8216;King Of The Fall&#8217;</em>"
    let decodedString = String(htmlEncodedString: encodedString)
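
    Since the conversion has to run on the main thread, a small wrapper can make that harder to get wrong. This is only a sketch: `onMainThread` and `decodeHTMLOnMainThread` are hypothetical names, and `DispatchQueue.main.sync` blocks the calling thread, so it should not be used where the main thread might itself be waiting on the caller.

```swift
import Foundation
#if canImport(UIKit)
import UIKit
#elseif canImport(AppKit)
import AppKit
#endif

/// Hypothetical helper: runs `work` on the main thread and returns its result.
/// Calling `DispatchQueue.main.sync` while already on the main thread would
/// deadlock, hence the `Thread.isMainThread` check.
func onMainThread<T>(_ work: () -> T) -> T {
    if Thread.isMainThread {
        return work()
    }
    return DispatchQueue.main.sync(execute: work)
}

#if canImport(UIKit) || canImport(AppKit)
/// Hypothetical wrapper around the NSAttributedString approach shown above.
/// Like that approach, it strips HTML tags as well as decoding entities.
func decodeHTMLOnMainThread(_ html: String) -> String? {
    let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
        .documentType: NSAttributedString.DocumentType.html,
        .characterEncoding: String.Encoding.utf8.rawValue
    ]
    return onMainThread {
        try? NSAttributedString(data: Data(html.utf8), options: options, documentAttributes: nil).string
    }
}
#endif
```

    From a background queue this would be called as `decodeHTMLOnMainThread(encodedString)`; on the main thread it runs the conversion directly.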
    
  • 2020-11-22 02:48
    import UIKit   // use AppKit on macOS

    extension String {
        func decodeEnt() -> String {
            let encodedData = Data(self.utf8)
            let attributedOptions: [NSAttributedString.DocumentReadingOptionKey: Any] = [
                .documentType: NSAttributedString.DocumentType.html,
                .characterEncoding: String.Encoding.utf8.rawValue
            ]
            // Fall back to the undecoded string if the HTML cannot be parsed
            guard let attributedString = try? NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil) else {
                return self
            }

            return attributedString.string
        }
    }
    
    let encodedString = "The Weeknd &#8216;King Of The Fall&#8217;"
    
    let foo = encodedString.decodeEnt() /* The Weeknd ‘King Of The Fall’ */
    