Using Swift, how do you re-encode then decode a String like this short script in Python?

Submitted by 穿精又带淫゛_ on 2021-01-28 19:36:34

Question


The XKCD API has an encoding problem: non-ASCII characters in the alt text come back garbled.

Minor encoding issue with xkcd alt texts in chat

The solution (in Python) is to encode it as latin1 then decode as utf8, but how do I do this in Swift?

Test string:

"Be careful\u00e2\u0080\u0094it's breeding season"

Expected output:

Be careful—it's breeding season

Python (from above link):

import json
a = '''"Be careful\u00e2\u0080\u0094it's breeding season"'''
print(json.loads(a).encode('latin1').decode('utf8'))

How is this done in Swift?

let strdata = "Be careful\\u00e2\\u0080\\u0094it's breeding season".data(using: .isoLatin1)!
let str = String(data: strdata, encoding: .utf8)

That doesn't work!


Answer 1:


You have to decode the JSON data first, then extract the string, and finally “fix” the string. Here is a self-contained example with the JSON from https://xkcd.com/1814/info.0.json:

let data = """
    {"month": "3", "num": 1814, "link": "", "year": "2017", "news": "",
    "safe_title": "Color Pattern", "transcript": "",
    "alt": "\\u00e2\\u0099\\u00ab When the spacing is tight / And the difference is slight / That's a moir\\u00c3\\u00a9 \\u00e2\\u0099\\u00ab",
    "img": "https://imgs.xkcd.com/comics/color_pattern.png",
    "title": "Color Pattern", "day": "22"}
""".data(using: .utf8)!

// Alternatively:
// let url = URL(string: "https://xkcd.com/1814/info.0.json")!
// let data = try! Data(contentsOf: url)

do {
    if let dict = (try JSONSerialization.jsonObject(with: data, options: [])) as? [String: Any],
        var alt = dict["alt"] as? String {

        // Now try to fix the "alt" string
        if let isoData = alt.data(using: .isoLatin1),
            let altFixed = String(data: isoData, encoding: .utf8) {
            alt = altFixed
        }

        print(alt)
        // ♫ When the spacing is tight / And the difference is slight / That's a moiré ♫
    }
} catch {
    print(error)
}
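
The "fix" step works because each byte of the original UTF-8 text was decoded as one Latin-1 character. Encoding back to Latin-1 recovers those bytes, which can then be decoded as UTF-8. Here is a minimal sketch of that step as a reusable function. The name `fixingLatin1Mojibake` is hypothetical and not part of the answer, and returning the input unchanged when either conversion fails is an assumption about the desired behaviour.

```swift
import Foundation

// Hypothetical helper (not from the original answer): reinterpret a string
// whose UTF-8 bytes were mistakenly decoded as Latin-1.
func fixingLatin1Mojibake(_ s: String) -> String {
    // Step 1: map every character back to a single Latin-1 byte.
    // Step 2: decode those bytes as UTF-8.
    guard let bytes = s.data(using: .isoLatin1),
          let fixed = String(data: bytes, encoding: .utf8) else {
        return s  // assumption: leave the input untouched if a step fails
    }
    return fixed
}

// U+00C3 U+00A9 are the two UTF-8 bytes of "é" read as Latin-1.
let broken = "moir\u{00c3}\u{00a9}"
let repaired = fixingLatin1Mojibake(broken)
print(repaired)  // moiré
```

With this fallback, a string that was never garbled, such as "café", usually passes through as it is, because its Latin-1 bytes do not form valid UTF-8.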

If you have just a string of the form

Be careful\u00e2\u0080\u0094it's breeding season

then you can still use JSONSerialization to decode the \uNNNN escape sequences, and then continue as above.

A simple example (error checking omitted for brevity):

let strbad = "Be careful\\u00e2\\u0080\\u0094it's breeding season"
let decoded = try! JSONSerialization.jsonObject(with: Data("\"\(strbad)\"".utf8), options: .allowFragments) as! String
let strgood = String(data: decoded.data(using: .isoLatin1)!, encoding: .utf8)!
print(strgood)
// Be careful—it's breeding season
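
The short version above force-unwraps every step, so it crashes on malformed input. The sketch below wraps the same two steps in a function that returns nil instead. The function name `decodingEscapedMojibake` is made up for illustration, and it assumes the input contains no unescaped double quotes or control characters, since the text is pasted into a JSON string literal as is.

```swift
import Foundation

// Hypothetical helper: decode \uNNNN escapes with JSONSerialization,
// then redo the Latin-1 -> UTF-8 step from the answer.
func decodingEscapedMojibake(_ raw: String) -> String? {
    // Wrap the text in quotes so it parses as a JSON string fragment.
    let json = Data("\"\(raw)\"".utf8)
    guard let object = try? JSONSerialization.jsonObject(with: json, options: .allowFragments),
          let decoded = object as? String else {
        return nil  // not parseable as a JSON string
    }
    // If the re-encoding step fails, fall back to the decoded escapes
    // (an assumption; returning nil here would be equally reasonable).
    guard let latin1 = decoded.data(using: .isoLatin1),
          let fixed = String(data: latin1, encoding: .utf8) else {
        return decoded
    }
    return fixed
}

let escaped = "Be careful\\u00e2\\u0080\\u0094it's breeding season"
let result = decodingEscapedMojibake(escaped)
print(result ?? "could not decode")
```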



Answer 2:


I couldn't find anything built in, but I did manage to write this for you.

extension String {
    func range(nsRange: NSRange) -> Range<Index> {
        return Range(nsRange, in: self)!
    }

    func nsRange(range: Range<Index>) -> NSRange {
        return NSRange(range, in: self)
    }

    var fullRange: Range<Index> {
        return startIndex..<endIndex
    }

    var fullNSRange: NSRange {
        return nsRange(range: fullRange)
    }

    subscript(nsRange: NSRange) -> Substring {
        return self[range(nsRange: nsRange)]
    }

    func convertingUnicodeCharacters() -> String {
        var string = self
        // Characters need to be replaced in groups in case of clusters
        let groupedRegex = try! NSRegularExpression(pattern: "(\\\\u[0-9a-fA-F]{1,8})+")
        for match in groupedRegex.matches(in: string, range: string.fullNSRange).reversed() {
            let groupedHexValues = String(string[match.range])
            var characters = [Character]()
            let regex = try! NSRegularExpression(pattern: "\\\\u([0-9a-fA-F]{1,8})")
            for hexMatch in regex.matches(in: groupedHexValues, range: groupedHexValues.fullNSRange) {
                // The match ranges refer to groupedHexValues, so convert them against that string
                let hexString = groupedHexValues[Range(hexMatch.range(at: 1), in: groupedHexValues)!]
                if let hexValue = UInt32(hexString, radix: 16),
                    let scalar = UnicodeScalar(hexValue) {
                    characters.append(Character(scalar))
                }
            }
            string.replaceSubrange(Range(match.range, in: string)!, with: characters)
        }
        return string
    }
}

It basically looks for any \u<1-8 digit hex> values and converts them into scalars. Should be fairly straightforward... 🧐 I've tried to test it a fair bit, but I'm not sure it catches every edge case.
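
The conversion at the heart of the extension can be shown in isolation: parse the hex digits as a `UInt32`, then build a `Unicode.Scalar` from the value, which fails for anything that is not a valid scalar (surrogates, or values above U+10FFFF). The helper name `character(fromHex:)` is hypothetical and used only for this sketch.

```swift
// Hypothetical helper illustrating the hex -> scalar -> Character step.
func character(fromHex hex: String) -> Character? {
    guard let value = UInt32(hex, radix: 16),      // nil if not hexadecimal
          let scalar = Unicode.Scalar(value) else { // nil if not a valid scalar
        return nil
    }
    return Character(scalar)
}

let samples = ["65", "1F496", "D800", "xyz"]
for hex in samples {
    if let c = character(fromHex: hex) {
        print("\\u\(hex) -> \(c)")
    } else {
        print("\\u\(hex) is not a valid Unicode scalar")
    }
}
```

One edge case to keep in mind: because the pattern accepts 1 to 8 digits, an escape followed directly by hex-looking text, such as `\u00e9a`, is read as the single value 0xE9A rather than as `é` followed by `a`.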

My playground testing code was simply:

let string = "Be careful\\u00e2\\u0080\\u0094-\\u1F496\\u65\\u301it's breeding season"
let expected = "Be careful\u{00e2}\u{0080}\u{0094}-\u{1f496}\u{65}\u{301}it's breeding season"
string.convertingUnicodeCharacters() == expected // true 🎉
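
Note that the extension only turns the escapes into scalars, so for the question's test string the result still contains the three mis-decoded scalars U+00E2 U+0080 U+0094. To reach the expected output, the Latin-1 to UTF-8 step from the question is still needed. Below is a sketch of that step done directly on the scalars. `reinterpretingLatin1AsUTF8` is a made-up name, and the input literal stands in for what `convertingUnicodeCharacters()` is expected to return for that string (the extension itself is not repeated here).

```swift
import Foundation

// Hypothetical helper: treat each scalar as one Latin-1 byte,
// then decode the resulting bytes as UTF-8.
func reinterpretingLatin1AsUTF8(_ s: String) -> String? {
    var bytes = [UInt8]()
    for scalar in s.unicodeScalars {
        // Scalars above U+00FF have no Latin-1 byte, so give up.
        guard let byte = UInt8(exactly: scalar.value) else { return nil }
        bytes.append(byte)
    }
    // Returns nil if the bytes are not valid UTF-8.
    return String(bytes: bytes, encoding: .utf8)
}

// Stand-in for the extension's output on the question's test string.
let converted = "Be careful\u{00e2}\u{0080}\u{0094}it's breeding season"
let repairedText = reinterpretingLatin1AsUTF8(converted)
print(repairedText ?? "not Latin-1 encoded UTF-8")
```

The playground string above would return nil here, since U+1F496 does not fit in a single Latin-1 byte.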


Source: https://stackoverflow.com/questions/52387450/using-swift-how-do-you-re-encode-then-decode-a-string-like-this-short-script-in
