Swift 4 base64 String to Data not working due to String containing “incomplete” emoji

Question

I am coming from the post "Swift 4 JSON String with unknown UTF8 "�" character is not convertible to Data/ Dictionary", but in the meantime I was able to isolate the issue to a 10-character string.

Short intro: one user's app did not show any content. Looking at his 6 KB of data in plain text with TextWrangler, I found 2 red question marks.

I tried to cut some chunks of the base64-encoded data around the question marks and convert them to Data which didn't work. As soon as I removed the bits from the red question mark from the chunks it seemed to work again. Please take a look at my following Playground example:

//those do NOT work
let toEndBracket = "ACAAKgBVAFMAQQAqACAnlgAg2DwAIgB9AF0A" // *USA* ' <"}]//
let toMidBracket = "ACAAKgBVAFMAQQAqACAnlgAg2DwAIgB9"     // *USA* ' <"}//
let toCarrot =     "ACAAKgBVAFMAQQAqACAnlgAg2DwA"         // *USA* ' <//
let toSpace =      "ACAAKgBVAFMAQQAqACAnlgAg"             // *USA* ' //

//but this one WORKS
let toApostrophe = "ACAAKgBVAFMAQQAqACAn"                 // *USA* '//
//(basically the last one is without the space before the carrot, I've added the slashes after it to emphasize that)
//clear strings taken from https://www.base64decode.org/ using the UTF-8 setting WITHOUT "Live mode".

if let textData = Data(base64Encoded: toApostrophe) {
    print("Data created")   //works for all of them
    print(textData)
    if let decodedString = String(data: textData, encoding: .utf8) {
        print("WORKED!!!")  //only happens for the toApostrophe
        print(decodedString)
    } else {
        print("DID NOT WORK")
    }
}

So it basically fails as soon as it contains lgAg. Replacing this with something like U29t does make the small strings work again, but I can't do this in production code as I am sure my examples aren't the only occurrences of this issue. I don't care what happens to the original characters/symbols/emojis that are causing this. If there was a way to just "ignore" them, that would be more than helpful already!

Here is another example of where this occurs:

//OTHER SYMBOL WITH SAME BEHAVIOR
//not working
let secondFromSpace =  "ACDYPAAiACwA"       // <",//

//WORKING
let secondFromCarrot = "PAAiACwA"           //<",//

The original text in its habitat is a messenger message saying "USA" with an emoji, hence the "USA" in my example texts and my suspicion that it's the emojis that make it break.

I'd be grateful if someone can tell me how I can "clean up" the base64 string so it's convertible to Data again. It might also be due to some weird encoding with some of the emojis, but in the vast majority of cases the app receives and displays content with emojis just fine.

I have finally figured out why this is happening. It's not a Swift-side solution to my problem, but now it makes at least some sense. For previews of new content I cut off strings to match the viewport of the browser. This particular unlucky user had the USA flag emoji right at the edge of the display bezel. Never would I have thought of emojis consisting of multiple letters and of JavaScript's substring() decapitating them. Take a look at the picture; it explains where the character comes from.
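To make the mechanism concrete, here is a minimal Swift sketch of what such a cut does. It is an illustration only: the real truncation happened in JavaScript, and the variable names are made up for this example. JavaScript's substring() counts UTF-16 code units, and the flag is four of them, so a cut can leave a lone high surrogate (0xD83C, the same d8 3c that shows up in the failing data).

```swift
// The US flag is two regional-indicator scalars: U+1F1FA and U+1F1F8.
let flag = "\u{1F1FA}\u{1F1F8}"

// Each scalar needs a surrogate pair in UTF-16, so the flag is four code units.
let units = Array(flag.utf16)

// Cutting by code-unit offset (what substring() does) can split a pair.
// Keeping only the first unit leaves a lone high surrogate.
let cut = Array(units.prefix(1))

// Decoding the cut units repairs the broken half with U+FFFD.
let repaired = String(decoding: cut, as: UTF16.self)

// Cutting by Character (grapheme cluster) keeps the flag whole.
let safe = String(flag.prefix(1))

print(units.map { String($0, radix: 16) })
print(cut.map { String($0, radix: 16) })
print(repaired.unicodeScalars.map { String($0.value, radix: 16) })
print(safe == flag)
```

Truncating by grapheme cluster rather than by code unit avoids producing the broken input in the first place.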

I would still appreciate an answer as to how to avoid/ignore/catch that in Swift but to every poor soul running into this issue I hope you will stumble across this thread.

Answer 1:

(Some of this is out of comments, but trying to bring it together and describe solutions.)

First, your strings are not UTF-8. They're UTF-16 or malformed UTF-16. Sometimes UTF-16 happens to be interpretable as UTF-8, but when it is, there will be NULL characters scattered through the string. In your "working" example, it's not really working.

let toApostrophe = "ACAAKgBVAFMAQQAqACAn"                 // *USA* '//
if let textData = Data(base64Encoded: toApostrophe) {
    if let decodedString = String(data: textData, encoding: .utf8) {
        print(decodedString)
        print(decodedString.count)
        print(decodedString.map { $0.unicodeScalars.map { $0.value } } )
    } else {
        print("DID NOT DECODE UTF8")
    }
} else {
    print("DID NOT DECODE BASE64")
}

Prints:

 *USA* '
15
[[0], [32], [0], [42], [0], [85], [0], [83], [0], [65], [0], [42], [0], [32], [39]]

Note that the length of the string is 15 characters, not 8 like you were probably expecting. That's because it includes an extra invisible NULL (0) between most characters.

toEndBracket doesn't happen to be legal UTF-8, however. Here are its bytes:

["00", "20", "00", "2a", "00", "55", "00", "53", "00", "41", "00", "2a", "00", "20", "27", "96", "00", "20", "d8", "3c", "00", "22", "00", "7d", "00", "5d", "00"]
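A byte listing like the one above can be produced with a few lines of Foundation code. This is a sketch; hexDump(base64:) is a hypothetical helper name, not something from the original post.

```swift
import Foundation

// The string from the question that fails to decode as UTF-8.
let toEndBracket = "ACAAKgBVAFMAQQAqACAnlgAg2DwAIgB9AF0A"

// Hypothetical helper: decode base64 and render each byte as two hex digits.
// Returns nil if the input is not valid base64.
func hexDump(base64: String) -> [String]? {
    guard let data = Data(base64Encoded: base64) else { return nil }
    return data.map { String(format: "%02x", $0) }
}

if let dump = hexDump(base64: toEndBracket) {
    print(dump)
    print(dump.count)  // number of bytes; an odd count cannot be whole UTF-16 units
}
```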

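This is ok until it gets to 0x96. That byte starts with the bits 10, which marks a continuation byte, but the byte before it (0x27) is a complete single-byte character, so there is nothing for it to continue. That's why the failures begin as soon as lgAg is included. A little further on, 0xd8 starts with the bits 110, which indicates the start of a two-byte sequence. But the next byte is 0x3c, which is not a valid second byte of a multi-byte sequence (it should start with 10, but it starts with 00). So we can't decode this as UTF-8. Even using decodeCString(_:as:repairingInvalidCodeUnits:) can't decode this string because it's filled with embedded NULLs. You've got to decode it using at least the right encoding.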
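The bit-prefix rules can be checked byte by byte. The following is a rough sketch with a hypothetical helper, utf8Role(of:), that only looks at the leading bits. It does not do full UTF-8 validation (for example, it does not reject overlong lead bytes such as 0xC0).

```swift
// Hypothetical helper: classify a byte by its leading bits, per the UTF-8 layout.
func utf8Role(of byte: UInt8) -> String {
    switch byte {
    case 0x00...0x7F: return "single byte"          // 0xxxxxxx
    case 0x80...0xBF: return "continuation byte"    // 10xxxxxx
    case 0xC0...0xDF: return "lead of 2-byte seq"   // 110xxxxx
    case 0xE0...0xEF: return "lead of 3-byte seq"   // 1110xxxx
    default:          return "lead of 4-byte seq or invalid"  // 11110xxx and above
    }
}

// Bytes taken from around the trouble spot in the dump: 27 96 ... d8 3c
let sample: [UInt8] = [0x27, 0x96, 0xd8, 0x3c]
for byte in sample {
    print(String(byte, radix: 16), utf8Role(of: byte))
}
```

A continuation byte right after a single byte, or a lead byte followed by something that isn't a continuation byte, is enough to make the whole buffer invalid as UTF-8.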

But let's do that. Decode as UTF-16. At least that's close, even though it's slightly invalid UTF-16.

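// toEndBracket is the base64 string from the question.
let toEndBracketData = Data(base64Encoded: toEndBracket)!
let toEndBracket16 = String(data: toEndBracketData, encoding: .utf16)!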
// " *USA* ➖ �"}]"

Now we can at least work with this. It's invalid JSON, though, so we can strip the replacement character (U+FFFD) by filtering it out:

let legalJSON = String(toEndBracket16.filter { $0 != "\u{FFFD}" })
// " *USA* ➖ "}]"

I don't really recommend this approach. It's incredibly fragile and based on broken input. Fix the input. But in a world where you're trying to parse broken input, these are the tools.
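If you do have to go down this road, the steps can be pulled together into one function. This is a sketch under stated assumptions, not a definitive implementation: it assumes the payload is big-endian UTF-16 without a BOM (which is what the byte dump suggests), it silently drops a trailing odd byte, and lenientUTF16BEString(fromBase64:) is a made-up name. It builds the code units by hand and relies on String(decoding:as:), which replaces ill-formed sequences with U+FFFD.

```swift
import Foundation

// Hypothetical helper: base64 -> bytes -> big-endian UTF-16 units -> String,
// with unpaired surrogates repaired to U+FFFD and then removed.
func lenientUTF16BEString(fromBase64 base64: String) -> String? {
    guard let data = Data(base64Encoded: base64) else { return nil }
    let bytes = [UInt8](data)

    // Pair bytes up as big-endian code units (assumption: no BOM, BE order).
    var units: [UInt16] = []
    units.reserveCapacity(bytes.count / 2)
    var i = 0
    while i + 1 < bytes.count {
        units.append(UInt16(bytes[i]) << 8 | UInt16(bytes[i + 1]))
        i += 2
    }

    // Ill-formed sequences (such as a lone 0xD83C) become U+FFFD here.
    let repaired = String(decoding: units, as: UTF16.self)

    // Drop the replacement characters, scalar by scalar.
    var cleaned = String.UnicodeScalarView()
    for scalar in repaired.unicodeScalars where scalar != "\u{FFFD}" {
        cleaned.append(scalar)
    }
    return String(cleaned)
}

if let text = lenientUTF16BEString(fromBase64: "ACAAKgBVAFMAQQAqACAnlgAg2DwAIgB9AF0A") {
    print(text)
}
```

Note that this also removes any U+FFFD that was legitimately present in the original text, which is one more reason to fix the input instead.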

Source: https://stackoverflow.com/questions/52524382/swift-4-base64-string-to-data-not-working-due-to-string-containing-incomplete

Tags

swift

character-encoding

base64

iso