What is a safe way to turn streamed (utf8) data into a string?

梦如初夏 2021-01-06 09:03

Suppose I'm a server written in ObjC/Swift. The client is sending me a large amount of data, which is really one large UTF-8 encoded string. As the server, I have my NSInputSt

1 answer
  • 2021-01-06 09:50

    The tool you probably want to use here is UTF8. It will handle all the state issues for you. See "How to cast decrypted UInt8 to String?" for a simple example that you can likely adapt.

    The major concern in building up a string from UTF-8 data isn't composed characters, but rather multi-byte characters. "LATIN SMALL LETTER A" + "COMBINING GRAVE ACCENT" works fine even if you decode each of those characters separately. What doesn't work is taking the first byte of 你 (which is three bytes long in UTF-8), decoding it, and then appending the separately decoded remaining bytes. The UTF8 type will handle this for you, though. All you need to do is bridge your NSInputStream to a GeneratorType.
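
To make the difference concrete, here is a small sketch in current Swift syntax (not the Swift 2 syntax used in the rest of this answer). It uses String(decoding:as:), which replaces malformed input with U+FFFD; the variable names are only illustrative.

```swift
// 你 (U+4F60) is three bytes in UTF-8: E4 BD A0.
let bytes: [UInt8] = [0xE4, 0xBD, 0xA0]

// Pretend the network delivered the character split across two reads.
let firstChunk = Array(bytes[0..<1])
let secondChunk = Array(bytes[1...])

// Decoding each chunk on its own cannot work: neither chunk is valid UTF-8,
// so String(decoding:as:) substitutes U+FFFD replacement characters.
let piecewise = String(decoding: firstChunk, as: UTF8.self)
              + String(decoding: secondChunk, as: UTF8.self)

// Decoding the bytes together gives the intended character.
let whole = String(decoding: firstChunk + secondChunk, as: UTF8.self)

// Combining marks are different: each scalar is a complete UTF-8 sequence,
// so decoding them separately and appending still composes correctly.
let base = String(decoding: [0x61] as [UInt8], as: UTF8.self)          // "a"
let accent = String(decoding: [0xCC, 0x80] as [UInt8], as: UTF8.self)  // U+0300
let composed = base + accent

print(piecewise == "你", whole == "你", composed.count)
```

So the thing to protect against is a chunk boundary falling inside a multi-byte sequence, not a boundary falling between a base letter and its combining mark.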

    Here's a basic (not fully production-ready) example of what I'm talking about. First, we need a way to convert an NSInputStream into a generator. That's probably the hardest part:

    final class StreamGenerator {
        static let bufferSize = 1024
        let stream: NSInputStream
        var buffer = [UInt8](count: StreamGenerator.bufferSize, repeatedValue: 0)
        var buffGen = IndexingGenerator<ArraySlice<UInt8>>([])
    
        init(stream: NSInputStream) {
            self.stream = stream
            stream.open()
        }
    }
    
    extension StreamGenerator: GeneratorType {
        func next() -> UInt8? {
            // First see if we can feed from our buffer. Do this before looking at
            // the stream status, so buffered bytes are not dropped in case the
            // stream already reports .AtEnd.
            if let result = buffGen.next() {
                return result
            }
    
            // Check the stream status
            switch stream.streamStatus {
            case .NotOpen:
                assertionFailure("Cannot read unopened stream")
                return nil
            case .Writing:
                preconditionFailure("Impossible status")
            case .AtEnd, .Closed, .Error:
                return nil // FIXME: May want a closure to post errors
            case .Opening, .Open, .Reading:
                break
            }
    
            // Our buffer is empty. Block until there is at least one byte available.
            // Use count, not capacity: capacity may be larger than the number of
            // elements the array actually holds.
            let count = stream.read(&buffer, maxLength: buffer.count)
    
            if count <= 0 { // FIXME: Probably want a closure or something to handle error cases
                stream.close()
                return nil
            }
    
            buffGen = buffer.prefix(count).generate()
            return buffGen.next()
        }
    }
    

    Calls to next() can block here, so it should not be called on the main queue, but other than that, it's a standard Generator that spits out bytes. (This is also the piece that probably has lots of little corner cases that I'm not handling, so you want to think this through pretty carefully. Still, it's not that complicated.)
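
For readers on current Swift, where NSInputStream is InputStream and GeneratorType is IteratorProtocol, the same idea might look roughly like the sketch below. StreamByteIterator and its properties are hypothetical names, and the error handling is as minimal as in the original.

```swift
import Foundation

/// Hypothetical modern-Swift counterpart of StreamGenerator: pulls bytes
/// from an InputStream one at a time, refilling an internal buffer as needed.
final class StreamByteIterator: IteratorProtocol {
    private let stream: InputStream
    private var buffer = [UInt8](repeating: 0, count: 1024)
    private var filled = 0        // number of valid bytes currently in `buffer`
    private var position = 0      // index of the next byte to hand out
    private var finished = false  // set once the stream ended or failed

    init(stream: InputStream) {
        self.stream = stream
        if stream.streamStatus == .notOpen { stream.open() }
    }

    func next() -> UInt8? {
        // Drain buffered bytes first, so nothing is lost if the stream
        // has already reached its end.
        if position < filled {
            defer { position += 1 }
            return buffer[position]
        }
        if finished { return nil }

        // Buffer is empty: this read blocks until at least one byte arrives.
        let capacity = buffer.count
        let count = stream.read(&buffer, maxLength: capacity)
        guard count > 0 else {
            // 0 means end of stream; a negative value means an error
            // (details would be in stream.streamError).
            finished = true
            stream.close()
            return nil
        }
        filled = count
        position = 1
        return buffer[0]
    }
}

// Usage with an in-memory stream standing in for the network.
let payload = Data("álfa你好".utf8)
let byteIterator = StreamByteIterator(stream: InputStream(data: payload))
var received = [UInt8]()
while let byte = byteIterator.next() { received.append(byte) }
print(received == Array(payload))
```

As with the original, next() can block, so this belongs off the main queue when the stream is backed by a socket.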

    With that, creating a UTF-8 decoding generator is almost trivial:

    final class UnicodeScalarGenerator<ByteGenerator: GeneratorType where ByteGenerator.Element == UInt8> {
        var byteGenerator: ByteGenerator
        var utf8 = UTF8()
        init(byteGenerator: ByteGenerator) {
            self.byteGenerator = byteGenerator
        }
    }
    
    extension UnicodeScalarGenerator: GeneratorType {
        func next() -> UnicodeScalar? {
            switch utf8.decode(&byteGenerator) {
            case .Result(let scalar): return scalar
            case .EmptyInput: return nil
            case .Error: return nil // FIXME: Probably want a closure or something to handle error cases
            }
        }
    }
    

    You could of course trivially turn this into a CharacterGenerator instead (using Character(_:UnicodeScalar)).
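
In current Swift the decoder API is spelled slightly differently (decode(_:) returns .scalarValue, .emptyInput or .error). The sketch below uses a hypothetical ScalarIterator and feeds it bytes that arrive in several chunks, with 你 split across all three. Unlike the code above, it substitutes U+FFFD on malformed input instead of ending the iteration; that choice is an assumption, not something the original does.

```swift
/// Hypothetical modern-Swift counterpart of UnicodeScalarGenerator.
struct ScalarIterator<Bytes: IteratorProtocol>: IteratorProtocol
    where Bytes.Element == UInt8 {
    var bytes: Bytes
    var decoder = UTF8()   // one decoder instance is reused for the whole stream

    init(bytes: Bytes) { self.bytes = bytes }

    mutating func next() -> Unicode.Scalar? {
        switch decoder.decode(&bytes) {
        case .scalarValue(let scalar):
            return scalar
        case .emptyInput:
            return nil
        case .error:
            // Assumption: keep going and emit a replacement character.
            return "\u{FFFD}" as Unicode.Scalar
        }
    }
}

// "a", then 你 (E4 BD A0) split over three chunks, then "b".
let chunks: [[UInt8]] = [[0x61, 0xE4], [0xBD], [0xA0, 0x62]]
var scalars = ScalarIterator(bytes: chunks.joined().makeIterator())

var text = ""
while let scalar = scalars.next() {
    text.unicodeScalars.append(scalar)   // or Character(scalar) for a Character
}
print(text)
```

Because the decoder pulls bytes from a single iterator, it never sees the chunk boundaries at all.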

    The last problem is if you want to combine all combining marks, such that "LATIN SMALL LETTER A" followed by "COMBINING GRAVE ACCENT" would always be returned together (rather than as the two characters they are). That's actually a bit trickier than it sounds. First, you'd need to generate Strings, not Characters. And then you'd need a good way to know what all the combining characters are. That's certainly knowable, but I'm having a little trouble deriving a simple algorithm. There's no "combiningMarkCharacterSet" in Cocoa. I'm still thinking about it. Getting something that "mostly works" is easy, but I'm not sure yet how to build it so that it's correct for all of Unicode.
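
One possible way around needing a set of combining marks, at least in current Swift, is to lean on String itself: it already segments its contents into extended grapheme clusters (Character values). The sketch below appends scalars to a pending string and hands out a Character only once a second one starts to form. ClusterIterator is a hypothetical name, and the approach assumes that a cluster boundary, once it appears, is not undone by scalars that arrive later; I believe that holds for the Unicode segmentation rules, but treat it as an assumption.

```swift
/// Hypothetical iterator that groups Unicode scalars into Characters
/// (extended grapheme clusters) using String's own segmentation.
struct ClusterIterator<Scalars: IteratorProtocol>: IteratorProtocol
    where Scalars.Element == Unicode.Scalar {
    var scalars: Scalars
    var pending = ""   // scalars of the cluster that is still being assembled

    init(scalars: Scalars) { self.scalars = scalars }

    mutating func next() -> Character? {
        while let scalar = scalars.next() {
            pending.unicodeScalars.append(scalar)
            // A second Character showing up means the first one is complete.
            if pending.count > 1 {
                return pending.removeFirst()
            }
        }
        // Input is exhausted: flush whatever is left.
        return pending.isEmpty ? nil : pending.removeFirst()
    }
}

// "a" + COMBINING GRAVE ACCENT, then "b", then 你.
let input = "a\u{300}b\u{4F60}".unicodeScalars.makeIterator()
var clusters = ClusterIterator(scalars: input)
var characters = [Character]()
while let character = clusters.next() { characters.append(character) }
print(characters.map { $0.unicodeScalars.count })
```

The cost is latency: a character can only be emitted after the first scalar of the next one has been read, and counting the pending string on every scalar is not cheap for long Zalgo-style runs.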

    Here's a little sample program to try it out:

    let textPath = NSBundle.mainBundle().pathForResource("text.txt", ofType: nil)!
    let inputStream = NSInputStream(fileAtPath: textPath)!
    // No explicit open() here: StreamGenerator's initializer opens the stream.
    
    // Run off the main queue, because next() can block.
    dispatch_async(dispatch_get_global_queue(0, 0)) {
        let streamGen = StreamGenerator(stream: inputStream)
        let unicodeGen = UnicodeScalarGenerator(byteGenerator: streamGen)
        var string = ""
        for c in GeneratorSequence(unicodeGen) {
            print(c)
            string += String(c)
        }
        print(string)
    }
    

    And a little text to read:

    Here is some normalish álfa你好 text
    And some Zalgo i̝̲̲̗̹̼n͕͓̘v͇̠͈͕̻̹̫͡o̷͚͍̙͖ke̛̘̜̘͓̖̱̬ composed stuff
    And one more line with no newline
    

    (That second line is Zalgo text, i.e. letters with many stacked combining marks, which is nice for testing.)

    I haven't done any testing with this in a real blocking situation, like reading from the network, but it should work based on how NSInputStream works (i.e. it should block until there's at least one byte to read, but then should just fill the buffer with whatever's available).

    I've made all of this match GeneratorType so that it plugs into other things easily, but error handling might work better if you didn't use GeneratorType and instead created your own protocol with next() throws -> Self.Element instead. Throwing would make it easier to propagate errors up the stack, but would make it harder to plug into for...in loops.
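
A throwing variant could be sketched like this in current Swift syntax. ThrowingIterator, UTF8StreamError and decodeAll are hypothetical names, and returning an optional to signal the end of input is my assumption about how such a protocol would be shaped.

```swift
/// Hypothetical protocol: like IteratorProtocol, but next() may throw.
protocol ThrowingIterator {
    associatedtype Element
    mutating func next() throws -> Element?
}

enum UTF8StreamError: Error {
    case malformedSequence
}

struct ThrowingScalarIterator<Bytes: IteratorProtocol>: ThrowingIterator
    where Bytes.Element == UInt8 {
    var bytes: Bytes
    var decoder = UTF8()

    init(bytes: Bytes) { self.bytes = bytes }

    mutating func next() throws -> Unicode.Scalar? {
        switch decoder.decode(&bytes) {
        case .scalarValue(let scalar): return scalar
        case .emptyInput: return nil
        case .error: throw UTF8StreamError.malformedSequence
        }
    }
}

/// Drains an iterator into a String, propagating the first error.
/// A plain for...in loop is not available here, hence the while loop.
func decodeAll<I: ThrowingIterator>(_ iterator: I) throws -> String
    where I.Element == Unicode.Scalar {
    var iterator = iterator
    var result = ""
    while let scalar = try iterator.next() {
        result.unicodeScalars.append(scalar)
    }
    return result
}

let valid = ThrowingScalarIterator(bytes: Array("ok".utf8).makeIterator())
let invalid = ThrowingScalarIterator(bytes: ([0x6F, 0xFF] as [UInt8]).makeIterator())
let goodResult = try? decodeAll(valid)     // decodes normally
let badResult = try? decodeAll(invalid)    // 0xFF is never valid in UTF-8
```

The same pattern would apply to the byte-level iterator, where a failed read could throw the stream's error instead of returning nil.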
