Suppose I'm a server written in objc/swift. The client is sending me a large amount of data, which is really a large UTF-8 encoded string. As the server, I have my NSInputSt
The tool you probably want to use here is UTF8. It will handle all the state issues for you. See "How to cast decrypted UInt8 to String?" for a simple example that you can likely adapt.
The major concern in building up a string from UTF-8 data isn't composed characters, but rather multi-byte characters. "LATIN SMALL LETTER A" + "COMBINING GRAVE ACCENT" works fine even if you decode each of those characters separately. What doesn't work is gathering the first byte of 你, decoding it, and then appending the decoded second byte. The UTF8 type will handle this for you, though. All you need to do is bridge your NSInputStream to a GeneratorType.
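To make that distinction concrete, here's a small sketch. It's written in current Swift rather than the Swift 2 used in the rest of this answer, and it relies on `String(decoding:as:)`, which repairs invalid input with U+FFFD. It decodes the three bytes of 你 together and then one at a time, and then decodes "a" and the combining grave accent separately.

```swift
// UTF-8 for 你 (U+4F60) is three bytes: E4 BD A0.
let ni: [UInt8] = [0xE4, 0xBD, 0xA0]

// Decoding the complete sequence works.
let whole = String(decoding: ni, as: UTF8.self)

// Decoding each byte on its own cannot work: none of these bytes is valid
// UTF-8 by itself, so each piece is repaired to U+FFFD (REPLACEMENT CHARACTER).
let piecewise = ni.map { String(decoding: [$0], as: UTF8.self) }.joined()

// Combining marks are different: each scalar is complete on its own,
// so decoding them separately and appending still composes correctly.
var accented = String(decoding: [0x61] as [UInt8], as: UTF8.self)     // "a"
accented += String(decoding: [0xCC, 0x80] as [UInt8], as: UTF8.self)  // U+0300

print(whole, piecewise, accented, accented.count)
```

The appended string ends up as a single `Character`, while the piecewise decode of 你 never recovers the original.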
Here's a basic (not fully production-ready) example of what I'm talking about. First, we need a way to convert an NSInputStream into a generator. That's probably the hardest part:
    final class StreamGenerator {
        static let bufferSize = 1024
        let stream: NSInputStream
        var buffer = [UInt8](count: StreamGenerator.bufferSize, repeatedValue: 0)
        var buffGen = IndexingGenerator<ArraySlice<UInt8>>([])

        init(stream: NSInputStream) {
            self.stream = stream
            stream.open()
        }
    }

    extension StreamGenerator: GeneratorType {
        func next() -> UInt8? {
            // First see if we can feed from our buffer. Doing this before the
            // status check means bytes we already read are never dropped,
            // whatever state the stream is in now.
            if let result = buffGen.next() {
                return result
            }

            // Check the stream status
            switch stream.streamStatus {
            case .NotOpen:
                assertionFailure("Cannot read unopened stream")
                return nil
            case .Writing:
                preconditionFailure("Impossible status")
            case .AtEnd, .Closed, .Error:
                return nil // FIXME: May want a closure to post errors
            case .Opening, .Open, .Reading:
                break
            }

            // Our buffer is empty. Block until there is at least one byte available.
            // Use count, not capacity: capacity may exceed the initialized storage.
            let count = stream.read(&buffer, maxLength: buffer.count)
            if count <= 0 { // FIXME: Probably want a closure or something to handle error cases
                stream.close()
                return nil
            }

            buffGen = buffer.prefix(count).generate()
            return buffGen.next()
        }
    }
Calls to next() can block here, so it should not be called on the main queue, but other than that, it's a standard Generator that spits out bytes. (This is also the piece that probably has lots of little corner cases that I'm not handling, so you want to think this through pretty carefully. Still, it's not that complicated.)
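For readers on a current toolchain, the same idea might look roughly like the following. This is only a sketch under assumptions, not a drop-in replacement: it uses today's names (`InputStream` for NSInputStream, `IteratorProtocol` for GeneratorType), and `StreamByteIterator` and its `bufferSize` parameter are hypothetical names I'm introducing for illustration. Errors are still swallowed, just as above.

```swift
import Foundation

// Hypothetical modern counterpart of StreamGenerator: hands out the bytes of
// an InputStream one at a time, refilling an internal buffer when it runs dry.
final class StreamByteIterator: IteratorProtocol {
    private let stream: InputStream
    private var buffer: [UInt8]
    private var filled = 0      // number of valid bytes currently in buffer
    private var position = 0    // index of the next byte to hand out
    private var finished = false

    init(stream: InputStream, bufferSize: Int = 1024) {
        self.stream = stream
        self.buffer = [UInt8](repeating: 0, count: bufferSize)
        stream.open()
    }

    func next() -> UInt8? {
        if position == filled {
            if finished { return nil }
            // Blocks until at least one byte is available, or the stream ends or fails.
            let count = stream.read(&buffer, maxLength: buffer.count)
            guard count > 0 else {
                // 0 means end of stream, negative means error; both stop iteration here.
                finished = true
                stream.close()
                return nil
            }
            filled = count
            position = 0
        }
        defer { position += 1 }
        return buffer[position]
    }
}

// Usage: an in-memory stream, with a tiny buffer to force several refills.
let payload = Array("héllo 你好".utf8)
let byteIterator = StreamByteIterator(stream: InputStream(data: Data(payload)), bufferSize: 4)
let roundTripped = Array(IteratorSequence(byteIterator))
print(roundTripped == payload)
```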
With that, creating a UTF-8 decoding generator is almost trivial:
    final class UnicodeScalarGenerator<ByteGenerator: GeneratorType where ByteGenerator.Element == UInt8> {
        var byteGenerator: ByteGenerator
        var utf8 = UTF8()

        init(byteGenerator: ByteGenerator) {
            self.byteGenerator = byteGenerator
        }
    }

    extension UnicodeScalarGenerator: GeneratorType {
        func next() -> UnicodeScalar? {
            switch utf8.decode(&byteGenerator) {
            case .Result(let scalar): return scalar
            case .EmptyInput: return nil
            case .Error: return nil // FIXME: Probably want a closure or something to handle error cases
            }
        }
    }
You could of course trivially turn this into a CharacterGenerator instead (using Character(_:UnicodeScalar)).
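As a quick check that the codec really does carry state across reads, here's a sketch in current Swift, where the decoding result cases are spelled `.scalarValue`, `.emptyInput` and `.error`. `ChunkedBytes` and `decodeScalars` are hypothetical helpers of mine: the first simulates a stream that delivers 你 split over three separate reads, the second drains any byte iterator through `UTF8.decode`.

```swift
// Hypothetical stand-in for a network stream: yields the bytes of several
// chunks in order, as if each chunk had arrived in a separate read.
struct ChunkedBytes: IteratorProtocol {
    private var chunks: IndexingIterator<[[UInt8]]>
    private var current = [UInt8]().makeIterator()

    init(_ chunks: [[UInt8]]) {
        self.chunks = chunks.makeIterator()
    }

    mutating func next() -> UInt8? {
        while true {
            if let byte = current.next() { return byte }
            guard let chunk = chunks.next() else { return nil }
            current = chunk.makeIterator()
        }
    }
}

// Decodes scalars until the input is exhausted or malformed.
// (Like the code above, this sketch silently stops on errors.)
func decodeScalars<I: IteratorProtocol>(_ input: I) -> [Unicode.Scalar] where I.Element == UInt8 {
    var input = input
    var utf8 = UTF8()
    var scalars: [Unicode.Scalar] = []
    while case .scalarValue(let scalar) = utf8.decode(&input) {
        scalars.append(scalar)
    }
    return scalars
}

// "a你b", with the three bytes of 你 (E4 BD A0) spread across three chunks.
let scalars = decodeScalars(ChunkedBytes([[0x61, 0xE4], [0xBD], [0xA0, 0x62]]))
var text = ""
text.unicodeScalars.append(contentsOf: scalars)
print(text)
```

Because the decoder pulls bytes from the iterator as it needs them, chunk boundaries never reach it, which is the whole point of bridging the stream to a generator.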
The last problem is if you want to combine all combining marks, such that "LATIN SMALL LETTER A" followed by "COMBINING GRAVE ACCENT" would always be returned together (rather than as the two characters they are). That's actually a bit trickier than it sounds. First, you'd need to generate Strings, not Characters. And then you'd need a good way to know what all the combining characters are. That's certainly knowable, but I'm having a little trouble deriving a simple algorithm. There's no "combiningMarkCharacterSet" in Cocoa. I'm still thinking about it. Getting something that "mostly works" is easy, but I'm not sure yet how to build it so that it's correct for all of Unicode.
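One possible direction, sketched in current Swift rather than the Swift 2 used above: let `String` do the grapheme segmentation instead of maintaining a list of combining characters. The idea is to hold scalars back until a later scalar starts a new `Character`, and only then release the finished one. `GraphemeAssembler` is a hypothetical name, and this assumes the standard library's grapheme breaking is good enough for your data. I haven't verified it against all of Unicode.

```swift
// Hypothetical helper: buffers scalars and releases a Character only once a
// following scalar proves that the character is complete.
struct GraphemeAssembler {
    private var pending = ""

    // Feed one scalar; returns a completed Character if this scalar began a new one.
    mutating func push(_ scalar: Unicode.Scalar) -> Character? {
        pending.unicodeScalars.append(scalar)
        // While everything still forms a single Character, keep waiting:
        // more combining marks may follow.
        guard pending.count > 1 else { return nil }
        return pending.removeFirst()
    }

    // Call at end of input to flush whatever is still buffered.
    mutating func finish() -> Character? {
        return pending.isEmpty ? nil : pending.removeFirst()
    }
}

var assembler = GraphemeAssembler()
var characters: [Character] = []
for scalar in "a\u{300}b".unicodeScalars {
    if let character = assembler.push(scalar) {
        characters.append(character)
    }
}
if let last = assembler.finish() {
    characters.append(last)
}
print(characters)
```

The cost is latency: a character is only emitted once the next one has started (or the input ends), which matters if you're displaying text as it arrives.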
Here's a little sample program to try it out:
    let textPath = NSBundle.mainBundle().pathForResource("text.txt", ofType: nil)!
    let inputStream = NSInputStream(fileAtPath: textPath)!
    // No need to open the stream here: StreamGenerator's init does that.

    dispatch_async(dispatch_get_global_queue(0, 0)) {
        let streamGen = StreamGenerator(stream: inputStream)
        let unicodeGen = UnicodeScalarGenerator(byteGenerator: streamGen)
        var string = ""
        for c in GeneratorSequence(unicodeGen) {
            print(c)
            string += String(c)
        }
        print(string)
    }
And a little text to read:

    Here is some normalish álfa你好 text
    And some Zalgo i̝̲̲̗̹̼n͕͓̘v͇̠͈͕̻̹̫͡o̷͚͍̙͖ke̛̘̜̘͓̖̱̬ composed stuff
    And one more line with no newline
(That second line is some Zalgo encoded text, which is nice for testing.)
I haven't done any testing with this in a real blocking situation, like reading from the network, but it should work based on how NSInputStream works (i.e. it should block until there's at least one byte to read, but then should just fill the buffer with whatever's available).
I've made all of this match GeneratorType so that it plugs into other things easily, but error handling might work better if you didn't use GeneratorType and instead created your own protocol with next() throws -> Self.Element. Throwing would make it easier to propagate errors up the stack, but would make it harder to plug into for...in loops.
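A minimal sketch of what that throwing variant could look like, in current Swift. `ThrowingIterator`, `UTF8StreamError` and `ThrowingScalarDecoder` are hypothetical names, and I've assumed `next()` returns an optional so that `nil` can still signal a clean end of input while malformed bytes throw.

```swift
// Hypothetical protocol: like IteratorProtocol, but next() may throw.
protocol ThrowingIterator {
    associatedtype Element
    mutating func next() throws -> Element?
}

enum UTF8StreamError: Error {
    case invalidByteSequence
}

// Wraps any byte iterator and decodes UTF-8, throwing on malformed input
// instead of silently ending the sequence.
struct ThrowingScalarDecoder<Bytes: IteratorProtocol>: ThrowingIterator where Bytes.Element == UInt8 {
    private var bytes: Bytes
    private var utf8 = UTF8()

    init(_ bytes: Bytes) {
        self.bytes = bytes
    }

    mutating func next() throws -> Unicode.Scalar? {
        switch utf8.decode(&bytes) {
        case .scalarValue(let scalar): return scalar
        case .emptyInput: return nil
        case .error: throw UTF8StreamError.invalidByteSequence
        }
    }
}

// Usage: "a" followed by 0xFF, which can never appear in valid UTF-8.
var decoder = ThrowingScalarDecoder(([0x61, 0xFF] as [UInt8]).makeIterator())
var decoded: [Unicode.Scalar] = []
var sawError = false
do {
    // A while-let loop takes the place of for...in, which cannot call a throwing next().
    while let scalar = try decoder.next() {
        decoded.append(scalar)
    }
} catch {
    sawError = true
}
print(decoded, sawError)
```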