How to deal with buffered strings from C in Swift?

Posted by 假如想象 on 2019-12-04 12:43:15

If processing speed is your first goal then I would just collect all characters until the XML element is processed completely and endElement is called. This can be done using NSMutableData from the Foundation framework. So you need a property

var charData : NSMutableData?

which is initialized in startElement:

charData = NSMutableData()

In the characters callback you append all data:

charData!.appendBytes(ch, length: Int(len))

(The forced unwrapping is acceptable here: charData can only be nil if startElement has not been called first, which would mean either a programming error on your side or libxml2 not working correctly.)

Finally in endElement, create a Swift string and release the data:

defer {
    // Release data in any case before function returns
    charData = nil
}
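guard let string = String(data: charData!, encoding: NSUTF8StringEncoding) else {
    // Handle invalid UTF-8 data here. The else branch of a guard must
    // leave the enclosing scope, so return after handling the error.
    return
}
// string is the Swift string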
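
Putting the three callbacks together, here is a minimal self-contained sketch of the same collect-then-decode idea. It is written in current Swift syntax, so it uses Data and .utf8 rather than the NSMutableData and NSUTF8StringEncoding names above. ElementTextCollector and its method names are hypothetical stand-ins for your own SAX handler, and the byte arrays only simulate what libxml2 would hand to the characters callback (xmlChar is an unsigned char, so UInt8 is used here). Unlike the snippets above, it returns nil instead of force-unwrapping when startElement was never called.

```swift
import Foundation

// Hypothetical helper: the three methods stand in for the SAX callbacks.
final class ElementTextCollector {
    private var charData: Data?

    // Called when an element starts: begin with an empty buffer.
    func startElement() {
        charData = Data()
    }

    // Mirrors the characters callback: a pointer to raw bytes plus a length.
    // A multi-byte character may be split across two calls, so the bytes are
    // only appended here, never decoded.
    func characters(_ ch: UnsafePointer<UInt8>, _ len: Int32) {
        charData?.append(ch, count: Int(len))
    }

    // Called when the element ends: decode everything at once.
    // Returns nil if the bytes are not valid UTF-8 (or startElement was skipped).
    func endElement() -> String? {
        defer { charData = nil } // release the buffer in any case
        guard let data = charData else { return nil }
        return String(data: data, encoding: .utf8)
    }
}

// Simulate "Aé" arriving in two chunks, with the two bytes of "é"
// (0xC3 0xA9) split across the chunk boundary.
let collector = ElementTextCollector()
collector.startElement()
let firstChunk: [UInt8] = [0x41, 0xC3]
let secondChunk: [UInt8] = [0xA9]
collector.characters(firstChunk, Int32(firstChunk.count))
collector.characters(secondChunk, Int32(secondChunk.count))
let text = collector.endElement()
print(text ?? "invalid UTF-8") // prints "Aé"
```

Because decoding happens only once per element, the split character never causes a problem. The price is that the whole element text is held in memory until endElement.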

The longest legal UTF-8 character is 4 bytes (RFC 3629 Section 3), so you don't need a very big buffer to keep yourself safe. The rules for how many bytes you'll need are pretty easy, too: just look at the first byte (0xxxxxxx is 1 byte, 110xxxxx starts a 2-byte sequence, 1110xxxx a 3-byte sequence, 11110xxx a 4-byte sequence). So I would just maintain a buffer that holds from 0 to 3 bytes. When you have the right number, pass it along and try to construct a String. Something like this (only lightly tested, may still have corner cases that don't work):

final class UTF8Parser {
    enum Error: ErrorType {
        case BadEncoding
    }
    var workingBytes: [UInt8] = []

    func updateWithBytes(bytes: [UInt8]) throws -> String {

        workingBytes += bytes

        var string = String()
        var index = 0

        while index < workingBytes.count {
            let firstByte = workingBytes[index]
            var numBytes = 0

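                 if firstByte < 0x80 { numBytes = 1 }
            else if firstByte < 0xC0 { throw Error.BadEncoding } // stray continuation byte (10xxxxxx)
            else if firstByte < 0xE0 { numBytes = 2 }
            else if firstByte < 0xF0 { numBytes = 3 }
            else if firstByte < 0xF8 { numBytes = 4 }
            else                     { throw Error.BadEncoding } // 0xF8...0xFF never occur in UTF-8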

            if workingBytes.count - index < numBytes {
                break
            }

            let charBytes = workingBytes[index..<index+numBytes]

            guard let newString = String(bytes: charBytes, encoding: NSUTF8StringEncoding) else {
                throw Error.BadEncoding
            }
            string += newString
            index += numBytes
        }

        workingBytes.removeFirst(index)
        return string
    }
}

let parser = UTF8Parser()
var string = ""
string += try parser.updateWithBytes([UInt8(65)])

print(string)
let partial = try parser.updateWithBytes([UInt8(0xCC)])
print(partial)

let rest = try parser.updateWithBytes([UInt8(0x81)])
print(rest)

string += rest
print(string)

This is just one fairly straightforward way. Another approach that is probably faster would be to walk backwards through the bytes, looking for the last start of a code point (a byte whose top two bits are not "10"). Then you could process everything up to that point in one fell swoop, and special-case just the last few bytes.
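
A rough sketch of that backwards-scanning idea, in current Swift syntax and only lightly reasoned through, so treat it as a starting point rather than a finished implementation. ChunkedUTF8Decoder, decode and InvalidUTF8 are names made up for this example. It assumes that discarding the buffered bytes after an error is acceptable for your use case.

```swift
import Foundation

// Hypothetical type: decodes UTF-8 that arrives in arbitrary chunks.
struct ChunkedUTF8Decoder {
    struct InvalidUTF8: Error {}

    // Bytes of an incomplete trailing character, carried to the next call.
    private var pending: [UInt8] = []

    mutating func decode(_ bytes: [UInt8]) throws -> String {
        pending += bytes
        var cut = pending.count // everything before `cut` gets decoded now

        // Walk backwards over at most the last 4 bytes, looking for the
        // last lead byte, i.e. the last byte that is not 10xxxxxx.
        var i = pending.count - 1
        let lowest = max(0, pending.count - 4)
        while i >= lowest {
            let byte = pending[i]
            if byte & 0xC0 != 0x80 {
                let needed: Int
                if byte < 0x80 { needed = 1 }
                else if byte & 0xE0 == 0xC0 { needed = 2 }
                else if byte & 0xF0 == 0xE0 { needed = 3 }
                else if byte & 0xF8 == 0xF0 { needed = 4 }
                else { needed = 1 } // invalid lead byte, rejected by the decoding below
                // Hold the last character back if it is still incomplete.
                if pending.count - i < needed { cut = i }
                break
            }
            i -= 1
        }

        // Decode the complete prefix in one go; this also validates it.
        guard let string = String(bytes: pending[..<cut], encoding: .utf8) else {
            pending.removeAll()
            throw InvalidUTF8()
        }
        pending.removeFirst(cut)
        return string
    }
}

var decoder = ChunkedUTF8Decoder()
var result = ""
do {
    result += try decoder.decode([0x41, 0xCC]) // "A"; 0xCC is held back
    result += try decoder.decode([0x81, 0x42]) // completes U+0301, then "B"
} catch {
    print("invalid UTF-8")
}
print(result)
```

Only the tail of each chunk is inspected byte by byte. The bulk of the data goes through a single String conversion, which is where the speed-up over the character-at-a-time loop would come from.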
