Guess encoding when creating an NSString from NSData

盖世英雄少女心 2020-12-08 16:19

When reading an NSString from a file I can use initWithContentsOfFile:usedEncoding:error: and it will guess the encoding of the file.

When

2 Answers
  • 2020-12-08 16:46

    In iOS 8 and OS X 10.10 there is a new API on NSString:

    Objective-C

    + (NSStringEncoding)stringEncodingForData:(NSData *)data
                              encodingOptions:(NSDictionary *)opts
                              convertedString:(NSString **)string
                          usedLossyConversion:(BOOL *)usedLossyConversion;
    

    Swift

    open class func stringEncoding(for data: Data,
                       encodingOptions opts: [StringEncodingDetectionOptionsKey : Any]? = nil, 
                     convertedString string: AutoreleasingUnsafeMutablePointer<NSString?>?, 
                        usedLossyConversion: UnsafeMutablePointer<ObjCBool>?) -> UInt
    

    Now you can let the framework do the guessing, and in my experience that works really well!

    From the header (the documentation does not cover the method at the moment, but it was officially mentioned in WWDC Session 204, page 270), the possible items for the options dictionary are:

    1. an array of suggested string encodings (without specifying the 3rd option in this list, all string encodings are considered but the ones in the array will have a higher preference; moreover, the order of the encodings in the array is important: the first encoding has a higher preference than the second one in the array)
    2. an array of string encodings not to use (the string encodings in this list will not be considered at all)
    3. a boolean option indicating whether only the suggested string encodings are considered
    4. a boolean option indicating whether lossy is allowed
    5. an option that gives a specific string to substitute for mystery bytes
    6. the current user's language
    7. a boolean option indicating whether the data is generated by Windows

    If the values in the dictionary have wrong types (for example, the value of NSStringEncodingDetectionSuggestedEncodingsKey is not an array), an exception is thrown.

    If the values in the dictionary are unknown (for example, the value in the array of suggested string encodings is not a valid encoding), the values will be ignored.
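
    As a rough sketch of how such an options dictionary might be passed from Swift: the keys below are the StringEncodingDetectionOptionsKey constants for items 1, 3 and 4 of the list, while the sample data and the particular choice of suggested encodings are assumptions made for illustration only.

```swift
import Foundation

// Sample input (an assumption for this sketch); in practice this is the data you read.
let data = Data("Grüße".utf8)

let options: [StringEncodingDetectionOptionsKey: Any] = [
    // Item 1: suggested encodings, most preferred first, as NSStringEncoding raw values.
    .suggestedEncodingsKey: [
        NSNumber(value: String.Encoding.utf8.rawValue),
        NSNumber(value: String.Encoding.windowsCP1252.rawValue),
    ],
    // Item 3: consider only the suggested encodings.
    .useOnlySuggestedEncodingsKey: true,
    // Item 4: do not allow lossy conversion.
    .allowLossyKey: false,
]

var converted: NSString?
let rawEncoding = NSString.stringEncoding(for: data,
                                          encodingOptions: options,
                                          convertedString: &converted,
                                          usedLossyConversion: nil)
```

    Passing a value of the wrong type here (for example a single number instead of an array for the suggested encodings) is the case that raises an exception, so it is worth building the dictionary in one place.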

    Example (Swift):

    var convertedString: NSString?
    // Returns the detected NSStringEncoding value; 0 means no encoding could be detected.
    let encoding = NSString.stringEncoding(for: data, encodingOptions: nil, convertedString: &convertedString, usedLossyConversion: nil)
    

    If you only want the decoded string and don't care which encoding was detected, you can drop the let encoding = assignment.
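
    If it helps, the call can be wrapped so that callers get a Swift String back together with the lossy flag. This is only a sketch: detectAndDecode is a hypothetical helper name, not part of Foundation, and it treats a return value of 0 as "no encoding detected".

```swift
import Foundation

/// Hypothetical helper (not a Foundation API): detects the encoding of `data`
/// and returns the decoded string, the encoding, and whether the conversion was lossy.
func detectAndDecode(_ data: Data) -> (string: String, encoding: String.Encoding, lossy: Bool)? {
    var converted: NSString?
    var lossy: ObjCBool = false
    let raw = NSString.stringEncoding(for: data,
                                      encodingOptions: nil,
                                      convertedString: &converted,
                                      usedLossyConversion: &lossy)
    // 0 means the framework could not detect an encoding.
    guard raw != 0, let string = converted as String? else { return nil }
    return (string, String.Encoding(rawValue: raw), lossy.boolValue)
}
```

    A caller could then write if let result = detectAndDecode(data) and inspect result.lossy before trusting the text.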

  • 2020-12-08 16:47

    In general, you can’t. However, you can quite reliably identify UTF-8 files – if a file is valid UTF-8, it’s not very likely that it’s supposed to be any other encoding (except if all the bytes are in the ASCII range, in which case any “extended ASCII” encoding, including UTF-8, will give you the same result). All Unicode encodings also have an optional BOM which identifies them. So a reasonable approach would be:

    • Look for a valid BOM. If there is one, use the appropriate encoding.
    • Otherwise, try to interpret it as UTF-8. You can do this by calling -[NSString initWithData:encoding:] with NSUTF8StringEncoding and checking whether the result is non-nil.
    • If that fails, use a default 8-bit encoding, such as +[NSString defaultCStringEncoding] (which provides a locale-appropriate guess).
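
    The three steps above could be sketched in Swift roughly as follows. decodeText is a hypothetical helper name, the BOM table is my own assumption about which encodings to check, and the fallback parameter defaults to NSString.defaultCStringEncoding as suggested in the last step.

```swift
import Foundation

/// Hypothetical helper: BOM check, then strict UTF-8, then a fallback 8-bit encoding.
func decodeText(
    _ data: Data,
    fallback: String.Encoding = String.Encoding(rawValue: NSString.defaultCStringEncoding)
) -> (string: String, encoding: String.Encoding)? {
    // Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
    let boms: [(bytes: [UInt8], encoding: String.Encoding)] = [
        ([0x00, 0x00, 0xFE, 0xFF], .utf32BigEndian),
        ([0xFF, 0xFE, 0x00, 0x00], .utf32LittleEndian),
        ([0xEF, 0xBB, 0xBF], .utf8),
        ([0xFE, 0xFF], .utf16BigEndian),
        ([0xFF, 0xFE], .utf16LittleEndian),
    ]

    // Step 1: a valid BOM decides the encoding; the BOM itself is stripped before decoding.
    for bom in boms where data.starts(with: bom.bytes) {
        let payload = Data(data.dropFirst(bom.bytes.count))
        if let text = String(data: payload, encoding: bom.encoding) {
            return (text, bom.encoding)
        }
    }

    // Step 2: strict UTF-8 decoding returns nil for invalid byte sequences.
    if let text = String(data: data, encoding: .utf8) {
        return (text, .utf8)
    }

    // Step 3: fall back to an 8-bit encoding.
    if let text = String(data: data, encoding: fallback) {
        return (text, fallback)
    }
    return nil
}
```

    Note that the last step is still a guess: most 8-bit encodings accept any byte sequence, so a successful decode there says nothing about whether the text is right.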

    It is possible to try to improve the guess in the last step by trying various different encodings and choosing the one which has fewest sequences of letters with junk in the middle, where “junk” is any character that’s not a letter, space or common punctuation mark. This would significantly increase complexity while not actually being reliable.
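
    To give an idea of what such a heuristic might look like, here is a minimal sketch. junkScore and bestGuess are hypothetical names, the definition of "junk" (anything that is not a letter, whitespace or one of a few punctuation marks, sitting directly between two letters) is a simplification of the idea above, and the candidate list is left to the caller.

```swift
import Foundation

/// Counts "junk" characters that sit directly between two letters.
func junkScore(_ text: String) -> Int {
    let letters = CharacterSet.letters
    let allowed = letters
        .union(.whitespacesAndNewlines)
        .union(CharacterSet(charactersIn: ".,;:!?'\"()-"))
    let scalars = Array(text.unicodeScalars)
    guard scalars.count >= 3 else { return 0 }

    var score = 0
    for i in 1..<(scalars.count - 1) {
        let isJunk = !allowed.contains(scalars[i])
        if isJunk && letters.contains(scalars[i - 1]) && letters.contains(scalars[i + 1]) {
            score += 1
        }
    }
    return score
}

/// Decodes `data` with each candidate and keeps the result with the lowest junk score.
/// Ties go to the candidate listed first.
func bestGuess(for data: Data, candidates: [String.Encoding]) -> (string: String, encoding: String.Encoding)? {
    var best: (string: String, encoding: String.Encoding, score: Int)?
    for encoding in candidates {
        guard let text = String(data: data, encoding: encoding) else { continue }
        let score = junkScore(text)
        if let current = best, current.score <= score { continue }
        best = (text, encoding, score)
    }
    return best.map { ($0.string, $0.encoding) }
}
```

    Even this small version shows the weakness: many byte values map to letters in several 8-bit encodings at once, so different candidates often tie at a score of zero.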

    In short, to be able to handle all available encodings you need to do what TextEdit does: shunt the decision over to the user.

    Oh, one more thing: as of Mac OS X 10.5, the encoding is often stored with a file in the undocumented com.apple.TextEncoding extended attribute. If you open a file with +[NSString stringWithContentsOfFile:usedEncoding:error:] or similar, this will automatically be used if present.
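
    If you want to look at that attribute yourself, a sketch using getxattr (macOS only) could look like this. textEncodingAttribute is a hypothetical helper name, and since the attribute is undocumented, the sketch only returns its raw contents and does not try to interpret them.

```swift
import Foundation

/// Hypothetical helper: returns the raw contents of the com.apple.TextEncoding
/// extended attribute of the file at `path`, or nil if it is absent or unreadable.
func textEncodingAttribute(atPath path: String) -> String? {
    let name = "com.apple.TextEncoding"
    // The first call only asks for the size; a negative result means the attribute is missing.
    let size = getxattr(path, name, nil, 0, 0, 0)
    guard size > 0 else { return nil }

    var buffer = [UInt8](repeating: 0, count: size)
    let read = getxattr(path, name, &buffer, size, 0, 0)
    guard read > 0 else { return nil }
    return String(decoding: buffer.prefix(read), as: UTF8.self)
}
```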
