NSArray from NSCharacterSet

Submitted by ☆樱花仙子☆ on 2019-11-26 11:21:01

The following code creates an array containing all characters of a given character set. It also works for characters outside of the "basic multilingual plane" (characters > U+FFFF, e.g. U+10400 DESERET CAPITAL LETTER LONG I).

NSCharacterSet *charset = [NSCharacterSet uppercaseLetterCharacterSet];
NSMutableArray *array = [NSMutableArray array];
for (int plane = 0; plane <= 16; plane++) {
    if ([charset hasMemberInPlane:plane]) {
        UTF32Char c;
        for (c = plane << 16; c < (plane+1) << 16; c++) {
            if ([charset longCharacterIsMember:c]) {
                UTF32Char c1 = OSSwapHostToLittleInt32(c); // To make it byte-order safe
                NSString *s = [[NSString alloc] initWithBytes:&c1 length:4 encoding:NSUTF32LittleEndianStringEncoding];
                [array addObject:s];
            }
        }
    }
}

For the uppercaseLetterCharacterSet this gives an array of 1467 elements. But note that characters > U+FFFF are stored as a UTF-16 surrogate pair in NSString, so for example U+10400 is actually stored in NSString as 2 characters, "\uD801\uDC00".
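
To make the surrogate-pair remark concrete, here is a minimal Swift sketch (standard library only, not part of the original answer). It assumes nothing beyond the fact that NSString's length counts UTF-16 code units; the variable names are only for illustration.

```swift
// U+10400 DESERET CAPITAL LETTER LONG I lies outside the BMP.
let deseret = "\u{10400}"

// One user-perceived character and one Unicode scalar...
let characterCount = deseret.count
let scalarCount = deseret.unicodeScalars.count

// ...but two UTF-16 code units, which is the unit NSString counts in.
let utf16Units = Array(deseret.utf16)

print(characterCount, scalarCount, utf16Units.map { String($0, radix: 16) })
// 1 1 ["d801", "dc00"]
```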

Swift 2 code can be found in other answers to this question. Here is a Swift 3 version, written as an extension method:

extension CharacterSet {
    func allCharacters() -> [Character] {
        var result: [Character] = []
        for plane: UInt8 in 0...16 where self.hasMember(inPlane: plane) {
            for unicode in UInt32(plane) << 16 ..< UInt32(plane + 1) << 16 {
                if let uniChar = UnicodeScalar(unicode), self.contains(uniChar) {
                    result.append(Character(uniChar))
                }
            }
        }
        return result
    }
}

Example:

let charset = CharacterSet.uppercaseLetters
let chars = charset.allCharacters()
print(chars.count) // 1521
print(chars) // ["A", "B", "C", ... "]

(Note that some characters may not be present in the font used to display the result.)

Since characters have a limited, finite (and not too wide) range, you can just test which characters are members of a given character set (brute force):

// UNICHAR_MAX doesn't seem to be available, so derive it (CHAR_BIT comes from <limits.h>)
#define UNICHAR_MAX (1ull << (CHAR_BIT * sizeof(unichar)))

NSData *data = [[NSCharacterSet uppercaseLetterCharacterSet] bitmapRepresentation];
const uint8_t *ptr = [data bytes];
NSMutableArray *allCharsInSet = [NSMutableArray array];
// following from Apple's sample code
// The counter must be wider than unichar: a unichar counter wraps around
// after 0xFFFF, so the loop condition would never become false.
for (uint32_t i = 0; i < UNICHAR_MAX; i++) {
    if (ptr[i >> 3] & (1u << (i & 7))) {
        unichar c = (unichar)i;
        [allCharsInSet addObject:[NSString stringWithCharacters:&c length:1]];
    }
}

Remark: Due to the size of a unichar and the structure of the additional segments in bitmapRepresentation, this solution only works for characters <= 0xFFFF and is not suitable for higher planes.

Inspired by Satachito's answer, here is a performant way to make an Array from a CharacterSet using bitmapRepresentation:

extension CharacterSet {
    func characters() -> [Character] {
        // A Unicode scalar is any Unicode code point in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive.
        return codePoints().compactMap { UnicodeScalar($0) }.map { Character($0) }
    }

    func codePoints() -> [Int] {
        var result: [Int] = []
        var plane = 0
        // following documentation at https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation
        for (i, w) in bitmapRepresentation.enumerated() {
            let k = i % 8193
            if k == 8192 {
                // plane index byte
                plane = Int(w) << 13
                continue
            }
            let base = (plane + k) << 3
            for j in 0 ..< 8 where w & 1 << j != 0 {
                result.append(base + j)
            }
        }
        return result
    }
}

Example for uppercaseLetters

let charset = CharacterSet.uppercaseLetters
let chars = charset.characters()
print(chars.count) // 1733
print(chars) // ["A", "B", "C", ... "]

Example for discontinuous planes

let charset = CharacterSet(charactersIn: "𝚨󌞑")
let codePoints = charset.codePoints()
print(codePoints) // [120488, 837521]
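
To see why discontinuous planes need care, it can help to list which plane index bytes a bitmap actually contains. The following is a rough sketch, assuming the layout described in Apple's bitmapRepresentation documentation (8192 bytes for the BMP, then one plane index byte plus 8192 bytes for each further plane that has members). The helper name supplementaryPlanes(inBitmap:) is hypothetical, and the bitmap here is built by hand rather than taken from a real CharacterSet.

```swift
/// Returns the plane index bytes found in a raw character-set bitmap.
/// Hypothetical helper, not part of Foundation.
func supplementaryPlanes(inBitmap bitmap: [UInt8]) -> [UInt8] {
    let planeSize = 8192
    var planes: [UInt8] = []
    var offset = planeSize            // position of the first plane index byte, if any
    while offset < bitmap.count {
        planes.append(bitmap[offset])
        offset += planeSize + 1       // skip the index byte and its 8192 data bytes
    }
    return planes
}

// Synthetic bitmap: BMP block, then blocks for planes 1 and 12 (planes 2 to 11 omitted).
var bitmap = [UInt8](repeating: 0, count: 8192)
bitmap.append(1)
bitmap.append(contentsOf: [UInt8](repeating: 0, count: 8192))
bitmap.append(12)
bitmap.append(contentsOf: [UInt8](repeating: 0, count: 8192))

let planes = supplementaryPlanes(inBitmap: bitmap)
print(planes) // [1, 12]
```

With a real set, something like Array(charset.bitmapRepresentation) could be passed in instead of the synthetic bytes.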

Performance

Very good: built in release mode, this bitmapRepresentation-based solution seems to be 3 to 10 times faster than Martin R's solution with contains or Oliver Atkinson's solution with longCharacterIsMember.

I created a Swift (v2.1) version of Martin R's algorithm:

let charset = NSCharacterSet.URLPathAllowedCharacterSet();

for plane : UInt8 in 0...16 {
    if charset.hasMemberInPlane( plane ) {
        // C-style for loop: valid in Swift 2.1, removed in Swift 3
        for var c : UInt32 = UInt32( plane ) << 16; c < (UInt32(plane)+1) << 16; c++ {
            if charset.longCharacterIsMember(c) {
                var c1 = c.littleEndian // To make it byte-order safe
                let s = NSString(bytes: &c1, length: 4, encoding: NSUTF32LittleEndianStringEncoding);
                NSLog("Char: \(s)");
            }
        }
    }
}

This version does the same thing in somewhat more idiomatic Swift:

let characters = NSCharacterSet.uppercaseLetterCharacterSet()
var array      = [String]()

for plane: UInt8 in 0...16 where characters.hasMemberInPlane(plane) {

  for character: UTF32Char in UInt32(plane) << 16..<(UInt32(plane) + 1) << 16 where characters.longCharacterIsMember(character) {

    var endian = character.littleEndian
    let string = NSString(bytes: &endian, length: 4, encoding: NSUTF32LittleEndianStringEncoding) as! String

    array.append(string)

  }

}

print(array)

For just A-Z of the Latin alphabet (leaving out Greek, letters with diacritical marks, and anything else the question did not ask for), stop after the first 26 members:

for plane: UInt8 in 0...16 where characters.hasMemberInPlane(plane) {
    // half-open range: the upper bound already belongs to the next plane
    for character: UTF32Char in UInt32(plane) << 16..<(UInt32(plane) + 1) << 16 where characters.longCharacterIsMember(character) {
        var endian = character.littleEndian
        let string = NSString(bytes: &endian, length: 4, encoding: NSUTF32LittleEndianStringEncoding) as! String
        array.append(string)
        if(array.count == 26) {
            break
        }
    }
    if(array.count == 26) {
        break
    }
}

I found Martin R's solution to be too slow for my purposes, so I solved it another way using CharacterSet's bitmapRepresentation property.

This is significantly faster according to my benchmarks:

var ranges = [CountableClosedRange<UInt32>]()
let bitmap: Data = characterSet.bitmapRepresentation
var first: UInt32?, last: UInt32?
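// planeBase: first code point of the current plane; blockStart: offset of its first data byte
var planeBase: UInt32 = 0, blockStart = 0, nextPlane = 8192
for (j, byte) in bitmap.enumerated() where byte != 0 {
    if j == nextPlane {
        // Plane index byte (always nonzero). Planes without members are omitted
        // from the bitmap, so read the plane number instead of counting blocks.
        planeBase = UInt32(byte) << 16
        blockStart = j + 1
        nextPlane += 8193
        continue
    }
    for i in 0 ..< 8 where byte & 1 << i != 0 {
        let codePoint = planeBase + UInt32(j - blockStart) * 8 + UInt32(i)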
        if let _last = last, codePoint == _last + 1 {
            last = codePoint
        } else {
            if let first = first, let last = last {
                ranges.append(first ... last)
            }
            first = codePoint
            last = codePoint
        }
    }
}
if let first = first, let last = last {
    ranges.append(first ... last)
}
return ranges

This solution returns an array of codePoint ranges, but you can easily adapt it to return individual characters or strings, etc.
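
As one possible adaptation, here is a small sketch that expands such ranges into individual characters. The function name characters(in:) is hypothetical, and the sample ranges are made up for illustration rather than produced by the code above. Surrogate code points are skipped because they are not valid Unicode scalars.

```swift
/// Expands code point ranges into individual characters (hypothetical helper).
/// Code points in U+D800...U+DFFF are not Unicode scalars and are skipped.
func characters(in ranges: [ClosedRange<UInt32>]) -> [Character] {
    var result: [Character] = []
    for range in ranges {
        for codePoint in range {
            if let scalar = Unicode.Scalar(codePoint) {
                result.append(Character(scalar))
            }
        }
    }
    return result
}

// Made-up input: A...C, two surrogate code points, and U+10400.
let sample: [ClosedRange<UInt32>] = [0x41...0x43, 0xD800...0xD801, 0x10400...0x10400]
let expanded = characters(in: sample)
print(expanded.count) // 4
```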

You should not; this is not the purpose of a character set. An NSCharacterSet is a possibly-infinite set of characters, possibly in not-yet-invented code points. All you want to know is "Is this character or collection of characters in this set?", and to that end it is useful.

Imagine this Swift code:

let asciiCodepoints = Unicode.Scalar(0x00)...Unicode.Scalar(0x7F)
let asciiCharacterSet = CharacterSet(charactersIn: asciiCodepoints)
let nonAsciiCharacterSet = asciiCharacterSet.inverted

Which is analogous to this Objective-C code:

NSRange asciiCodepoints = NSMakeRange(0x00, 0x80); // location 0x00, length 0x80: U+0000 through U+007F
NSCharacterSet * asciiCharacterSet = [NSCharacterSet characterSetWithRange:asciiCodepoints];
NSCharacterSet * nonAsciiCharacterSet = asciiCharacterSet.invertedSet;

It's easy to say "loop over all the characters in asciiCharacterSet"; that would just loop over all characters from U+0000 through U+007F. But what does it mean to loop over all the characters in nonAsciiCharacterSet? Do you start at U+0080? Who's to say there won't be negative codepoints in the future? Where do you end? Do you skip non-printable characters? What about extended grapheme clusters? Since it's a set (where order doesn't matter), can your code handle out-of-order codepoints in this loop?

These are questions you don't want to answer here; functionally nonAsciiCharacterSet is infinite, and all you want to use it for is to tell if any given character lies outside the set of ASCII characters.
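
A brief sketch of that usage, assuming Foundation's CharacterSet and its contains(_:) method on Unicode scalars; the variable names are only illustrative. The inverted set is never enumerated, only queried.

```swift
import Foundation

// Spell out the bounds as Unicode scalars to keep the range type unambiguous.
let lowerBound: Unicode.Scalar = "\u{00}"
let upperBound: Unicode.Scalar = "\u{7F}"

let ascii = CharacterSet(charactersIn: lowerBound...upperBound)
let nonASCII = ascii.inverted

// Membership checks work without knowing how large the inverted set is.
print(ascii.contains("A"))            // true
print(nonASCII.contains("A"))         // false
print(nonASCII.contains("\u{E9}"))    // true (U+00E9, e with acute)
print(nonASCII.contains("\u{10400}")) // true (outside the BMP)
```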


The question you should really be asking yourself is: "What do I want to accomplish with this array of capital letters?" If (and likely only if) you really need to iterate over it in order, putting the ones you care about into an Array or String (perhaps one read in from a resource file) is probably the best way. If you want to check to see if a character is part of the set of uppercase letters, then you don't care about order or even how many characters are in the set, and should use CharacterSet.uppercaseLetters.contains(foo) (in Objective-C: [NSCharacterSet.uppercaseLetterCharacterSet longCharacterIsMember:foo]).
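
For instance, picking the uppercase letters out of a string needs no enumeration of the set at all. This is a minimal sketch; the helper name uppercaseScalars(in:) is hypothetical, and it works on Unicode scalars because that is what CharacterSet.contains(_:) accepts.

```swift
import Foundation

/// Returns the scalars of `text` that are members of CharacterSet.uppercaseLetters.
/// Hypothetical helper for illustration.
func uppercaseScalars(in text: String) -> [Unicode.Scalar] {
    return text.unicodeScalars.filter { CharacterSet.uppercaseLetters.contains($0) }
}

// "\u{3A9}" is GREEK CAPITAL LETTER OMEGA.
let found = uppercaseScalars(in: "Hello, \u{3A9}orld")
print(found.count) // 2
```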

Think, too, about non-Latin characters. CharacterSet.uppercaseLetters covers Unicode General Categories Lu and Lt, which contain A through Z and also things like Dž, 𝕹, and Խ. You don't want to have to think about this. You definitely don't want to issue an update to your app when the Unicode Consortium adds new characters to this list. If what you want to do is decide whether something is uppercase, don't bother hard-coding anything.
