问题
I am dealing with strings containing \r\n
with Swift 4.2. I ran into kind of strange behavior of Swift index, it appears \r\n
will be treated as one character instead of two by Swift indexing methods. I wrote a piece of code to present this behavior:
var text = "ABC\r\n\r\nDEF"
func printChar(_ lower: Int, _ upper: Int) {
let start = text.index(text.startIndex, offsetBy: lower)
let end = text.index(text.startIndex, offsetBy: upper)
print("\"" + text[start..<end] + "\"")
}
printChar(0, 1) // "A"
printChar(1, 2) // "B"
printChar(2, 3) // "C"
printChar(3, 4) // new line
printChar(4, 5) // new line (okay, what's going on here?)
printChar(5, 6) // "D"
printChar(6, 7) // "E"
printChar(7, 8) // "F"
The print result will be
"A"
"B"
"C"
"
"
"
"
"D"
"E"
"F"
Any idea why it's like this?
回答1:
TLDR: \r\n
is a grapheme cluster and is treated as a single Character
in Swift because Unicode.
Swift treats
\r\n
as oneCharacter
.Objective-C
NSString
treats it as two characters (in terms of the result fromlength
).
On the swift-users forum someone wrote:
– "\r\n" is a single
Character
. Is this the correct behaviour?– Yes, a
Character
corresponds to a Unicode grapheme cluster, and "\r\n" is considered a single grapheme cluster.
And the subsequent response posted a link to Unicode documentation, check out this table which officially states CRLF is a grapheme cluster.
Take a look at the Apple documentation on Characters and Grapheme Clusters.
It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.
The Swift documentation on Strings and Characters is also worth reading.
This overview from objc.io is interesting as well.
NSString
represents UTF-16-encoded text. Length, indices, and ranges are all based on UTF-16 code units.
Another example of this is an emoji like 👍🏻. This single character is actually %uD83D%uDC4D%uD83C%uDFFB, four different unicode scalars. But if you called count
on a string with just that emoji you'd (correctly) get 1
.
If you wanted to see the scalars you could iterate them as follows:
for scalar in text.unicodeScalars {
print("\(scalar.value) ", terminator: "")
}
Which for "\r\n"
would give you 13 10
In the Swift documentation you'll find why NSString
is different:
The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.
Thus this isn't really "strange" behaviour of Swift string indexing, but rather a result of how Unicode treats these characters and how String
in Swift is designed. Swift string indexing goes by Character
and \r\n
is a single Character
.
来源:https://stackoverflow.com/questions/53940147/swift-string-indexing-combines-r-n-as-one-char-instead-of-two