Why is `-lt` behaving differently for chars and strings?

Submitted by 懵懂的女人 on 2021-02-03 07:33:46

Question


I recently answered an SO question about using -lt or -gt with strings. My answer was based on something I had read earlier, which said that -lt compares one char from each string at a time until the ASCII values differ. At that point the result (lower/equal/greater) decides. By that logic, "Less" -lt "less" should return True because L has a lower ASCII byte value than l, but it doesn't:

[System.Text.Encoding]::ASCII.GetBytes("Less".ToCharArray())
76
101
115
115

[System.Text.Encoding]::ASCII.GetBytes("less".ToCharArray())
108
101
115
115

"Less" -lt "less"
False

It seems that I may have been missing a crucial piece: the test is case-insensitive.

#L has a lower ASCII-value than l. PS doesn't care. They're equal
"Less" -le "less"
True

#The last s has a lower ASCII-value than t. PS cares.
"Less" -lt "lest"
True

#T has a lower ASCII-value than t. PS doesn't care
"LesT" -lt "lest"
False

#Again PS doesn't care. They're equal
"LesT" -le "lest"
True

I then tried to test char vs single-character-string:

[int][char]"L"
76

[int][char]"l"
108


#Using string it's case-insensitive. L = l
"L" -lt "l"
False

"L" -le "l"
True

"L" -gt "l"
False

#Using chars it's case-sensitive! L < l
([char]"L") -lt ([char]"l")
True

([char]"L") -gt ([char]"l")
False

For comparison, I tried to use the case-sensitive less-than operator, but it says L > l which is the opposite of what -lt returned for chars.

"L" -clt "l"
False

"l" -clt "L"
True

How does the comparison work, given that it clearly isn't based on ASCII values, and why does it behave differently for chars vs. strings?


Answer 1:


A big thank-you to PetSerAl for all his invaluable input.

tl;dr:

  • -lt and -gt compare [char] instances numerically by Unicode codepoint.

    • Confusingly, so do -ilt, -clt, -igt, -cgt - even though they only make sense with string operands, but that's a quirk in the PowerShell language itself (see bottom).
  • -eq (and its alias -ieq), by contrast, compares [char] instances case-insensitively, which is typically, but not necessarily, the same as a case-insensitive string comparison (-ceq again compares strictly numerically).

    • -eq/-ieq ultimately also compares numerically, but first converts the operands to their uppercase equivalents using the invariant culture; as a result, this comparison is not fully equivalent to PowerShell's string comparison, which additionally recognizes so-called compatible sequences (distinct characters or even sequences considered to have the same meaning; see Unicode equivalence) as equal.
    • In other words: PowerShell special-cases the behavior of only -eq / -ieq with [char] operands, and does so in a manner that is almost, but not quite the same as case-insensitive string comparison.
  • This distinction leads to counter-intuitive behavior such as [char] 'A' -eq [char] 'a' and [char] 'A' -lt [char] 'a' both returning $true.

  • To be safe:

    • always cast to [int] if you want numeric (Unicode codepoint) comparison.
    • always cast to [string] if you want string comparison.
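
The two casts recommended above can be sketched as follows. This is a minimal illustration using the characters from the question; the variable names are made up for the example.

```powershell
# Two [char] values that differ only in case.
$upper = [char] 'L'
$lower = [char] 'l'

# Numeric comparison: cast to [int] to compare Unicode codepoints (76 vs. 108).
([int] $upper) -lt ([int] $lower)        # $true

# String comparison: cast to [string] to get case-insensitive string semantics.
([string] $upper) -lt ([string] $lower)  # $false: the strings are considered equal
([string] $upper) -eq ([string] $lower)  # $true
```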

For background information, read on.


PowerShell's usually helpful operator overloading can be tricky at times.

Note that in a numeric context (whether implicit or explicit), PowerShell treats characters ([char] ([System.Char]) instances) numerically, by their Unicode codepoint (not ASCII).

[char] 'A' -eq 65  # $true, in the 'Basic Latin' Unicode range, which coincides with ASCII
[char] 'Ā' -eq 256 # $true; 0x100, in the 'Latin Extended-A' Unicode range

What makes [char] unusual is that its instances are compared to each other numerically as-is, by Unicode codepoint, EXCEPT with -eq/-ieq.

  • -ceq, -lt, and -gt compare directly by Unicode codepoints, and - counter-intuitively - so do -ilt, -clt, -igt and -cgt:
[char] 'A' -lt [char] 'a'  # $true; Unicode codepoint 65 ('A') is less than 97 ('a')
  • -eq (and its alias -ieq) first transforms the characters to uppercase, then compares the resulting Unicode codepoints:
[char] 'A' -eq [char] 'a' # !! ALSO $true; equivalent of 65 -eq 65
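
As a rough sketch of the -eq / -ieq behavior just described, the uppercase-then-compare step can be emulated by hand. This only mirrors the description above and is not PowerShell's actual implementation; [char]::ToUpperInvariant is the standard .NET method, and the variable names are illustrative.

```powershell
$a = [char] 'A'
$b = [char] 'a'

# -ceq compares the codepoints as-is: 65 vs. 97.
$a -ceq $b    # $false

# Emulating -eq / -ieq: uppercase both operands with the invariant culture,
# then compare the resulting codepoints numerically (65 vs. 65).
$upperA = [int] [char]::ToUpperInvariant($a)
$upperB = [int] [char]::ToUpperInvariant($b)
$upperA -eq $upperB    # $true
```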

The upshot is paradoxical: in the world of PowerShell, the character 'A' is both less than and equal to 'a', depending on how you compare.

Also, directly or indirectly - after transformation to uppercase - comparing Unicode codepoints is NOT the same as comparing them as strings, because PowerShell's string comparison additionally recognizes so-called compatible sequences, where characters (or even character sequences) are considered "the same" if they have the same meaning (see Unicode equivalence); e.g.:

# Distinct Unicode characters U+2126 (Ohm Sign) and U+03A9 (Greek Capital Letter Omega)
# ARE recognized as the "same thing" in a *string* comparison:
"Ω" -ceq "Ω"  # $true, despite having distinct Unicode codepoints

# -eq/-ieq: with [char], by only applying transformation to uppercase, the results
# are still different codepoints, which - compared numerically - are NOT equal:
[char] 'Ω' -eq [char] 'Ω' # $false: uppercased codepoints differ

# -ceq always applies direct codepoint comparison.
[char] 'Ω' -ceq [char] 'Ω' # $false: codepoints differ

Note that use of prefixes i or c to explicitly specify case-matching behavior is NOT sufficient to force string comparison, even though conceptually operators such as -ceq, -ieq, -clt, -ilt, -cgt, -igt only make sense with strings.

Effectively, the i and c prefixes are simply ignored when applied to -lt and -gt while comparing [char] operands; as it turns out (unlike what I originally thought), this is a general PowerShell pitfall - see below for an explanation.

As an aside: -lt and -gt logic in string comparison is not numeric, but based on collation order (a human-centric way of ordering independent of codepoints / byte values), which in .NET terms is controlled by cultures (either by default by the one currently in effect, or by passing a culture parameter to methods).
As @PetSerAl demonstrates in a comment (and unlike what I originally claimed), PS string comparisons use the invariant culture, not the current culture, so their behavior is the same, irrespective of what culture is the current one.
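
To see the collation-based behavior side by side with the operators, here is a small sketch that calls [string]::Compare with the invariant culture explicitly. It assumes, as stated above, that the operators use the invariant culture; the $invariant variable is only a convenience for the example.

```powershell
$invariant = [cultureinfo]::InvariantCulture

# Case-insensitive: the two strings are equal, so neither is "less than" the other.
[string]::Compare('Less', 'less', $true, $invariant)    # 0
'Less' -lt 'less'                                       # $false
'Less' -le 'less'                                       # $true

# Case-sensitive: in the collation order, lowercase sorts before uppercase,
# which is the opposite of the codepoint order.
[string]::Compare('Less', 'less', $false, $invariant)   # a positive number
'Less' -clt 'less'                                      # $false
'less' -clt 'Less'                                      # $true
```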


Behind the scenes:

As @PetSerAl explains in the comments, PowerShell's parsing doesn't distinguish between the base form of an operator and its i-prefixed form; e.g., both -lt and -ilt are translated to the same value, Ilt.
Thus, PowerShell cannot implement differing behavior for -lt vs. -ilt, -gt vs. -igt, ..., because it treats them the same at the syntax level.
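
One way to check this yourself is to ask the parser which operator kind it assigns to an expression. The sketch below uses the public System.Management.Automation.Language API; Get-OperatorKind is a hypothetical helper written for this example, not a built-in command.

```powershell
# Hypothetical helper: parse an expression and return the operator kind (a TokenKind value)
# of the first binary expression found in it.
function Get-OperatorKind {
    param([string] $Expression)

    $ast = [System.Management.Automation.Language.Parser]::ParseInput(
        $Expression, [ref] $null, [ref] $null)

    $binary = $ast.Find(
        { $args[0] -is [System.Management.Automation.Language.BinaryExpressionAst] },
        $true)

    $binary.Operator
}

Get-OperatorKind '1 -lt 2'    # Ilt
Get-OperatorKind '1 -ilt 2'   # Ilt: same kind as the unprefixed form
Get-OperatorKind '1 -clt 2'   # Clt: only the c-prefixed form is distinct
```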

This leads to somewhat counter-intuitive behavior in that operator prefixes are effectively ignored when comparing data types where case-sensitivity has no meaning - as opposed to getting coerced to strings, as one might expect; e.g.:

"10" -cgt "2"  # $false, because "2" comes after "1" in the collation order

10 -cgt 2  # !! $true; *numeric* comparison still happens; the `c` is ignored.

In the latter case I would have expected the use of -cgt to coerce the operands to strings, given that case-sensitive comparison is only a meaningful concept in string comparison, but that is NOT how it works.
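
If string comparison of numbers is what you actually want, the coercion has to be spelled out. A minimal sketch, using only casts:

```powershell
# Numeric operands: the c prefix is ignored and the comparison stays numeric.
10 -cgt 2                          # $true

# Explicit casts force string comparison, matching "10" -cgt "2".
([string] 10) -cgt ([string] 2)    # $false
```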

If you want to dig deeper into how PowerShell operates, see @PetSerAl's comments below.




Answer 2:


Not quite sure what to post here other than the comparisons are all correct when dealing with strings/characters. If you want an Ordinal comparison, do an Ordinal comparison and you get results based on that.

Best Practices for Using Strings in the .NET Framework

[string]::Compare('L','l')
returns 1

and

[string]::Compare("L","l", [stringcomparison]::Ordinal)
returns -32
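
Applied to the strings from the question, an ordinal comparison gives the result the asker originally expected. Test-OrdinalLessThan below is a hypothetical helper written for this sketch, not a built-in command; it wraps the standard [string]::CompareOrdinal method.

```powershell
# Hypothetical helper: $true if $Left sorts before $Right when compared
# by UTF-16 code unit value, ignoring culture and collation rules.
function Test-OrdinalLessThan {
    param([string] $Left, [string] $Right)
    [string]::CompareOrdinal($Left, $Right) -lt 0
}

Test-OrdinalLessThan 'Less' 'less'   # $true: 'L' (76) precedes 'l' (108)
'Less' -lt 'less'                    # $false: the operator is case-insensitive
```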

Not sure what to add here to help clarify.

Also see: Upper vs Lower Case



Source: https://stackoverflow.com/questions/36096322/why-is-lt-behaving-differently-for-chars-and-strings
