Why does findstr not handle case properly (in some circumstances)?

名媛妹妹 2020-11-29 07:08

While writing some recent scripts in cmd.exe, I had a need to use findstr with regular expressions - customer required standard cmd.exe commands (no GnuWin32 no

4 Answers
  • 2020-11-29 07:18

    I believe this is mostly a horrible design flaw.

    We all expect the ranges to collate based on the ASCII code value. But they don't. Instead, the ranges are based on a collation sequence that nearly matches the default sequence used by SORT. EDIT: The exact collation sequence used by FINDSTR is now available at https://stackoverflow.com/a/20159191/1012053 under the section titled Regex character class ranges [x-y].

    I prepared a text file containing one line for each extended ASCII character from 1 to 255, excluding 10 (LF), 13 (CR), and 26 (EOF on Windows). Each line holds the character, followed by a space, followed by the decimal code for the character. I then ran the file through SORT and captured the output in a sortedChars.txt file.
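    For anyone who wants to reproduce the test file, here is a minimal Python sketch of the procedure described above. The function names and the file name chars.txt are hypothetical, and the three-digit zero-padded code is inferred from the output shown in this answer. The bytes are written raw, so how each one displays depends on the active console code page.

```python
EXCLUDED = {10, 13, 26}  # LF, CR, and EOF (Ctrl-Z), as listed above


def build_char_table():
    """Return the file content: one 'char, space, code' line per byte value."""
    lines = []
    for code in range(1, 256):
        if code in EXCLUDED:
            continue
        # Raw byte, a space, then the decimal code padded to three digits.
        lines.append(bytes([code]) + b" " + f"{code:03d}".encode("ascii"))
    return b"\r\n".join(lines) + b"\r\n"


def write_char_table(path):
    """Write the table in binary mode so that no byte gets translated."""
    with open(path, "wb") as handle:
        handle.write(build_char_table())
```

    Calling write_char_table("chars.txt") and then running sort chars.txt > sortedChars.txt in cmd.exe should give a file comparable to the one used in the tests here.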

    I now can easily test any regex range against this sorted file and demonstrate how the range is determined by a collation sequence that is nearly the same as SORT.

    >findstr /nrc:"^[0-9]" sortedChars.txt
    137:0 048
    138:½ 171
    139:¼ 172
    140:1 049
    141:2 050
    142:² 253
    143:3 051
    144:4 052
    145:5 053
    146:6 054
    147:7 055
    148:8 056
    149:9 057
    

    The results are not quite what we expected in that chars 171, 172 and 253 are thrown in the mix. But the results make perfect sense. The line number prefix corresponds to the SORT collation sequence, and you can see that the range exactly matches according to the SORT sequence.

    Here is another range test that exactly follows the SORT sequence:

    >findstr /nrc:"^[!-=]" sortedChars.txt
    34:! 033
    35:" 034
    36:# 035
    37:$ 036
    38:% 037
    39:& 038
    40:( 040
    41:) 041
    42:* 042
    43:, 044
    44:. 046
    45:/ 047
    46:: 058
    47:; 059
    48:? 063
    49:@ 064
    50:[ 091
    51:\ 092
    52:] 093
    53:^ 094
    54:_ 095
    55:` 096
    56:{ 123
    57:| 124
    58:} 125
    59:~ 126
    60:¡ 173
    61:¿ 168
    62:¢ 155
    63:£ 156
    64:¥ 157
    65:₧ 158
    66:+ 043
    67:∙ 249
    68:< 060
    69:= 061
    

    There is one small anomaly with alpha characters. Character "a" sorts between "A" and "Z" yet it does not match [A-Z]. "z" sorts after "Z", yet it matches [A-Z]. There is a corresponding problem with [a-z]. "A" sorts before "a", yet it matches [a-z]. "Z" sorts between "a" and "z", yet it does not match [a-z].

    Here are the [A-Z] results:

    >findstr /nrc:"^[A-Z]" sortedChars.txt
    151:A 065
    153:â 131
    154:ä 132
    155:à 133
    156:å 134
    157:Ä 142
    158:Å 143
    159:á 160
    160:ª 166
    161:æ 145
    162:Æ 146
    163:B 066
    164:b 098
    165:C 067
    166:c 099
    167:Ç 128
    168:ç 135
    169:D 068
    170:d 100
    171:E 069
    172:e 101
    173:é 130
    174:ê 136
    175:ë 137
    176:è 138
    177:É 144
    178:F 070
    179:f 102
    180:ƒ 159
    181:G 071
    182:g 103
    183:H 072
    184:h 104
    185:I 073
    186:i 105
    187:ï 139
    188:î 140
    189:ì 141
    190:í 161
    191:J 074
    192:j 106
    193:K 075
    194:k 107
    195:L 076
    196:l 108
    197:M 077
    198:m 109
    199:N 078
    200:n 110
    201:ñ 164
    202:Ñ 165
    203:ⁿ 252
    204:O 079
    205:o 111
    206:ô 147
    207:ö 148
    208:ò 149
    209:Ö 153
    210:ó 162
    211:º 167
    212:P 080
    213:p 112
    214:Q 081
    215:q 113
    216:R 082
    217:r 114
    218:S 083
    219:s 115
    220:ß 225
    221:T 084
    222:t 116
    223:U 085
    224:u 117
    225:û 150
    226:ù 151
    227:ú 163
    228:ü 129
    229:Ü 154
    230:V 086
    231:v 118
    232:W 087
    233:w 119
    234:X 088
    235:x 120
    236:Y 089
    237:y 121
    238:ÿ 152
    239:Z 090
    240:z 122
    

    And the [a-z] results:

    >findstr /nrc:"^[a-z]" sortedChars.txt
    151:A 065
    152:a 097
    153:â 131
    154:ä 132
    155:à 133
    156:å 134
    157:Ä 142
    158:Å 143
    159:á 160
    160:ª 166
    161:æ 145
    162:Æ 146
    163:B 066
    164:b 098
    165:C 067
    166:c 099
    167:Ç 128
    168:ç 135
    169:D 068
    170:d 100
    171:E 069
    172:e 101
    173:é 130
    174:ê 136
    175:ë 137
    176:è 138
    177:É 144
    178:F 070
    179:f 102
    180:ƒ 159
    181:G 071
    182:g 103
    183:H 072
    184:h 104
    185:I 073
    186:i 105
    187:ï 139
    188:î 140
    189:ì 141
    190:í 161
    191:J 074
    192:j 106
    193:K 075
    194:k 107
    195:L 076
    196:l 108
    197:M 077
    198:m 109
    199:N 078
    200:n 110
    201:ñ 164
    202:Ñ 165
    203:ⁿ 252
    204:O 079
    205:o 111
    206:ô 147
    207:ö 148
    208:ò 149
    209:Ö 153
    210:ó 162
    211:º 167
    212:P 080
    213:p 112
    214:Q 081
    215:q 113
    216:R 082
    217:r 114
    218:S 083
    219:s 115
    220:ß 225
    221:T 084
    222:t 116
    223:U 085
    224:u 117
    225:û 150
    226:ù 151
    227:ú 163
    228:ü 129
    229:Ü 154
    230:V 086
    231:v 118
    232:W 087
    233:w 119
    234:X 088
    235:x 120
    236:Y 089
    237:y 121
    238:ÿ 152
    240:z 122
    

    SORT appeared to sort upper case before lower case. (EDIT: I just read the help for SORT and learned that it does not differentiate between upper and lower case. The fact that my SORT output consistently put upper before lower is probably a result of the order of the input.) But regex apparently sorts lower case before upper case. All of the following ranges fail to match any characters.

    >findstr /nrc:"^[A-a]" sortedChars.txt
    
    >findstr /nrc:"^[B-b]" sortedChars.txt
    
    >findstr /nrc:"^[C-c]" sortedChars.txt
    
    >findstr /nrc:"^[D-d]" sortedChars.txt
    

    Reversing the order finds the characters.

    >findstr /nrc:"^[a-A]" sortedChars.txt
    151:A 065
    152:a 097
    
    >findstr /nrc:"^[b-B]" sortedChars.txt
    163:B 066
    164:b 098
    
    >findstr /nrc:"^[c-C]" sortedChars.txt
    165:C 067
    166:c 099
    
    >findstr /nrc:"^[d-D]" sortedChars.txt
    169:D 068
    170:d 100
    

    There are additional characters that regex sorts differently than SORT, but I haven't got a precise list.

  • 2020-11-29 07:18

    Everyone above is wrong. The order of the alpha characters is the following: aAbBcCdDeE..zZ. So echo a | findstr /r "[A-Z]" returns nothing, since a is outside of that range.

    echo abc|findstr /r "[A-Z][A-Z][A-Z]" also returns nothing: a is outside the range, so the first range group can only match b, the second matches c, and the third has nothing left to match. The whole pattern therefore finds nothing.

    To match any letter of the Latin alphabet, use [a-Z].
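    The ordering claimed above can be checked against the observed results with a small Python model. This is only a sketch: it assumes the aAbB..zZ order stated in this answer, covers unaccented Latin letters only, and the names COLLATION, in_range and search are made up for the illustration. It is not how FINDSTR is implemented.

```python
import string

# Assumed order, taken from this answer: a A b B c C ... z Z.
COLLATION = "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)
RANK = {ch: position for position, ch in enumerate(COLLATION)}


def in_range(ch, start, end):
    """Return True if ch lies within [start-end] under the assumed order."""
    if ch not in RANK:
        return False
    return RANK[start] <= RANK[ch] <= RANK[end]


def search(text, ranges):
    """Return True if some substring matches the consecutive ranges."""
    width = len(ranges)
    for offset in range(len(text) - width + 1):
        window = text[offset:offset + width]
        if all(in_range(ch, lo, hi) for ch, (lo, hi) in zip(window, ranges)):
            return True
    return False
```

    Under this model in_range("a", "A", "Z") is false while in_range("z", "A", "Z") is true, and search("abc", [("A", "Z")] * 3) is false, which agrees with the FINDSTR behaviour reported on this page. A reversed pair such as [A-a] is empty in the model because the start ranks after the end.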

  • 2020-11-29 07:23

    So, if you want to match:

    • only numbers : FindStr /R "^[0123-9]*$"

    • octal : FindStr /R "^[0123-7]*$"

    • hexadecimal : FindStr /R "^[0123-9aAb-Cd-EfF]*$"

    • alpha with no accent : FindStr /R "^[aAb-Cd-EfFg-Ij-NoOp-St-Uv-YzZ]*$"

    • alphanumeric : FindStr /R "^[0123-9aAb-Cd-EfFg-Ij-NoOp-St-Uv-YzZ]*$"
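    Why the digits are written as 0123-9 rather than 0-9 can be illustrated with a short Python sketch. It assumes the digit ordering reported in the ^[0-9] output of the first answer on this page (½ and ¼ between 0 and 1, ² between 2 and 3). The helper expand_class is hypothetical and only understands plain characters and x-y ranges.

```python
# Collation fragment copied from the ^[0-9] output of the first answer.
DIGIT_ORDER = ["0", "½", "¼", "1", "2", "²", "3", "4", "5", "6", "7", "8", "9"]


def expand_class(spec, order):
    """Expand a bracket body such as '0123-9' into the set it would match."""
    rank = {ch: position for position, ch in enumerate(order)}
    members = set()
    i = 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == "-":
            # A range takes everything between its endpoints in `order`.
            low, high = rank[spec[i]], rank[spec[i + 2]]
            members.update(order[low:high + 1])
            i += 3
        else:
            # A single listed character matches only itself.
            members.add(spec[i])
            i += 1
    return members
```

    With this ordering, expand_class("0-9", DIGIT_ORDER) picks up ½, ¼ and ² as well as the ten digits, whereas expand_class("0123-9", DIGIT_ORDER) yields the ten digits only, because no extra character sorts between 3 and 9.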

  • 2020-11-29 07:34

    This appears to be caused by the use of ranges within regular expression searches.

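    The lower case form of the first character in the range is not matched (a fails against [A-C], and b fails against [B-C]), while the lower case forms of the later characters are. It doesn't occur at all for non-ranges.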

    > echo a | findstr /r "[A-C]"
    > echo b | findstr /r "[A-C]"
        b
    > echo c | findstr /r "[A-C]"
        c
    > echo d | findstr /r "[A-C]"
    > echo b | findstr /r "[B-C]"
    > echo c | findstr /r "[B-C]"
        c
    
    > echo a | findstr /r "[ABC]"
    > echo b | findstr /r "[ABC]"
    > echo c | findstr /r "[ABC]"
    > echo d | findstr /r "[ABC]"
    > echo b | findstr /r "[BC]"
    > echo c | findstr /r "[BC]"
    
    > echo A | findstr /r "[A-C]"
        A
    > echo B | findstr /r "[A-C]"
        B
    > echo C | findstr /r "[A-C]"
        C
    > echo D | findstr /r "[A-C]"
    

    According to the SS64 CMD FINDSTR page (which, in a stunning display of circularity, references this question), the range [A-Z]:

    ... includes the complete English alphabet, both upper and lower case (except for "a"), as well as non-English alpha characters with diacriticals.

    To get around the problem in my environment, I simply used explicit character lists (such as [ABCD] rather than [A-D]). A more sensible approach for those who are allowed would be to download Cygwin or GnuWin32 and use grep from one of those packages.
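    If the explicit lists get long, they can be generated rather than typed. The following Python helper is a hypothetical sketch of that workaround: it spells out every character between two endpoints in ASCII order, and it deliberately refuses anything other than ASCII letters and digits so that characters with special meaning inside brackets never end up in the list.

```python
def explicit_class(first, last):
    """Build an explicit class such as [ABCD] from two ASCII endpoints."""
    chars = [chr(code) for code in range(ord(first), ord(last) + 1)]
    # An empty list means the endpoints were given in reverse ASCII order.
    if not chars or not all(c.isascii() and c.isalnum() for c in chars):
        raise ValueError("only ascending runs of ASCII letters or digits are handled")
    return "[" + "".join(chars) + "]"
```

    For example, explicit_class("A", "D") returns "[ABCD]", which can then be used in place of [A-D] in the FINDSTR pattern.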
