What are the undocumented features and limitations of the Windows FINDSTR command?

耶瑟儿~ 2020-11-21 04:23

The Windows FINDSTR command is horribly documented. There is very basic command line help available through FINDSTR /?, or HELP FINDSTR, but it is

8 answers
  •  情歌与酒
    2020-11-21 05:24

    Answer continued from part 1 above - I've run into the 30,000 character answer limit :-(

    Limited Regular Expressions (regex) Support
    FINDSTR support for regular expressions is extremely limited. If it is not in the HELP documentation, it is not supported.

    Beyond that, the regular expressions that are supported are implemented in a completely non-standard manner, such that results can be different than would be expected coming from something like grep or perl.

    Regex Line Position anchors ^ and $
    ^ matches the beginning of the input stream as well as any position immediately following a <LF>. Since FINDSTR also breaks lines after <LF>, a simple regex of "^" will always match all lines within a file, even a binary file.

    $ matches any position immediately preceding a <CR>. This means that a regex search string containing $ will never match any lines within a Unix style text file, nor will it match the last line of a Windows text file if it is missing the EOL marker of <CR><LF>.

    Note - As previously discussed, piped and redirected input to FINDSTR may have <CR><LF> appended that is not in the source. Obviously this can impact a regex search that uses $.

    Any search string with characters before ^ or after $ will always fail to find a match.
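
    The $ rule can be modeled outside of FINDSTR to see why Unix style files never match. The Python sketch below is only a model of the behavior described above, not FINDSTR's implementation; the function names split_after_lf and lines_ending_with are made up for illustration.

```python
from typing import List


def split_after_lf(data: bytes) -> List[bytes]:
    # Break the input after every LF, keeping the terminator on each line.
    parts = data.split(b"\n")
    lines = [part + b"\n" for part in parts[:-1]]
    if parts[-1]:
        # Last line has no LF terminator.
        lines.append(parts[-1])
    return lines


def lines_ending_with(data: bytes, literal: bytes) -> List[bytes]:
    # Model of the regex "literal$": the literal must sit immediately
    # before a CR, because that is the only position $ is described to match.
    return [line for line in split_after_lf(data) if literal + b"\r" in line]


windows_text = b"foo\r\nbar\r\n"
unix_text = b"foo\nbar\n"
missing_eol = b"foo\r\nbar"

print(lines_ending_with(windows_text, b"bar"))  # the "bar" line is found
print(lines_ending_with(unix_text, b"bar"))     # nothing: no CR anywhere
print(lines_ending_with(missing_eol, b"bar"))   # nothing: last line lacks CR LF
```

    The model ignores the extra <CR><LF> that FINDSTR may append to piped or redirected input.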

    Positional Options /B /E /X
    The positional options work the same as ^ and $, except they also work for literal search strings.

    /B functions the same as ^ at the start of a regex search string.

    /E functions the same as $ at the end of a regex search string.

    /X functions the same as having both ^ at the beginning and $ at the end of a regex search string.

    Regex word boundary
    \< must be the very first term in the regex. The regex will not match anything if any other characters precede it. \< corresponds to either the very beginning of the input, the beginning of a line (the position immediately following a <LF>), or the position immediately following any "non-word" character. The next character need not be a "word" character.

    \> must be the very last term in the regex. The regex will not match anything if any other characters follow it. \> corresponds to either the end of input, the position immediately prior to a <CR>, or the position immediately preceding any "non-word" character. The preceding character need not be a "word" character.

    Here is a complete list of "non-word" characters, represented as the decimal byte code. Note - this list was compiled on a U.S. machine. I do not know what impact other languages may have on this list.

    001   028   063   179   204   230
    002   029   064   180   205   231
    003   030   091   181   206   232
    004   031   092   182   207   233
    005   032   093   183   208   234
    006   033   094   184   209   235
    007   034   096   185   210   236
    008   035   123   186   211   237
    009   036   124   187   212   238
    011   037   125   188   213   239
    012   038   126   189   214   240
    014   039   127   190   215   241
    015   040   155   191   216   242
    016   041   156   192   217   243
    017   042   157   193   218   244
    018   043   158   194   219   245
    019   044   168   195   220   246
    020   045   169   196   221   247
    021   046   170   197   222   248
    022   047   173   198   223   249
    023   058   174   199   224   250
    024   059   175   200   226   251
    025   060   176   201   227   254
    026   061   177   202   228   255
    027   062   178   203   229
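
    The table above can be turned into a small lookup to predict where \< and \> will match. This is a Python sketch of the rules as described in this section, not FINDSTR's own code. NON_WORD is transcribed from the table (so it carries the same U.S. machine caveat), and the function names are hypothetical. Each line is expected to be passed in with its original terminator.

```python
# Byte codes from the "non-word" table, written as inclusive ranges.
_RANGES = [(1, 9), (11, 12), (14, 47), (58, 64), (91, 94), (96, 96),
           (123, 127), (155, 158), (168, 170), (173, 224),
           (226, 251), (254, 255)]
NON_WORD = frozenset(b for lo, hi in _RANGES for b in range(lo, hi + 1))


def matches_word_start(line: bytes, literal: bytes) -> bool:
    # Model of the regex "\<literal": the literal must begin at the start
    # of the line or right after a non-word byte.
    start = line.find(literal)
    while start != -1:
        if start == 0 or line[start - 1] in NON_WORD:
            return True
        start = line.find(literal, start + 1)
    return False


def matches_word_end(line: bytes, literal: bytes) -> bool:
    # Model of the regex "literal\>": the literal must be followed by
    # end of input, a CR, or a non-word byte.
    start = line.find(literal)
    while start != -1:
        end = start + len(literal)
        if end == len(line) or line[end] == 0x0D or line[end] in NON_WORD:
            return True
        start = line.find(literal, start + 1)
    return False


print(matches_word_start(b"foo-bar\r\n", b"bar"))  # True: "-" is a non-word byte
print(matches_word_start(b"foo_bar\r\n", b"bar"))  # False: "_" is not in the table
print(matches_word_end(b"bar\n", b"bar"))          # False: LF is neither CR nor in the table
```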
    

    Regex character class ranges [x-y]
    Character class ranges do not work as expected. See this question: Why does findstr not handle case properly (in some circumstances)?, along with this answer: https://stackoverflow.com/a/8767815/1012053.

    The problem is that FINDSTR does not collate characters by their byte code value (commonly thought of as the ASCII code, but ASCII is only defined from 0x00 - 0x7F). Most regex implementations would treat [A-Z] as all upper case English letters. But FINDSTR uses a collation sequence that roughly corresponds to how SORT works. So [A-Z] includes the complete English alphabet, both upper and lower case (except for "a"), as well as non-English alpha characters with diacriticals.

    Below is a complete list of all characters supported by FINDSTR, sorted in the collation sequence used by FINDSTR to establish regex character class ranges. The characters are represented as their decimal byte code value. I believe the collation sequence makes the most sense if the characters are viewed using code page 437. Note - this list was compiled on a U.S. machine. I do not know what impact other languages may have on this list.

    001
    002
    003
    004
    005
    006
    007
    008
    014
    015
    016
    017
    018
    019
    020
    021
    022
    023
    024
    025
    026
    027
    028
    029
    030
    031
    127
    039
    045
    032
    255
    009
    010
    011
    012
    013
    033
    034
    035
    036
    037
    038
    040
    041
    042
    044
    046
    047
    058
    059
    063
    064
    091
    092
    093
    094
    095
    096
    123
    124
    125
    126
    173
    168
    155
    156
    157
    158
    043
    249
    060
    061
    062
    241
    174
    175
    246
    251
    239
    247
    240
    243
    242
    169
    244
    245
    254
    196
    205
    179
    186
    218
    213
    214
    201
    191
    184
    183
    187
    192
    212
    211
    200
    217
    190
    189
    188
    195
    198
    199
    204
    180
    181
    182
    185
    194
    209
    210
    203
    193
    207
    208
    202
    197
    216
    215
    206
    223
    220
    221
    222
    219
    176
    177
    178
    170
    248
    230
    250
    048
    172
    171
    049
    050
    253
    051
    052
    053
    054
    055
    056
    057
    236
    097
    065
    166
    160
    133
    131
    132
    142
    134
    143
    145
    146
    098
    066
    099
    067
    135
    128
    100
    068
    101
    069
    130
    144
    138
    136
    137
    102
    070
    159
    103
    071
    104
    072
    105
    073
    161
    141
    140
    139
    106
    074
    107
    075
    108
    076
    109
    077
    110
    252
    078
    164
    165
    111
    079
    167
    162
    149
    147
    148
    153
    112
    080
    113
    081
    114
    082
    115
    083
    225
    116
    084
    117
    085
    163
    151
    150
    129
    154
    118
    086
    119
    087
    120
    088
    121
    089
    152
    122
    090
    224
    226
    235
    238
    233
    227
    229
    228
    231
    237
    232
    234
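
    With the sequence above in hand, the members of any [x-y] range can be worked out by position rather than by byte value. The Python sketch below does that. COLLATION is transcribed from the list above (same U.S. machine caveat), class_range is a hypothetical helper, and the result is only a prediction of what FINDSTR will treat as inside the range.

```python
# Byte codes in the collation order listed above.
COLLATION = [
    1, 2, 3, 4, 5, 6, 7, 8, 14, 15, 16, 17, 18, 19, 20, 21,
    22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 127, 39, 45, 32, 255, 9,
    10, 11, 12, 13, 33, 34, 35, 36, 37, 38, 40, 41, 42, 44, 46, 47,
    58, 59, 63, 64, 91, 92, 93, 94, 95, 96, 123, 124, 125, 126, 173, 168,
    155, 156, 157, 158, 43, 249, 60, 61, 62, 241, 174, 175, 246, 251, 239, 247,
    240, 243, 242, 169, 244, 245, 254, 196, 205, 179, 186, 218, 213, 214, 201, 191,
    184, 183, 187, 192, 212, 211, 200, 217, 190, 189, 188, 195, 198, 199, 204, 180,
    181, 182, 185, 194, 209, 210, 203, 193, 207, 208, 202, 197, 216, 215, 206, 223,
    220, 221, 222, 219, 176, 177, 178, 170, 248, 230, 250, 48, 172, 171, 49, 50,
    253, 51, 52, 53, 54, 55, 56, 57, 236, 97, 65, 166, 160, 133, 131, 132,
    142, 134, 143, 145, 146, 98, 66, 99, 67, 135, 128, 100, 68, 101, 69, 130,
    144, 138, 136, 137, 102, 70, 159, 103, 71, 104, 72, 105, 73, 161, 141, 140,
    139, 106, 74, 107, 75, 108, 76, 109, 77, 110, 252, 78, 164, 165, 111, 79,
    167, 162, 149, 147, 148, 153, 112, 80, 113, 81, 114, 82, 115, 83, 225, 116,
    84, 117, 85, 163, 151, 150, 129, 154, 118, 86, 119, 87, 120, 88, 121, 89,
    152, 122, 90, 224, 226, 235, 238, 233, 227, 229, 228, 231, 237, 232, 234,
]
POSITION = {byte: index for index, byte in enumerate(COLLATION)}


def class_range(first: str, last: str) -> set:
    # Byte codes that fall between first and last in collation order,
    # i.e. the predicted members of the class [first-last].
    lo, hi = POSITION[ord(first)], POSITION[ord(last)]
    return set(COLLATION[lo:hi + 1])


upper = class_range("A", "Z")
print(ord("a") in upper)  # False: "a" sorts just before "A"
print(ord("b") in upper)  # True: "b" sorts between "A" and "Z"
print(sorted(class_range("0", "9") - set(range(48, 58))))  # [171, 172, 253]
```

    The same lookup shows that [a-z] picks up upper case A through Y but not Z, since Z sorts after z.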
    

    Regex character class term limit and BUG
    Not only is FINDSTR limited to a maximum of 15 character class terms within a regex, it fails to properly handle an attempt to exceed the limit. Using 16 or more character class terms results in an interactive Windows pop up stating "Find String (QGREP) Utility has encountered a problem and needs to close. We are sorry for the inconvenience." The message text varies slightly depending on the Windows version. Here is one example of a FINDSTR that will fail:

    echo 01234567890123456|findstr [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]
    

    This bug was reported by DosTips user Judago. It has been confirmed on XP, Vista, and Windows 7.
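
    A script that builds FINDSTR regexes dynamically could check the term count before launching the command. The following is a rough Python sketch of such a check. The names are hypothetical, and the parsing is a simplification: it assumes a backslash escapes the next character and that each class ends at the first ] after its opening [.

```python
MAX_CLASS_TERMS = 15  # limit stated above


def count_class_terms(regex: str) -> int:
    # Count [...] character class terms, skipping backslash-escaped characters.
    count = 0
    i = 0
    while i < len(regex):
        ch = regex[i]
        if ch == "\\":
            i += 2  # skip the escaped character
            continue
        if ch == "[":
            close = regex.find("]", i + 1)
            if close == -1:
                break  # unterminated class: stop counting
            count += 1
            i = close + 1
            continue
        i += 1
    return count


def exceeds_class_limit(regex: str) -> bool:
    return count_class_terms(regex) > MAX_CLASS_TERMS


failing = "[0-9]" * 16  # same shape as the failing example above
print(count_class_terms(failing), exceeds_class_limit(failing))  # 16 True
```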

    Regex searches fail (and may hang indefinitely) if they include byte code 0xFF (decimal 255)
    Any regex search that includes byte code 0xFF (decimal 255) will fail. It fails if byte code 0xFF is included directly, or if it is implicitly included within a character class range. Remember that FINDSTR character class ranges do not collate characters based on the byte code value. Character <0xFF> appears relatively early in the collation sequence, between the <space> and <tab> characters. So any character class range that includes both <space> and <tab> will fail.

    The exact behavior changes slightly depending on the Windows version. Windows 7 hangs indefinitely if 0xFF is included. XP doesn't hang, but it always fails to find a match, and occasionally prints the following error message - "The process tried to write to a nonexistent pipe."

    I no longer have access to a Vista machine, so I haven't been able to test on Vista.

    Regex bug: . and [^anySet] can match End-Of-File
    The regex . meta-character should only match any character other than <CR> or <LF>. There is a bug that allows it to match the End-Of-File if the last line in the file is not terminated by <CR> or <LF>. However, the . will not match an empty file.

    For example, a file named "test.txt" containing a single line of x, without terminating <CR> or <LF>, will match the following:

    findstr /r x......... test.txt
    

    This bug has been confirmed on XP and Win7.

    The same seems to be true for negative character sets. Something like [^abc] will match End-Of-File. Positive character sets like [abc] seem to work fine. I have only tested this on Win7.
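
    Whether a given file is exposed to this bug depends only on its final byte, so it is easy to check ahead of time. Below is a minimal Python sketch of the condition as described above (non-empty file whose last line is not terminated by <CR> or <LF>). The function name is made up, and the code does not run FINDSTR itself.

```python
def dot_can_match_eof(data: bytes) -> bool:
    # True when the file content meets the conditions described for the bug:
    # the file is not empty and its last byte is neither CR nor LF.
    if not data:
        return False
    return data[-1] not in (0x0D, 0x0A)


print(dot_can_match_eof(b"x"))      # True: like test.txt in the example
print(dot_can_match_eof(b"x\r\n"))  # False: last line is terminated
print(dot_can_match_eof(b""))       # False: empty file
```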
