Implications of LC_ALL=C to speed up grep

生来不讨喜 2020-12-31 02:00

I just discovered that prefixing my grep commands with LC_ALL=C does wonders for speeding grep up.

But I am wondering about the implications.

Would a

1 Answer
  • 2020-12-31 02:16

    You don't necessarily need UTF-8 to run into trouble here. The locale defines the character classes, i.e. it determines which characters count as a space, a letter, or a digit. Consider these two examples:

    $ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep '[[:alnum:]]' || echo false
    ä
    $ echo -e '\xe4' | LC_ALL=C grep '[[:alnum:]]' || echo false
    false
    

    When matching exact byte sequences against each other, however, the locale makes no difference:

    $ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep "$(echo -e '\xe4')" || echo false
    ä
    $ echo -e '\xe4' | LC_ALL=C grep "$(echo -e '\xe4')" || echo false
    ä
    

    I'm not sure how fully grep implements Unicode, or how well equivalent code points are matched to each other. However, matching any subset of ASCII, and matching single characters that have no alternate binary representation, should work fine regardless of locale.
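
To check this on your own system, a small script along these lines may help. It is only a sketch: `grep_under` is a hypothetical helper name, and `ALT_LOCALE` defaults to `en_US.UTF-8` purely as an assumption, so substitute a locale that `locale -a` actually lists. The literal `ä` is assumed to be stored as UTF-8 in the script file.

```shell
#!/bin/sh
# Hypothetical helper: report whether grep finds a pattern in some input
# when run under a given locale.
grep_under() {
    # $1 = locale name, $2 = pattern, $3 = input text
    if printf '%s\n' "$3" | LC_ALL="$1" grep -q -e "$2"; then
        echo "match"
    else
        echo "no match"
    fi
}

# ALT_LOCALE is an assumption; choose one that `locale -a` lists on your system.
ALT_LOCALE="${ALT_LOCALE:-en_US.UTF-8}"

# ASCII-only pattern and input: both locales should agree.
echo "ASCII literal, C:     $(grep_under C 'error 42' 'fatal error 42 in main')"
echo "ASCII literal, alt:   $(grep_under "$ALT_LOCALE" 'error 42' 'fatal error 42 in main')"

# ASCII input against a character class: both locales should agree as well.
echo "ASCII class, C:       $(grep_under C '^[[:alnum:]]*$' 'abc123')"
echo "ASCII class, alt:     $(grep_under "$ALT_LOCALE" '^[[:alnum:]]*$' 'abc123')"

# Non-ASCII input against a character class: the results may differ here.
echo "non-ASCII class, C:   $(grep_under C '^[[:alpha:]]*$' 'ä')"
echo "non-ASCII class, alt: $(grep_under "$ALT_LOCALE" '^[[:alpha:]]*$' 'ä')"
```

If the last two lines disagree, your patterns rely on locale-dependent character classes, and LC_ALL=C would change what they match. If the alternate locale is not installed, grep typically falls back to the C locale and all pairs will agree, so confirm the locale exists before drawing conclusions.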
