grep/regex can't find accented word

抹茶落季 2021-01-18 20:41

I'm trying to write a regex that picks out the words in a file whose letters all belong to a given set of letters.

My problem is that the regex can't find accented words.

5 answers
  •  无人及你
    2021-01-18 21:37

    My problem is, the regex can't find accented words, but in my text file there are a lot of accented words.

    My command line is:

    cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
    cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt
    
    [...]
    

    How can I fix it?

    grep searches these files as a stream of bytes (8-bit characters) and interprets those bytes according to your current locale settings, so the file's encoding must also be compatible with that locale.

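    It gets worse if your words.txt file is encoded in UTF-8, UTF-16, or UTF-32, or in ISO-8859-1 (Latin-1), because the same accented letter is stored as a different byte sequence in each of these encodings.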
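
    To make the byte-versus-character point concrete, here is a minimal sketch (not part of the original commands) that counts how many bytes the letter é occupies in UTF-8 and in ISO-8859-1, then prints the encoding of the current locale if the locale command is available. The variable names are illustrative, and octal escapes are used so the script itself stays plain ASCII.

    ```shell
    # e-acute is two bytes in UTF-8 (0xC3 0xA9) but one byte in ISO-8859-1 (0xE9).
    utf8_len=$(printf '\303\251' | wc -c | tr -d '[:space:]')
    latin1_len=$(printf '\351' | wc -c | tr -d '[:space:]')

    echo "e-acute in UTF-8:      $utf8_len byte(s)"
    echo "e-acute in ISO-8859-1: $latin1_len byte(s)"

    # The locale decides how a tool such as grep decodes those bytes.
    if command -v locale >/dev/null 2>&1; then
        echo "locale encoding: $(locale charmap)"
    fi
    ```

    If a matcher works byte by byte, the two-byte UTF-8 form of é counts twice toward a limit such as \{1,4\}, which is one way accented words can be missed.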

    To handle all such encodings, use ugrep instead of grep to process files encoded in UTF-8, UTF-16, or UTF-32 and to match Unicode patterns. ugrep uses extended regular expressions by default, so the repetition is written {1,4} without backslashes:

    cat input/words.txt | ugrep '^[éra]{1,4}$' > output/words_era.txt
    cat input/words.txt | ugrep '^[carroça]{1,7}$' > output/words_carroca.txt
    

    This produces output encoded in UTF-8. If the input files are encoded in ISO-8859-1, use ugrep with the option --encoding=ISO-8859-1 (older ugrep releases spelled this -QISO-8859-1). The output is still UTF-8 in that case.
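
    Before choosing an encoding option, it can help to check whether a file is valid UTF-8 at all. The following is a rough sketch under stated assumptions: is_utf8 is a hypothetical helper defined here (not part of grep or ugrep), it relies on iconv being installed, and the two sample files are created only for the demonstration. A file that fails the check is merely "not UTF-8"; the sketch does not identify which other encoding it uses.

    ```shell
    # is_utf8 (hypothetical helper): succeeds if the file decodes cleanly as UTF-8.
    is_utf8() {
        iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
    }

    tmpdir=$(mktemp -d)
    printf 'caf\303\251\n' > "$tmpdir/utf8.txt"    # "cafe" with e-acute, UTF-8
    printf 'caf\351\n'     > "$tmpdir/latin1.txt"  # same word, ISO-8859-1

    for f in "$tmpdir/utf8.txt" "$tmpdir/latin1.txt"; do
        if is_utf8 "$f"; then
            echo "$f: valid UTF-8, no encoding option needed"
        else
            echo "$f: not UTF-8, tell ugrep the file's encoding"
        fi
    done

    rm -rf "$tmpdir"
    ```

    Note that a pure ASCII file also passes this check, since ASCII is a subset of UTF-8.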
