grep/regex can't find accented word

抹茶落季 2021-01-18 20:41

I'm trying to build a regex that extracts words from a file, keeping only the words in which every letter matches a given word pattern.

My problem is that the regex can't find accented words.

5 Answers
  • 2021-01-18 21:18

    If your file is encoded in ISO-8859-1 but your system locale is UTF-8, this will not work.

    Either convert the file to UTF-8 or change your system locale to ISO-8859-1.

    # convert from ISO-8859-1 to the current locale's encoding before grepping
    # output will be in the current locale's encoding
    $ iconv -f ISO-8859-1 input/words.txt | grep ...
    
    # run grep under an ISO-8859-1 locale (it must be installed; check with `locale -a`)
    # output will be in ISO-8859-1 encoding
    $ cat input/words.txt | env LC_ALL=en_US.ISO-8859-1 grep ...
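
    Before picking one of the two options, it helps to confirm which encoding the file really uses. The following is only a sketch: the helper name is_utf8 and the temporary demo files are made up for illustration, and it assumes an iconv that exits non-zero on input that is illegal in the source encoding.

```shell
# Hypothetical helper: succeeds only if the file is valid UTF-8.
# Assumption: iconv exits non-zero when it meets a byte sequence that
# is illegal in the source encoding.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Demo data (not the asker's file): "éra" written as ISO-8859-1,
# where é is the single byte 0xE9 (octal 351).
latin1_file=$(mktemp)
utf8_file=$(mktemp)
trap 'rm -f "$latin1_file" "$utf8_file"' EXIT
printf '\351ra\n' > "$latin1_file"

# Convert explicitly to UTF-8 rather than relying on the current locale.
iconv -f ISO-8859-1 -t UTF-8 "$latin1_file" > "$utf8_file"

for f in "$latin1_file" "$utf8_file"; do
    if is_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not UTF-8, convert it with iconv before grepping"
    fi
done
```

    A file that passes the check can be grepped directly under a UTF-8 locale; one that fails needs one of the two approaches above.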
    
  • 2021-01-18 21:20

    Assuming everything is UTF-8, I’d usually just use something like

    perl -CSAD -Mutf8 -nle 'print if /^carroça{1,3}$/' filenames
    

    because then I know what it’s doing.

  • 2021-01-18 21:29

    Try as @dule said, but with LANG=en_US.iso88591:

    cat input/words.txt | LANG=en_US.iso88591 grep '^[éra]\{1,4\}$' > output/words_era.txt
    
  • 2021-01-18 21:36

    I found a related question here, and the approach from it seems to work.

    So if you try something like:

    cat input/words.txt | LANG=C grep '^[éra]\{1,4\}$' > output/words_era.txt
    

    Does that produce what you expect?
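
    One caveat with LANG=C, sketched below: in a byte-oriented locale the bracket expression matches single bytes, so the \{1,4\} bound counts bytes rather than letters. The helper name count_matches and the two sample words are made up for illustration; the pattern and words are built from explicit UTF-8 bytes so the demo does not depend on how the script file itself is encoded.

```shell
# \303\251 is "é" in UTF-8. The resulting pattern is ^[éra]\{1,4\}$
pattern=$(printf '^[\303\251ra]\\{1,4\\}$')

# Hypothetical helper: count matching lines for one word under the C locale.
# $1 is the word, written with octal escapes for printf.
count_matches() {
    # grep -c exits non-zero when the count is 0, hence "|| true".
    printf "$1\n" | LC_ALL=C grep -c "$pattern" || true
}

four_bytes=$(count_matches '\303\251ra')         # "éra": 3 letters, 4 bytes
five_bytes=$(count_matches '\303\251r\303\251')  # "éré": 3 letters, 5 bytes

echo "éra matched lines: $four_bytes"
echo "éré matched lines: $five_bytes"
```

    With a byte-oriented grep I would expect the first count to be 1 and the second to be 0, even though both words have three letters. So LANG=C can make accented words show up, but the repetition limit then applies to bytes.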

  • 2021-01-18 21:37

    My problem is, the regex can't find accented words, but in my text file there are a lot of accented words.

    My command line is:

    cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
    cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt
    
    [...]
    

    How can I fix it?

    grep searches these files as a stream of bytes (8-bit characters) and interprets those bytes according to your current locale settings, so the file's encoding has to agree with the locale.

    This gets worse when the words.txt files are encoded in UTF-8, UTF-16, or UTF-32, or in ISO-8859-1 (Latin-1), and that encoding differs from the one your locale expects.
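
    To make the byte-level difference concrete, here is a small sketch that dumps the same word in two encodings. The sample word and the variable names are made up for illustration; it assumes iconv and od are available.

```shell
# "éra" written as ISO-8859-1: é is the single byte 0xE9 (octal 351).
tmp=$(mktemp)
printf '\351ra\n' > "$tmp"

# Hex dump of the original bytes, with od's spacing and newlines removed.
latin1_hex=$(od -An -tx1 "$tmp" | tr -d ' \n')

# The same word after conversion to UTF-8: é becomes two bytes, 0xC3 0xA9.
utf8_hex=$(iconv -f ISO-8859-1 -t UTF-8 "$tmp" | od -An -tx1 | tr -d ' \n')

echo "ISO-8859-1 bytes: $latin1_hex"   # e972610a
echo "UTF-8 bytes:      $utf8_hex"     # c3a972610a
rm -f "$tmp"
```

    A pattern typed in a UTF-8 terminal contains the two-byte form, so it cannot match the one-byte form stored in an ISO-8859-1 file unless one side is converted.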

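    To handle all of these encodings, you can use ugrep instead of grep; it processes UTF-encoded files and matches Unicode patterns. Note that ugrep uses extended regex syntax by default, so the braces are not escaped (or pass -G to keep the basic syntax):

    cat input/words.txt | ugrep '^[éra]{1,4}$' > output/words_era.txt
    cat input/words.txt | ugrep '^[carroça]{1,7}$' > output/words_carroca.txt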
    

    This produces output encoded in UTF-8. If the input files are encoded in ISO-8859-1, run ugrep with the option --encoding=ISO-8859-1 (older releases spelled this -QISO-8859-1). The ugrep output is always UTF-8, however.
