I'm trying to build a regex that picks out the words in a file whose letters all belong to a given set of letters.
My problem is that the regex can't find accented words, and my text file contains a lot of them.
My command line is:
cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt
[...]
How can I fix it?
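grep interprets both the pattern and the input according to your current locale settings. In a non-UTF-8 locale such as C or POSIX, it searches the file as a stream of bytes (8-bit characters), so a multibyte letter like é inside a bracket expression is treated as separate bytes rather than as one character.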
It gets worse if your words.txt file is encoded in UTF-8, UTF-16, UTF-32, or ISO-8859-1 (Latin-1).
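The byte-versus-character difference can be seen with a small sketch. It assumes a POSIX shell and a grep that honors LC_ALL; the sample letter is built from its UTF-8 bytes so the script does not depend on how it is saved, and the last step only runs if `locale -a` reports a UTF-8 locale on your system.

```shell
#!/bin/sh
# é encoded in UTF-8 is the two bytes 0xC3 0xA9 (octal 303 251).
word=$(printf '\303\251')

# Count the bytes: one visible letter, two bytes on disk.
bytes=$(printf '%s' "$word" | LC_ALL=C wc -c | tr -d ' ')
echo "bytes: $bytes"

# In the C locale grep works byte by byte, so "a line of exactly
# one character" does not match the two-byte letter.
c_matches=$(printf '%s\n' "$word" | LC_ALL=C grep -c '^.$' || true)
echo "C locale matches: $c_matches"

# If a UTF-8 locale is installed, try the same pattern under it.
utf8_locale=$(locale -a 2>/dev/null | grep -i 'utf-\{0,1\}8' | head -n 1)
if [ -n "$utf8_locale" ]; then
    printf '%s\n' "$word" | LC_ALL="$utf8_locale" grep -c '^.$' || true
fi
```

If the last command prints 1 on your machine, running the original grep commands under that locale may already be enough for UTF-8 input.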
To handle all such encodings, use ugrep instead of grep to process files encoded in UTF and to match Unicode patterns:
cat input/words.txt | ugrep '^[éra]{1,4}$' > output/words_era.txt
cat input/words.txt | ugrep '^[carroça]{1,7}$' > output/words_carroca.txt
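This produces output encoded in UTF-8. If the input files are encoded in ISO-8859-1, use ugrep with the option --encoding=ISO-8859-1 (early ugrep releases spelled this -QISO-8859-1; in current releases -Q starts the interactive query mode). The ugrep output is always UTF-8, however.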
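As an alternative that is not part of the approach above, a Latin-1 file can be converted to UTF-8 before searching. The sketch below assumes iconv and od are available, and uses a made-up sample word instead of the real words.txt.

```shell
#!/bin/sh
# Sample word "café" in ISO-8859-1: é is the single byte 0xE9 (octal 351).
# Convert it to UTF-8 and dump the resulting bytes as hex.
hex=$(printf 'caf\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

# After conversion é is stored as the two bytes c3 a9.
echo "$hex"
```

Applied to the real file, the same conversion would be `iconv -f ISO-8859-1 -t UTF-8 input/words.txt`, piped into the search command of your choice.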