I'm trying to write a regex that extracts the words in a file whose letters all come from a given word pattern.
My problem is that the regex can't find accented words.
If your file is encoded in ISO-8859-1 but your system locale is UTF-8, this will not work.
Either convert the file to UTF-8 or change your system locale to ISO-8859-1.
# convert from ISO-8859-1 to the environmental locale before grepping
# output will be in the current locale
$ iconv -f 8859_1 input/words.txt | grep ...

# run grep with an ISO-8859-1 locale
# output will be in ISO-8859-1 encoding
$ cat input/words.txt | env LC_ALL=en_US grep ...
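Before picking one of the two fixes, it can help to check which encoding the file is actually in. The sketch below is only an illustration: `is_utf8` and `to_utf8` are hypothetical helper names, not part of any tool, and the fallback to ISO-8859-1 is an assumption taken from this question (any byte sequence decodes as Latin-1, so other encodings cannot be detected this way).

```shell
#!/bin/sh
# is_utf8 FILE: succeed (exit 0) if FILE decodes cleanly as UTF-8.
# iconv exits non-zero on a byte sequence that is invalid in the source
# encoding, so a UTF-8 -> UTF-8 round trip works as a validity check.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# to_utf8 FILE: print FILE as UTF-8.
# Assumption: a file that is not valid UTF-8 is ISO-8859-1.
to_utf8() {
    if is_utf8 "$1"; then
        cat "$1"
    else
        iconv -f ISO-8859-1 -t UTF-8 "$1"
    fi
}
```

With a UTF-8 system locale, something like `to_utf8 input/words.txt | grep ...` would then feed grep UTF-8 regardless of which of the two encodings the file started in.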
Assuming everything is UTF-8, I’d usually just use something like
perl -CSAD -Mutf8 -nle 'print if /^[carroça]{1,7}$/' filenames
because then I know what it’s doing.
Try what @dule suggested, but with LANG=en_US.iso88591:
cat input/words.txt | LANG=en_US.iso88591 grep '^[éra]\{1,4\}$' > output/words_era.txt
I found a related question here that seems to work.
So if you try something like:
cat input/words.txt | LANG=C grep '^[éra]\{1,4\}$' > output/words_era.txt
Does that produce what you expect?
My problem is that the regex can't find accented words, even though my text file contains a lot of them.
My command line is:
cat input/words.txt | grep '^[éra]\{1,4\}$' > output/words_era.txt
cat input/words.txt | grep '^[carroça]\{1,7\}$' > output/words_carroca.txt
[...]
How can I fix it?
grep searches these files as a stream of bytes (8-bit characters), and it interprets those bytes according to your current locale settings, so the file's encoding has to agree with the locale.
It gets worse if your words.txt files are encoded in UTF-8, UTF-16, or UTF-32, or in ISO-8859-1 (Latin-1).
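To make the byte-stream point concrete, here is a small sketch that dumps the bytes behind "é" in the two encodings discussed in this thread. `hex_bytes` is a hypothetical helper name, and the character is written with octal escapes so the result does not depend on how the script itself is saved.

```shell
#!/bin/sh
# hex_bytes: print the bytes read from stdin as one run of hex digits.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
    echo
}

printf '\351' | hex_bytes                                  # "é" in ISO-8859-1: e9
printf '\303\251' | hex_bytes                              # "é" in UTF-8: c3a9
printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes   # converted: c3a9
```

So when the bytes are read one at a time, a UTF-8 "é" in a pattern like '^[éra]\{1,4\}$' is two separate bytes, and it can never equal the single byte e9 found in an ISO-8859-1 file.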
To handle all of these encodings, use ugrep instead of grep. It processes UTF-encoded files and matches Unicode patterns:
cat input/words.txt | ugrep '^[éra]\{1,4\}$' > output/words_era.txt
cat input/words.txt | ugrep '^[carroça]\{1,7\}$' > output/words_carroca.txt
This produces output encoded in UTF-8. If the input files are encoded in ISO-8859-1, then use ugrep with option -QISO-8859-1. The ugrep output is always UTF-8, however.