问题
Please I need help. I've got a problem when trying to find accented words in a text (in Spanish). I have to search in a large text the first paragraph starting with the words 'Nombre vernáculo'
For example, the text is like: "Nombre vernáculo registrado en la zona de ..."
But accented words are not recoginzed by my python script.
I've tryed with:
re.compile('/(?<!\p{L})(vern[áa]culo*)(?!\p{L})/')
re.compile(r'Nombre vern[a\xc3\xa1]culo\.', re.UNICODE)
re.compile ('[A-Z][a-záéíóúñ]+')
\p{Lu}] [\p{Ll}]+ \b
I've read the following threads:
grep/regex can't find accented word
Python Regex strange behavior with accented characters
Python regex and accented Expression
Python: using regex and tokens with accented chars (negative lookbehind)
Also I found something that almost work:
In [95]: dd=re.search(r'^\w.*', 'Nombre vernáculo' )
In [96]: dd.group(0)
Out[96]: 'Nombre vern\xc3\xa1culo'
But it also returns all accented words in the text.
Any help with this will be appreciaded. Thanks.
回答1:
The simplest way to do this is the same way you'd do it in Python 3. This means you have to explicitly use unicode
instead of str
objects, include u
-prefixed string literals. And, ideally, an explicit coding declaration at the top of your file so you can write the literals in Unicode as well.
# -*- coding: utf-8 -*-
import re
pattern = re.compile(ur'Nombre vern[aá]culo'`)
text = u'Nombre vernáculo'
match = pattern.search(text)
print match
Notice that I left off the \.
on the end of the pattern. Your text doesn't end in a .
, so you shouldn't be looking for one, or it's going to fail.
Of course if you want to search text that comes from somewhere besides your source code, you'll need to decode('utf-8')
it, or io.open
or codecs.open
the file instead of just open
, etc.
If you can't use a coding declaration, or can't trust your text editor to handle UTF-8, you can still use Unicode strings, just escape the characters with their Unicode code points:
import re
pattern = re.compile(ur'Nombre vern[a\xe1]culo'`)
text = u'Nombre vern\xe1culo'
match = pattern.search(text)
print match
If you have to use str
, then you do have to manually encode to UTF-8 and escape the individual bytes, as you were trying to do. But now you're not trying to match a single character, but a multi-character sequence, \xc3\xa1
. So you can't use a character class. Instead, you have write it out explicitly as a group with alternation:
pattern = re.compile(r'Nombre vern(?:a|\xc3\xa1)culo')
text = 'Nombre vern\xc3\xa1culo'
match = pattern.search(text)
print match
回答2:
import re
r1 = re.compile(r'(Nombre vernáculo)')
x = 'Nombre vernáculo registrado en la zona de'
match = r1.search(x)
print(match.group(1))
with python 2:
/tmp> python2 test.py
File "test.py", line 5
SyntaxError: Non-ASCII character '\xc3' in file test.py on line 5, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
with python 3:
/tmp> python3 test.py
Nombre vernáculo
来源:https://stackoverflow.com/questions/50847976/python-regex-to-find-accented-words