python regex to find accented words

不羁岁月 submitted on 2021-02-20 04:08:23

Question


Please, I need help. I have a problem finding accented words in a Spanish text. I have to search a large text for the first paragraph that starts with the words 'Nombre vernáculo'.
For example, the text looks like: "Nombre vernáculo registrado en la zona de ..."
But accented words are not recognized by my Python script.

I've tried:

re.compile('/(?<!\p{L})(vern[áa]culo*)(?!\p{L})/')
re.compile(r'Nombre vern[a\xc3\xa1]culo\.', re.UNICODE)
re.compile ('[A-Z][a-záéíóúñ]+')
\p{Lu}] [\p{Ll}]+ \b

I've read the following threads:

grep/regex can't find accented word
Python Regex strange behavior with accented characters
Python regex and accented Expression
Python: using regex and tokens with accented chars (negative lookbehind)

I also found something that almost works:

In [95]: dd=re.search(r'^\w.*', 'Nombre vernáculo' )
In [96]: dd.group(0)
Out[96]: 'Nombre vern\xc3\xa1culo'

But it also returns all accented words in the text.

Any help with this will be appreciated. Thanks.


Answer 1:


The simplest way to do this is the same way you'd do it in Python 3. This means you have to explicitly use unicode objects instead of str objects, including u-prefixed string literals, and, ideally, put an explicit coding declaration at the top of your file so you can write the literals in Unicode as well.

# -*- coding: utf-8 -*-

import re

# Python 2: the ur prefix makes a raw unicode pattern, so the class holds two single characters
pattern = re.compile(ur'Nombre vern[aá]culo')
text = u'Nombre vernáculo'
match = pattern.search(text)
print match

Notice that I left off the \. on the end of the pattern. Your text doesn't end in a ., so you shouldn't be looking for one, or it's going to fail.

Of course if you want to search text that comes from somewhere besides your source code, you'll need to decode('utf-8') it, or io.open or codecs.open the file instead of just open, etc.
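
As a rough sketch of that point, written for Python 3 (where str is already Unicode): the snippet below assumes the text arrives either as UTF-8 bytes or in a UTF-8 encoded file. The file name sample.txt and its contents are made up for illustration.

```python
import io
import os
import re
import tempfile

pattern = re.compile('Nombre vern[aá]culo')

# Raw bytes, as they might arrive from a socket or a file opened in binary mode.
raw = 'Nombre vernáculo registrado en la zona de ...'.encode('utf-8')

# Decode first, then search: the regex now sees one character for the accented a.
decoded_match = pattern.search(raw.decode('utf-8'))

# Hypothetical sample file, written here only to keep the example self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write('Texto previo.\nNombre vernáculo registrado en la zona de ...\n')

# io.open decodes while reading, so no manual decode() call is needed.
with io.open(path, encoding='utf-8') as f:
    file_match = pattern.search(f.read())

print(decoded_match.group(0))
print(file_match.group(0))
```

Either route ends with the pattern and the text both being Unicode, which is what lets the character class [aá] work.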


If you can't use a coding declaration, or can't trust your text editor to handle UTF-8, you can still use Unicode strings, just escape the characters with their Unicode code points:

import re

# \xe1 is the code point of the accented a (U+00E1), written in pure ASCII
pattern = re.compile(ur'Nombre vern[a\xe1]culo')
text = u'Nombre vern\xe1culo'
match = pattern.search(text)
print match

If you have to use str, then you do have to manually encode to UTF-8 and escape the individual bytes, as you were trying to do. But now you're not trying to match a single character, but a two-byte sequence, \xc3\xa1. A character class only ever matches one element, so you can't use one here. Instead, you have to write it out explicitly as a group with alternation:

pattern = re.compile(r'Nombre vern(?:a|\xc3\xa1)culo')
text = 'Nombre vern\xc3\xa1culo'
match = pattern.search(text)
print match
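
To see why the character class fails on encoded text, here is a small Python 3 sketch using bytes patterns, which behave like Python 2 str for this purpose. The variable names are only illustrative.

```python
import re

# UTF-8 encodes the accented a as the two bytes \xc3\xa1.
text = 'Nombre vernáculo'.encode('utf-8')

# A character class matches exactly one byte, so it consumes \xc3
# and then fails because the next byte is \xa1, not 'c'.
char_class = re.compile(br'Nombre vern[a\xc3\xa1]culo')

# Alternation treats the two-byte sequence \xc3\xa1 as a single unit.
alternation = re.compile(br'Nombre vern(?:a|\xc3\xa1)culo')

class_match = char_class.search(text)
alt_match = alternation.search(text)
plain_match = alternation.search(b'Nombre vernaculo')

print(class_match)
print(alt_match.group(0))
print(plain_match.group(0))
```

The character-class version finds nothing, while the alternation matches both the accented and the unaccented spelling.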



Answer 2:


import re
r1 = re.compile(r'(Nombre vernáculo)')
x = 'Nombre vernáculo registrado en la zona de'
match = r1.search(x)
print(match.group(1))

With Python 2:

/tmp> python2 test.py
  File "test.py", line 5
SyntaxError: Non-ASCII character '\xc3' in file test.py on line 5, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

With Python 3:

/tmp> python3 test.py 
Nombre vernáculo
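
For the original task (the first paragraph starting with those words), a minimal Python 3 sketch could look like the following. It assumes each paragraph sits on its own line, which the question does not confirm, and the sample text is invented.

```python
import re

# Hypothetical sample text; paragraphs are assumed to be separated by newlines.
text = (
    'Descripción general de la especie.\n'
    'Nombre vernáculo registrado en la zona de ...\n'
    'Nombre vernaculo citado en otra fuente.\n'
)

# With re.MULTILINE, ^ and $ anchor at every line boundary, and .* stops
# at the newline, so the match covers exactly one paragraph.
pattern = re.compile(r'^Nombre vern[aá]culo.*$', re.MULTILINE)

# search() returns only the first match, unlike findall().
match = pattern.search(text)
first_paragraph = match.group(0) if match else None

print(first_paragraph)
```

If paragraphs span several lines in the real text, the pattern would need a different terminator (for example a blank line), so treat this as a starting point only.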


Source: https://stackoverflow.com/questions/50847976/python-regex-to-find-accented-words
