How to implement Unicode string matching by folding in Python

I have an application implementing incremental search. I have a catalog of Unicode strings to be matched and match them to a given "key" string; a catalog string is a "hi

5 Answers

时光说笑

2020-12-13 11:48

Have a look at this: ftp://alan.smcvt.edu/hefferon/unicode2ascii.py

Probably not complete, but might get you started.

心在旅途

2020-12-13 11:59

> For my application, I already addressed this in a different comment: I want to have a unicode result and leave unhandled characters untouched.

In that case, the correct way to do this is to create a Unicode Collation Algorithm (UCA) collator object with its strength set to primary, so that comparisons completely disregard diacritics.

I show how to do this using Perl in this answer. The first collator object is at the strength you need, while the second one considers accents for tie-breaking.

You will note that no strings have been harmed in the making of these comparisons: the original data is untouched.
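The same idea (compare on a derived key, leave the stored strings alone) can be sketched in Python with only the standard library. This is an approximation, not a real UCA collator: it drops combining marks and case instead of applying collation weights, and a proper collator would need an ICU binding such as PyICU. The names `primary_key` and `incremental_matches` are made up for illustration, and substring containment is only an assumed definition of a "hit".

```python
import unicodedata

def primary_key(s):
    """Rough stand-in for a primary-strength key: no combining marks, no case."""
    decomposed = unicodedata.normalize('NFD', s)
    # combining() is non-zero for combining marks such as U+0308 (diaeresis).
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def incremental_matches(key, catalog):
    """Return the original catalog strings whose folded form contains the folded key."""
    folded_key = primary_key(key)
    return [item for item in catalog if folded_key in primary_key(item)]

catalog = ['Östblocket', 'Ostrich', 'résumé', 'Straße']
print(incremental_matches('ost', catalog))     # ['Östblocket', 'Ostrich']
print(incremental_matches('resume', catalog))  # ['résumé']
```

The folded key is only used for the comparison; the strings returned are the untouched catalog entries.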

天命终不由人

2020-12-13 12:04
A general-purpose solution (especially for search normalization and generating slugs) is the unidecode module:

http://pypi.python.org/pypi/Unidecode

It's a port of the Text::Unidecode module for Perl. It's not complete, but it translates all Latin-derived characters I could find, transliterates Cyrillic, Chinese, etc. to Latin, and even handles full-width characters correctly.

It's probably a good idea to simply strip all characters you don't want to have in the final output or replace them with a filler (e.g. "äßœ$" will be unidecoded to "assoe$", so you might want to strip the non-alphanumerics). For characters it will transliterate but shouldn't (say, §=>SS and €=>EU), you need to clean up the input first:
```
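from unidecode import unidecode  # third-party package: Unidecode

input_str = u'äßœ$'
# Replace every non-alphanumeric character with a filler before transliterating.
input_str = u''.join([ch if ch.isalnum() else u'-' for ch in input_str])
input_str = str(unidecode(input_str)).lower()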
```
This would replace all non-alphanumeric characters with a dummy replacement and then transliterate the string and turn it into lowercase.

情话喂你

2020-12-13 12:12

You can use this strip_accents function to remove the accents:

```
import unicodedata

def strip_accents(s):
    # Decompose (NFD), then drop nonspacing combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

>>> strip_accents('Östblocket')
'Ostblocket'
```
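To see why filtering on category 'Mn' works, it can help to look at what NFD produces. The following is a small sketch; `describe_decomposition` is a hypothetical helper name, not part of any library.

```python
import unicodedata

def describe_decomposition(s):
    """List (code point, category, name) for each character of the NFD form of s."""
    return [
        (f'U+{ord(ch):04X}', unicodedata.category(ch), unicodedata.name(ch))
        for ch in unicodedata.normalize('NFD', s)
    ]

# 'Ö' splits into a base letter plus a combining mark of category 'Mn'.
for row in describe_decomposition('Ö'):
    print(row)

# 'ß' has no decomposition, so an 'Mn' filter leaves it untouched.
for row in describe_decomposition('ß'):
    print(row)
```

Characters without a decomposition, such as 'ß' or 'Ø', pass through unchanged, which fits the goal of keeping a Unicode result with unhandled characters left as they are.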


清歌不尽

2020-12-13 12:13
What about this one:
```
from unicodedata import normalize
normalize('NFKD', unicode_string).encode('ASCII', 'ignore').lower()
```
Taken from here (in Spanish): http://python.org.ar/pyar/Recetario/NormalizarCaracteresUnicode
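One thing to be aware of with this recipe: anything that does not decompose to ASCII is silently dropped rather than left in place, and on Python 3 the result is a bytes object. A minimal sketch, assuming Python 3 and a hypothetical wrapper name `fold_ascii`, shows the behaviour:

```python
import unicodedata

def fold_ascii(s):
    """NFKD-decompose, drop everything outside ASCII, lowercase; returns str."""
    decomposed = unicodedata.normalize('NFKD', s)
    # encode(..., 'ignore') discards every non-ASCII code point, including
    # letters like 'ß' that have no decomposition at all.
    return decomposed.encode('ascii', 'ignore').decode('ascii').lower()

print(fold_ascii('Östblocket'))  # ostblocket
print(fold_ascii('Straße'))      # strae
print(fold_ascii('Москва'))      # prints an empty line
```

So this works well for accented Latin text, but it does not keep unhandled characters untouched.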