Normalizing unicode text to filenames, etc. in Python

情书的邮戳 2021-02-01 05:42

Are there any standalone-ish solutions for normalizing international Unicode text to safe ids and filenames in Python?

E.g. turn My International Text: åäö

5 answers

孤街浪徒 (original poster)

2021-02-01 06:14
The way to solve this problem is to decide which characters are allowed (different systems have different rules for valid identifiers).

Once you decide on which characters are allowed, write an allowed() predicate and a dict subclass for use with str.translate:
```
def makesafe(text, allowed, substitute=None):
    ''' Remove unallowed characters from text.
        If *substitute* is defined, then replace
        the character with the given substitute.
    '''
    class D(dict):
        def __getitem__(self, key):
            return key if allowed(chr(key)) else substitute
    return text.translate(D())
```
This function is very flexible. It lets you easily specify rules for deciding which text is kept and which text is either replaced or removed.

Here's a simple example using the rule, "only allow characters whose Unicode general category starts with L (the letter categories)":
```
import unicodedata

def allowed(character):
    return unicodedata.category(character).startswith('L')

print(makesafe('the*ides&of*march', allowed, '_'))
print(makesafe('the*ides&of*march', allowed))
```
That code produces safe output as follows:
```
the_ides_of_march
theidesofmarch
```
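Applied to the question's example, the same idea might look like the sketch below. It is not part of the original answer: it assumes an ASCII-only target and adds an NFKD decomposition step so that accented letters keep their base letter. The names `slugify`, `not_combining`, and `is_ascii_alnum` are made up for illustration, and `makesafe` is repeated so the snippet runs on its own.

```python
import string
import unicodedata

def makesafe(text, allowed, substitute=None):
    ''' Same helper as above: drop or replace unallowed characters. '''
    class D(dict):
        def __getitem__(self, key):
            return key if allowed(chr(key)) else substitute
    return text.translate(D())

# Assumption: only ASCII letters and digits are safe in the target system.
ASCII_ALNUM = set(string.ascii_letters + string.digits)

def not_combining(character):
    # unicodedata.combining() returns 0 for non-combining characters.
    return unicodedata.combining(character) == 0

def is_ascii_alnum(character):
    return character in ASCII_ALNUM

def slugify(text):
    ''' Hypothetical helper: turn text into a lowercase, dash-separated id. '''
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Delete the combining marks so the base letters stay adjacent.
    stripped = makesafe(decomposed, not_combining)
    # Turn every other unallowed character into a space.
    spaced = makesafe(stripped, is_ascii_alnum, ' ')
    # Collapse runs of spaces into single dashes.
    return '-'.join(spaced.lower().split())

print(slugify('My International Text: åäö'))  # my-international-text-aao
```

Letters that have no Unicode decomposition (for example ø or ß) are not transliterated by this approach; they are simply treated as separators, so a transliteration table would be needed if those matter.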