Python - pyparsing unicode characters


I tried using w = Word(printables), but it isn't working. How should I write the spec for this? 'w' is meant to process Hindi characters (UTF-8).

The code specif


3 Answers

迷失自我

2020-12-16 17:11

As a general rule, do not process encoded bytestrings: turn them into proper unicode strings (by calling their .decode method) as soon as possible, do all of your processing on unicode strings, and then, only if I/O requires it, .encode them back into whatever bytestring encoding you need.

If you're talking about literals, as it seems you are in your code, the "as soon as possible" is at once: use u'...' to express your literals. In a more general case, where you're forced to do I/O in encoded form, it's immediately after input (just as it's immediately before output if you need to perform output in a specific encoded form).
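A minimal sketch of that decode-early, encode-late pattern, assuming Python 3 (where ordinary string literals are already unicode, so no u'...' prefix is needed). The sample text and the variable names are made up for illustration:

```python
# Bytes as they might arrive from a file or socket (UTF-8 encoded Hindi).
raw = "नमस्ते दुनिया".encode("utf-8")

# Decode once, immediately after input.
text = raw.decode("utf-8")

# All processing happens on the unicode string, not on the bytes.
words = text.split()
reordered = " ".join(reversed(words))

# Encode only at the output boundary.
out = reordered.encode("utf-8")
```

Note that len(raw) is larger than len(text) here, because each Devanagari character takes several bytes in UTF-8; that mismatch is one reason character-level parsing on the raw bytes goes wrong.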


失恋的感觉

2020-12-16 17:19

I was searching for French unicode characters and came across this question. If you need French or other Latin accented characters, with pyparsing 2.3.0 you can use:

>>> pp.pyparsing_unicode.Latin1.alphas
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'


无人共我

2020-12-16 17:34
Pyparsing's printables only covers characters in the ASCII range. You want printables in the full Unicode range, like this:
```
import sys  # needed for sys.maxunicode (Python 2 version)
unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
                                        if not unichr(c).isspace())
```
Now you can define trans using this more complete set of non-space characters:
```
trans = Word(unicodePrintables)
```
I was unable to test against your Hindi test string, but I think this will do the trick.

If you are using Python 3, there is no separate unichr function and no xrange generator; just use:
```
import sys  # needed for sys.maxunicode
unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode)
                                        if not chr(c).isspace())
```
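To see why the wider set matters, here is a small stdlib-only sketch (Python 3) that compares an ASCII-only set with the full-range set on a Hindi word. The ASCII set is built from string.printable as a stand-in for pyparsing's printables; the sample word and the names ascii_printables, ascii_ok, and unicode_ok are assumptions for illustration:

```python
import string
import sys

# ASCII-only stand-in for pyparsing's printables: printable, non-whitespace.
ascii_printables = "".join(c for c in string.printable if not c.isspace())

# Full-range set, built the same way as in the answer above.
unicode_printables = "".join(
    chr(c) for c in range(sys.maxunicode) if not chr(c).isspace()
)

sample = "नमस्ते"  # illustrative Hindi word

# A Word() built on a character set can only match characters in that set.
ascii_ok = all(ch in ascii_printables for ch in sample)
unicode_ok = all(ch in unicode_printables for ch in sample)
print(ascii_ok, unicode_ok)
```

The full-range string holds over a million characters, so it is worth building once and reusing rather than recomputing it per parser.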
EDIT:

With the recent release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.
```
import pyparsing as pp
pp.Word(pp.pyparsing_unicode.printables)
pp.Word(pp.pyparsing_unicode.Devanagari.printables)
pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
```
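As a rough sketch of how the Devanagari set might be applied to a Hindi string (this assumes pyparsing 2.3.0 or later is installed; the sample text and the names hindi_word and tokens are made up for illustration):

```python
import pyparsing as pp

# A word is any run of non-whitespace characters from the Devanagari ranges.
hindi_word = pp.Word(pp.pyparsing_unicode.Devanagari.printables)

text = "नमस्ते दुनिया"  # illustrative sample, not the asker's test string

# pyparsing skips the whitespace between words by default.
tokens = pp.OneOrMore(hindi_word).parseString(text).asList()
print(tokens)
```

Text that mixes Devanagari with ASCII digits or punctuation would not match this narrower set; in that case the broader pp.pyparsing_unicode.printables is probably the better choice.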