Python - pyparsing unicode characters

伪装坚强ぢ 2020-12-16 16:54

I tried using w = Word(printables), but it isn't working. How should I write the spec for this? 'w' is meant to process Hindi characters (UTF-8).

The code specif

3 answers
  • 2020-12-16 17:11

    As a general rule, do not process encoded bytestrings. Decode them into proper unicode strings (by calling their .decode method) as soon as possible, do all of your processing on unicode strings, and then, if I/O requires it, .encode them back into whatever bytestring encoding you need.

    If you're talking about literals, as it seems you are in your code, the "as soon as possible" is at once: use u'...' to express your literals. In a more general case, where you're forced to do I/O in encoded form, it's immediately after input (just as it's immediately before output if you need to perform output in a specific encoded form).
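
    In Python 3, where str is already Unicode, the same decode-early, encode-late pattern might look like the sketch below. The sample bytes stand in for data read from a file or socket, and the variable names are illustrative only.

```python
# Stand-in for UTF-8 bytes arriving from a file or socket (assumed input).
raw = "नमस्ते दुनिया".encode("utf-8")

# Decode once, at the input boundary.
text = raw.decode("utf-8")

# All processing happens on the Unicode string.
words = text.split()
reordered = " ".join(reversed(words))

# Encode once, at the output boundary.
out = reordered.encode("utf-8")
```

    Each Devanagari character takes several bytes in UTF-8, so len(raw) is larger than len(text); working on the decoded string keeps indexing and splitting character-based.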

  • 2020-12-16 17:19

    I was searching for French Unicode characters and came across this question. If you need French or other Latin accented characters, with pyparsing 2.3.0 you can use:

    >>> import pyparsing as pp
    >>> pp.pyparsing_unicode.Latin1.alphas
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
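
    As a rough illustration of how that character set could be used, the sketch below builds a Word from Latin1.alphas and parses a short French phrase. The sample phrase is an assumption, and the text is normalized to NFC first so that each accented letter is a single code point found in the set.

```python
import unicodedata

import pyparsing as pp

# Word built from the Latin-1 letters, including the accented ones.
french_word = pp.Word(pp.pyparsing_unicode.Latin1.alphas)

# Normalize to NFC so that "é" is one code point, not "e" + combining accent.
sample = unicodedata.normalize("NFC", "élève français")

# parseString is the pyparsing 2.x spelling; 3.x also offers parse_string.
tokens = pp.OneOrMore(french_word).parseString(sample).asList()
print(tokens)
```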
    
  • 2020-12-16 17:34

    Pyparsing's printables only covers characters in the ASCII range. You want printables in the full Unicode range, like this:

    import sys  # Python 2 version
    unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
                                            if not unichr(c).isspace())
    

    Now you can define trans using this more complete set of non-space characters:

    trans = Word(unicodePrintables)
    

    I was unable to test against your Hindi test string, but I think this will do the trick.

    If you are using Python 3, there is no separate unichr function and no xrange generator; just use:

    import sys
    unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode)
                                            if not chr(c).isspace())
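
    A quick membership check, using only the standard library, may help show why the ASCII set fails on Hindi text while the full-range set does not. The ASCII set here is a rough stand-in for pyparsing's printables (an assumption, not its actual definition), and the sample word is hypothetical.

```python
import string
import sys

# Rough stand-in for pyparsing's ASCII-only printables (assumption:
# the printable ASCII characters minus whitespace).
ascii_printables = set(c for c in string.printable if not c.isspace())

# Full-range set of non-space characters, kept as a set for fast lookups.
# The +1 makes the range include the last code point as well.
unicode_printables = set(
    chr(c) for c in range(sys.maxunicode + 1) if not chr(c).isspace()
)

hindi_word = "नमस्ते"
in_ascii = all(ch in ascii_printables for ch in hindi_word)
in_unicode = all(ch in unicode_printables for ch in hindi_word)
print(in_ascii, in_unicode)  # False True
```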
    

    EDIT:

    With the recent release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.

    import pyparsing as pp
    pp.Word(pp.pyparsing_unicode.printables)
    pp.Word(pp.pyparsing_unicode.Devanagari.printables)
    pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
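
    For the Hindi case in the question, a minimal sketch along these lines might work, assuming pyparsing 2.3.0 or later is installed. The sample text is an assumption, since the question's own test string is not shown.

```python
import pyparsing as pp

# One Hindi "word": a run of non-space characters from the Devanagari range.
hindi_word = pp.Word(pp.pyparsing_unicode.Devanagari.printables)

# parseString is the pyparsing 2.x spelling; 3.x also offers parse_string.
tokens = pp.OneOrMore(hindi_word).parseString("नमस्ते दुनिया").asList()
print(tokens)
```

    Devanagari.printables is used here rather than Devanagari.alphas because, as far as I can tell, the vowel signs and the virama are combining marks that str.isalpha() rejects, so an alphas-based Word would likely stop partway through a word.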
    