Strip special characters and punctuation from a unicode string

滥情空心 2021-01-27 13:26

I'm trying to remove the punctuation from a Unicode string that may contain non-ASCII letters. I tried using the regex module:

import regex
text          


        
3 answers
  • 2021-01-27 13:55

    \p{P} matches punctuation characters.

    Among ASCII characters, those are

    ! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }
    

    < and > are not punctuation characters (Unicode classifies them as math symbols), so \p{P} does not remove them.

    Try this instead

    regex.sub(r'[\p{P}<>]+', '', text)
    
  • 2021-01-27 14:03

    < and > are classified as Math Symbols (Sm), not Punctuation (P). You can match either:

    regex.sub(r'[\p{P}\p{Sm}]+', '', text)
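
To check which category a given character falls into, the standard-library unicodedata module can be queried directly. The following is a minimal sketch for Python 3; the helper name category_of is made up here for illustration.

```python
import unicodedata


def category_of(ch):
    """Return the two-letter Unicode general category of a single character."""
    return unicodedata.category(ch)


# Categories starting with "P" are punctuation; "Sm" is Symbol, math.
for ch in "<>+!?-_":
    print(repr(ch), category_of(ch))
```

Characters such as < > + report Sm, while ! ? report Po, which is why a pattern built only on \p{P} leaves the angle brackets in place.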
    

    The unicode.translate() method also works. It takes a dictionary mapping integer codepoints to other integer codepoints, unicode characters, or None; mapping a codepoint to None removes it. Map string.punctuation to codepoints with ord():

    text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
    

    That removes only the limited set of ASCII punctuation characters.

    Demo:

    >>> import regex
    >>> text = u"<Üäik>"
    >>> print regex.sub(r'[\p{P}\p{Sm}]+', '', text)
    Üäik
    >>> import string
    >>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
    Üäik
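
The ASCII-only limitation of string.punctuation is easy to see with a second input. This is a small sketch written for Python 3 (where str is already Unicode); the names ascii_punct_table and strip_ascii_punctuation are illustrative, not part of any library.

```python
import string

# Translation table that deletes every character listed in string.punctuation.
# string.punctuation contains ASCII characters only.
ascii_punct_table = dict.fromkeys(ord(c) for c in string.punctuation)


def strip_ascii_punctuation(text):
    """Remove ASCII punctuation; non-ASCII punctuation is left untouched."""
    return text.translate(ascii_punct_table)


print(strip_ascii_punctuation("<Üäik>"))        # ASCII brackets are removed
print(strip_ascii_punctuation("«Üäik» ¿qué?"))  # « » ¿ survive, only ? is removed
```

If the text can contain quotation marks or punctuation from other scripts, a table built from the Unicode categories is the safer choice.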
    

    If string.punctuation is not enough, you can generate a complete translate() mapping for all P and Sm codepoints by iterating from 0 to sys.maxunicode and testing each value against unicodedata.category():

    >>> import sys, unicodedata
    >>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
    >>> print text.translate(toremove)
    Üäik
    

    (For Python 3, replace unicode with str, unichr() with chr(), and print ... with print(...)).
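
For reference, the same idea can be sketched in Python 3 using only the standard library. The function names build_removal_table and strip_punctuation are hypothetical, and the default prefixes simply mirror the P and Sm categories used above.

```python
import sys
import unicodedata


def build_removal_table(prefixes=("P", "Sm")):
    """Map every codepoint whose category starts with one of `prefixes` to None."""
    return dict.fromkeys(
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith(prefixes)
    )


# Building the table scans every codepoint, so do it once and reuse it.
TO_REMOVE = build_removal_table()


def strip_punctuation(text):
    """Delete all punctuation and math-symbol characters from text."""
    return text.translate(TO_REMOVE)


print(strip_punctuation("<Üäik>"))
print(strip_punctuation("«Üäik» ¿qué?"))
```

Other symbol categories, such as currency signs (Sc), are kept by this table; pass a different prefixes tuple if those should go as well.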

  • 2021-01-27 14:08

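    Try the string module (this removes ASCII punctuation only):

    import string, re
    text = u"<Üäik>"
    # Build a character class from the escaped ASCII punctuation characters
    out = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    print out
    print type(out)
    

    This prints: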

    Üäik
    <type 'unicode'>
    