I'm trying to remove the punctuation from a unicode string, which may contain non-ascii letters. I tried using the regex module:
import regex
text
\p{P} matches punctuation characters. Among the ASCII characters, those are
! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }
< and > are not punctuation characters, so they won't be removed.
Try this instead:
regex.sub('[\p{P}<>]+', '', text)
< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match either:
regex.sub('[\p{P}\p{Sm}]+', '', text)
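To check which category a given character falls into, the standard-library unicodedata module can report it directly. This is only a small sketch in Python 3 syntax; the sample characters and the names samples and categories are chosen here for illustration.

```python
import unicodedata

# A few sample characters; the selection is illustrative only.
samples = ["<", ">", "!", "-", "$"]

# Map each character to its two-letter Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(ch, cat)
```

This reports Sm for < and >, Po and Pd for ! and -, and Sc (currency symbol) for $, so a character class that lists only \p{P} leaves <, > and $ alone.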
The unicode.translate() method exists too and takes a dictionary mapping integer numbers (codepoints) to either other integer codepoints, a unicode character, or None; None removes that codepoint. Map string.punctuation to codepoints with ord():
text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
That removes only the limited set of ASCII punctuation characters.
Demo:
>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub('[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik
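The ASCII-only limitation of string.punctuation is easy to see with a string that also contains non-ASCII punctuation. A minimal sketch in Python 3 syntax; the sample string and the names ascii_table, sample and result are made up for this example.

```python
import string

# Translation table that deletes only the ASCII punctuation characters.
ascii_table = dict.fromkeys(ord(c) for c in string.punctuation)

# Guillemets and the ellipsis are punctuation, but not ASCII.
sample = "«Üäik»…!"
result = sample.translate(ascii_table)

print(result)  # «Üäik»…
```

Only the trailing ! is removed; the guillemets and the ellipsis are not in string.punctuation and pass through untouched.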
If string.punctuation is not enough, then you can generate a complete unicode.translate() mapping for all P and Sm codepoints by iterating from 0 to sys.maxunicode, then testing those values against unicodedata.category():
>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik
(For Python 3, replace unicode with str, unichr() with chr(), and print ... with print(...).)
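For reference, here is roughly what the full-table approach looks like as a reusable helper in Python 3. Treat it as a sketch: build_removal_table, TO_REMOVE and strip_punctuation are hypothetical names introduced here, and the default category prefixes are simply the P and Sm choice discussed above.

```python
import sys
import unicodedata


def build_removal_table(prefixes=("P", "Sm")):
    """Map every codepoint whose category starts with a prefix to None."""
    return dict.fromkeys(
        i
        for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith(prefixes)
    )


# Scanning the whole codepoint range is slow, so build the table once.
TO_REMOVE = build_removal_table()


def strip_punctuation(text):
    """Delete all punctuation (P*) and math symbol (Sm) characters."""
    return text.translate(TO_REMOVE)


print(strip_punctuation("<Üäik>"))  # Üäik
```

Building the table walks the entire codepoint range, which is why it is kept at module level and reused instead of being rebuilt on every call.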
Try the string module:
import string, re
text = u"<Üäik>"
out = re.sub('[%s]' % re.escape(string.punctuation), '', text)
print out
print type(out)
This prints:
Üäik
<type 'unicode'>