Best way to strip punctuation from a string

Front end | Unresolved | 26 | 1820
日久生厌 2020-11-21 05:39

It seems like there should be a simpler way than:

import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",…
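A minimal Python 3 sketch of a translate-based approach, assuming only the ASCII marks in string.punctuation need to be removed. string.maketrans exists only in Python 2, so str.maketrans is used here instead; the helper name strip_ascii_punctuation is hypothetical.

```python
import string


def strip_ascii_punctuation(text: str) -> str:
    """Delete every character listed in string.punctuation (ASCII only)."""
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


s = "string. With. Punctuation?"  # Sample string
out = strip_ascii_punctuation(s)
print(out)  # string With Punctuation
```

Non-ASCII marks such as "。" pass through unchanged, because they are not part of string.punctuation.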


        
26 answers
  •  北恋 (original poster)
    2020-11-21 05:53

    string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?

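    import regex  # third-party module (pip install regex); supports \p{...} classes
    s = "string. With. Some・Really Weird、Non?ASCII。 「(Punctuation)」?"
    # Match runs of control (C), mark (M), punctuation (P), symbol (S) and separator (Z) characters
    remove = regex.compile(r'[\p{C}\p{M}\p{P}\p{S}\p{Z}]+', regex.UNICODE)
    # Replace each run with a single space, then trim both ends
    out = remove.sub(" ", s).strip()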
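To illustrate the point about string.punctuation, here is a quick standard-library check of which marks in the sample string it does not cover. The variable name missed is just for illustration.

```python
import string

s = "string. With. Some・Really Weird、Non?ASCII。 「(Punctuation)」?"

# Collect characters that are neither letters/digits nor whitespace,
# and that string.punctuation does not list.
missed = sorted(
    {ch for ch in s
     if not ch.isalnum() and not ch.isspace() and ch not in string.punctuation}
)
print(missed)  # ['、', '。', '「', '」', '・']
```

The ASCII marks in the sample (".", "?", "(", ")") are covered; the five marks printed above are not.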
    

    Personally, I believe this is the best way to remove punctuation from a string in Python because:

    • It removes all Unicode punctuation
    • It's easily modifiable, e.g. you can remove the \p{S} if you want to remove punctuation, but keep symbols like $.
    • You can get really specific about what you want to keep and what you want to remove, for example \p{Pd} will only remove dashes.
    • This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.
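The category-by-category control described in the list above can also be sketched with only the standard library. Note that this is a different technique from the regex pattern: it looks up each character with unicodedata.category rather than matching \p{...} classes. The function name strip_categories and its prefixes parameter are hypothetical, and the Unicode tables bundled with Python may differ slightly from those in the regex module.

```python
import unicodedata


def strip_categories(text, prefixes=("C", "M", "P", "S", "Z")):
    """Replace each run of characters whose Unicode general category
    starts with one of `prefixes` by a single space, then trim the ends."""
    prefixes = tuple(prefixes)
    pieces = []
    in_run = False  # True while inside a run of characters being removed
    for ch in text:
        if unicodedata.category(ch).startswith(prefixes):
            if not in_run:
                pieces.append(" ")
                in_run = True
        else:
            pieces.append(ch)
            in_run = False
    return "".join(pieces).strip()


# Default: controls, marks, punctuation, symbols and separators all go.
print(strip_categories("string. With. Punctuation?"))  # string With Punctuation

# Drop "S" from the prefixes to keep symbols such as $.
print(strip_categories("cost: $5 - ok", prefixes=("C", "M", "P", "Z")))  # cost $5 ok

# Use the two-letter category "Pd" to touch only dashes.
print(strip_categories("well-known", prefixes=("Pd",)))  # well known
```

Tabs and line breaks fall under category C, so the default call also collapses them into single spaces.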

    This uses Unicode character properties, which you can read more about on Wikipedia.
