Best way to strip punctuation from a string

日久生厌 2020-11-21 05:39

It seems like there should be a simpler way than:

import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",


        
26 Answers
  • 2020-11-21 05:50
    import re
    s = "string. With. Punctuation?" # Sample string 
    out = re.sub(r'[^a-zA-Z0-9\s]', '', s)
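
    One thing to watch with the pattern above: the class keeps only ASCII letters, digits, and whitespace, so accented letters and underscores are removed along with the punctuation. A minimal sketch of the difference, assuming you would rather keep every character Python's `re` treats as a word character (`\w`); the sample string is made up for illustration:

```python
import re

# Hypothetical sample with accented letters and an underscore.
s = "café, naïve! snake_case?"

# Pattern from the answer above: anything outside ASCII alphanumerics
# and whitespace is dropped, including 'é', 'ï' and '_'.
ascii_only = re.sub(r'[^a-zA-Z0-9\s]', '', s)

# Swapped-in alternative: \w matches Unicode letters, digits and '_'
# for str patterns in Python 3, so only non-word characters are dropped.
unicode_aware = re.sub(r'[^\w\s]', '', s)

print(ascii_only)     # caf nave snakecase
print(unicode_aware)  # café naïve snake_case
```

    Which of the two is right depends on whether the input is guaranteed to be ASCII.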
    
  • 2020-11-21 05:50

    Just as an update, I rewrote @Brian's example in Python 3 and moved the regex compile step inside the function. My aim was to time every step needed to make the function work. For example, with distributed computing you might not be able to share a regex object between workers, so each worker would need its own re.compile step. I was also curious to time two different ways of calling maketrans in Python 3:

    table = str.maketrans({key: None for key in string.punctuation})
    

    vs

    table = str.maketrans('', '', string.punctuation)
    

    I also added another method that uses a set: it intersects the characters of the string with the punctuation set, so the loop only runs over the punctuation marks that actually occur in the input.

    This is the complete code:

    import re, string, timeit
    
    s = "string. With. Punctuation"
    
    
    def test_set(s):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in s if ch not in exclude)
    
    
    def test_set2(s):
        _punctuation = set(string.punctuation)
        for punct in set(s).intersection(_punctuation):
            s = s.replace(punct, ' ')
        return ' '.join(s.split())
    
    
    def test_re(s):  # From Vinko's solution, with fix.
        regex = re.compile('[%s]' % re.escape(string.punctuation))
        return regex.sub('', s)
    
    
    def test_trans(s):
        table = str.maketrans({key: None for key in string.punctuation})
        return s.translate(table)
    
    
    def test_trans2(s):
        table = str.maketrans('', '', string.punctuation)
        return s.translate(table)
    
    
    def test_repl(s):  # From S.Lott's solution
        for c in string.punctuation:
            s = s.replace(c, "")
        return s
    
    
    print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
    print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
    print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
    print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
    print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
    print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))
    

    These are my results (total seconds for 1,000,000 calls):

    sets      : 3.1830138750374317
    sets2      : 2.189873124472797
    regex     : 7.142953420989215
    translate : 4.243278483860195
    translate2 : 2.427158243022859
    replace   : 4.579746678471565
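
    Two things may be worth checking before reading too much into these timings. The sketch below is not part of the benchmark above; the helper names `delete_punctuation` and `space_out_punctuation` are hypothetical stand-ins for `test_trans2` and `test_set2`. It suggests that the two `maketrans` spellings build the same table, and that the set-intersection variant does a slightly different job from the others because it substitutes a space and collapses whitespace:

```python
import string

# The two spellings of the deletion table compared above.
table_from_dict = str.maketrans({key: None for key in string.punctuation})
table_from_args = str.maketrans('', '', string.punctuation)

# Both map each punctuation code point to None, so the dicts compare equal.
tables_match = table_from_dict == table_from_args


def delete_punctuation(text):
    """Delete punctuation outright (mirrors test_trans2)."""
    return text.translate(table_from_args)


def space_out_punctuation(text):
    """Replace punctuation with a space, then collapse whitespace (mirrors test_set2)."""
    for mark in set(text).intersection(string.punctuation):
        text = text.replace(mark, ' ')
    return ' '.join(text.split())


sample = "well-known e.g. this"
print(tables_match)                   # True
print(delete_punctuation(sample))     # wellknown eg this
print(space_out_punctuation(sample))  # well known e g this
```

    On the benchmark string both approaches happen to give the same result, but on input where punctuation sits between letters they do not, so the fastest entry is not a drop-in replacement for the rest.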
    
  • 2020-11-21 05:50
    punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]

    # Read the whole file into a single string.
    with open('one.txt', 'r') as myFile:
        str1 = myFile.read()

    print(str1)

    # Replace each punctuation mark with a space.
    for i in punctuation:
        str1 = str1.replace(i, " ")

    print(str1)

    # split() with no argument drops the empty strings left by repeated spaces.
    myList = str1.split()
    for i in myList:
        print(i)
        print("____________")
    
  • 2020-11-21 05:51

    I usually use something like this:

    >>> s = "string. With. Punctuation?" # Sample string
    >>> import string
    >>> for c in string.punctuation:
    ...     s= s.replace(c,"")
    ...
    >>> s
    'string With Punctuation'
    
  • 2020-11-21 05:53

    string.punctuation covers only ASCII punctuation, so it misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?

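    import regex  # third-party package (pip install regex), not the stdlib re module
    s = "string. With. Some・Really Weird、Non?ASCII。 「(Punctuation)」?"
    # Python 3 syntax: plain r'' literal. A | inside [...] is a literal pipe, which \p{S} already covers.
    remove = regex.compile(r'[\p{C}\p{M}\p{P}\p{S}\p{Z}]+', regex.UNICODE)
    remove.sub(" ", s).strip()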
    

    Personally, I believe this is the best way to remove punctuation from a string in Python because:

    • It removes all Unicode punctuation
    • It's easily modifiable, e.g. you can remove the \p{S} if you want to remove punctuation, but keep symbols like $.
    • You can get really specific about what you want to keep and what you want to remove, for example \p{Pd} will only remove dashes.
    • This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.

    This uses Unicode character properties, which you can read more about on Wikipedia.
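
    If installing the third-party regex package is not an option, the same character properties are available through the standard library's unicodedata module. This is a minimal sketch of a swapped-in technique, not the approach above: the function name is hypothetical, it removes only the punctuation categories (those starting with P), and it does not normalize whitespace.

```python
import unicodedata


def strip_unicode_punctuation(text):
    """Drop every character whose Unicode general category starts with 'P'.

    Hypothetical helper for illustration. Symbols such as '$' (category Sc)
    are kept, matching the "remove \\p{S}" variation described above.
    """
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('P')
    )


print(strip_unicode_punctuation("Hello, world!"))  # Hello world
print(strip_unicode_punctuation("「日本語」。"))      # 日本語
print(strip_unicode_punctuation("$5.00"))          # $500
```

    Extending the check to other categories is a matter of testing for more prefixes, for example ('P', 'S') to drop symbols as well.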

  • 2020-11-21 05:53

    Here is a function I wrote. It's not very efficient, but it is simple and you can add or remove any punctuation that you desire:

    def stripPunc(wordList):
        """Strips punctuation from list of words"""
        puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
        for punc in puncList:
            # One pass over the list per punctuation mark is enough.
            wordList = [word.replace(punc, '') for word in wordList]
        return wordList
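
    For comparison, here is a sketch of the same idea built on str.translate, which handles all the marks in a single pass per word. It assumes Python 3; the names `PUNCTUATION`, `DELETE_TABLE` and `strip_punc` are made up for this example, and the character set simply mirrors the list above so it stays just as easy to edit.

```python
# Same marks as the list above, written as one string so it is easy to edit.
PUNCTUATION = '.;:!?/\\,#@$&)("'

# Translation table that maps each listed character to None (i.e. deletes it).
DELETE_TABLE = str.maketrans('', '', PUNCTUATION)


def strip_punc(word_list):
    """Return a new list with the listed punctuation removed from each word."""
    return [word.translate(DELETE_TABLE) for word in word_list]


print(strip_punc(["Hello,", "world!", "(ok)"]))  # ['Hello', 'world', 'ok']
```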
    