How to remove special characters from txt files using Python

南笙 2021-01-06 04:32
from glob import glob
pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)
def countwords(fp):
    with open(fp) as fh:
        return len(fh.rea         


        
3 Answers
  • 2021-01-06 05:02

    I'm pretty new and I doubt this is very elegant, but one option is to take your string(s) after reading them in and run them through str.translate() to strip out the punctuation. Here is the Python documentation for it for version 2.7 (which I think you're using).

    As for the actual code, it might be something like this (but maybe someone better than me can confirm or improve on it):

    import string
    fileString = fileString.translate(None, string.punctuation)  # strings are immutable, so keep the result
    

    where "fileString" is the string that you read in from open(fp). None is passed in place of a translation table (which would normally be used to change some characters into others), and the second argument, string.punctuation (a Python string constant containing all the ASCII punctuation symbols), is the set of characters that will be deleted from your string. Note that this two-argument form only works on Python 2 byte strings.

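    In the event that the above doesn't work, you could build a translation table instead. string.maketrans() needs two strings of equal length, so it can't map a character to nothing; this version replaces each punctuation character with a space:

    import string
    inChars = string.punctuation
    outChars = ' ' * len(inChars)  # one space per punctuation character
    translateTable = string.maketrans(inChars, outChars)
    fileString = fileString.translate(translateTable)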
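
Both snippets above are written for Python 2.7. In case you're on Python 3, here is a minimal sketch of the same idea. The helper name strip_punctuation and the sample string are made up for illustration; str.maketrans with three arguments treats the third as a set of characters to delete.

```python
import string

# Python 3: str.maketrans(x, y, z) maps every character in z to None,
# which means str.translate() deletes it.
DELETE_PUNCT = str.maketrans('', '', string.punctuation)


def strip_punctuation(text):
    """Return text with all ASCII punctuation removed (hypothetical helper)."""
    return text.translate(DELETE_PUNCT)


sample = "Hello, world! It's 3 o'clock."
print(strip_punctuation(sample))
```

Note that string.punctuation only covers ASCII symbols, so curly quotes and other non-ASCII punctuation would be left in place.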
    

    There are a couple of other answers to similar questions that I found via a quick search. I'll list them here, too, in case you can get more from them.

    Removing Punctuation From Python List Items

    Remove all special characters, punctuation and spaces from string

    Strip Specific Punctuation in Python 2.x


    Finally, if what I've said is completely wrong, please comment and I'll remove it so that others don't try it and become frustrated.

  • 2021-01-06 05:06
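    import re
    with open('a.txt') as src:
        text = src.read()  # avoid the name "string", which shadows the stdlib module
    # keep letters, digits, newlines and periods; replace everything else with a space
    new_str = re.sub(r'[^a-zA-Z0-9\n.]', ' ', text)
    with open('b.txt', 'w') as dst:
        dst.write(new_str)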
    

    This replaces every character that is not a letter, digit, newline, or period with a space.
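
To see what that pattern does without touching any files, here is a small sketch on an in-memory string. The names clean_text and count_words are hypothetical, and counting words with split() is only my guess at what the question's countwords function is meant to do.

```python
import re

# Anything that is not an ASCII letter, a digit, a newline, or a period.
NOT_ALLOWED = re.compile(r'[^a-zA-Z0-9\n.]')


def clean_text(text):
    """Replace every disallowed character with a single space."""
    return NOT_ALLOWED.sub(' ', text)


def count_words(text):
    """Count whitespace-separated words after cleaning."""
    # split() with no arguments collapses runs of spaces and newlines,
    # so the extra spaces left behind by clean_text() do not matter.
    return len(clean_text(text).split())


sample = "a,b.c\nd!"
print(repr(clean_text(sample)))
print(count_words(sample))
```

Because the period is kept, "b.c" still counts as one word here; drop the period from the character class if that is not what you want.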

  • 2021-01-06 05:12
    import re
    

    Then replace

    [uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
    

    with

    [uniquewords.add(re.sub(r'[^a-zA-Z0-9]*$', '', x)) for x in open(os.path.join(root,name)).read().split()]
    

    This will strip all trailing non-alphanumeric characters from each word before adding it to the set.
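
Here is a rough standalone sketch of the same trailing-strip idea, using a set comprehension rather than a list comprehension run for its side effects. The function name unique_words and the sample text are made up; the regex is the one from the answer, with + instead of * so it only matches when there is something to remove.

```python
import re

# One or more non-alphanumeric characters at the very end of a word.
TRAILING_JUNK = re.compile(r'[^a-zA-Z0-9]+$')


def unique_words(text):
    """Return the set of words in text with trailing punctuation stripped."""
    stripped = (TRAILING_JUNK.sub('', word) for word in text.split())
    # Drop tokens that were nothing but punctuation (e.g. "--").
    return {word for word in stripped if word}


sample = "end. end, (start) -- end"
print(unique_words(sample))
```

Only trailing characters are removed, so a token like "(start)" keeps its leading parenthesis and ends up as "(start".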
