How to remove special characters except spaces from a file in Python?

误落风尘 2021-02-19 03:49

I have a huge corpus of text (line by line) and I want to remove special characters but preserve the spaces and the structure of each string.

hello? there A-Z-R_T(,**)         


        
5 Answers
  • 2021-02-19 04:12

    I think nfn neil's answer is great, but I would just add a simple regex that removes all non-word characters. Note, however, that it treats the underscore as part of a word, so underscores are kept.

    import re
    print(re.sub(r'\W+', ' ', string))
    >>> hello there A Z R_T world welcome to python
    
  • 2021-02-19 04:13

    Create a dictionary mapping special characters to None

    d = {c:None for c in special_characters}
    

    Make a translation table using the dictionary. Read the entire text into a variable and use str.translate on the entire text.
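
    The steps above can be sketched as follows. This is a minimal example, assuming that "special characters" means ASCII punctuation (string.punctuation); that choice and the sample text are illustrative, not part of the original answer.

```python
import string

# Assumption: "special characters" means ASCII punctuation; adjust as needed.
special_characters = string.punctuation

# Map each special character to None so that str.translate deletes it.
d = {c: None for c in special_characters}
table = str.maketrans(d)  # converts the one-character keys to code points

text = "hello? there A-Z-R_T(,**)"
cleaned = text.translate(table)
print(cleaned)  # hello there AZRT
```

    Note that deleting the characters joins A-Z-R_T into AZRT; mapping each character to " " instead of None would keep the pieces apart.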

  • 2021-02-19 04:21

    You can try this:

    import re
    sentence = '''hello? there A-Z-R_T(,**), world, welcome to python. this **should? the next line#followed- by@ an#other %million^ %%like $this.'''
    res = re.sub('[!,*)@#%(&$_?.^]', '', sentence)
    print(res)
    

    The character class passed to re.sub (the part in square brackets, e.g. '["]') is where you list the symbols you want to remove.
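
    If the list of symbols changes often, one way to keep the character class valid is to build it with re.escape. This is only a sketch; symbols_to_remove and the sample string are hypothetical names chosen for illustration.

```python
import re

# Hypothetical list of symbols to strip; edit it to suit your data.
symbols_to_remove = '!,*)@#%(&$_?.^"'

# re.escape backslash-escapes characters that are special in a pattern,
# so the class stays valid whichever symbols are listed.
pattern = re.compile("[" + re.escape(symbols_to_remove) + "]")

sample = 'say "hi"? (now)'
result = pattern.sub("", sample)
print(result)  # say hi now
```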

  • 2021-02-19 04:25

    You can use this pattern, too, with regex:

    import re
    a = '''hello? there A-Z-R_T(,**), world, welcome to python.
    this **should? the next line#followed- by@ an#other %million^ %%like $this.'''
    
    for k in a.split("\n"):
        print(re.sub(r"[^a-zA-Z0-9]+", ' ', k))
        # Or:
        # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k))
        # print(final)
    

    Output:

    hello there A Z R T world welcome to python 
    this should the next line followed by an other million like this 
    

    Edit:

    Otherwise, you can store the final lines into a list:

    final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")]
    print(final)
    

    Output:

    ['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
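
    Since the question mentions a huge corpus stored in a file, the same substitution can be applied while streaming the file line by line instead of loading it all at once. The sketch below assumes UTF-8 text; clean_lines, clean_file and the path arguments are hypothetical names, not part of the original answer.

```python
import re

# Same pattern as above, compiled once for reuse across many lines.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")

def clean_lines(lines):
    """Yield each line with runs of non-alphanumeric characters turned into one space."""
    for line in lines:
        # Strip the newline first so it is not itself turned into a trailing space.
        yield NON_ALNUM.sub(" ", line.rstrip("\n"))

def clean_file(src_path, dst_path):
    """Stream src_path to dst_path one line at a time (paths are placeholders)."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for cleaned in clean_lines(src):
            dst.write(cleaned + "\n")
```

    Be aware that this pattern also collapses runs of several spaces into one, so exact spacing inside a line is not preserved.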
    
  • 2021-02-19 04:33

    A more elegant solution would be

    print(re.sub(r"\W+|_", " ", string))

    >>> hello there A Z R T world welcome to python this should the next line followed by an other million like this

    Here, re is the regular expression module in Python

    re.sub replaces every match of the pattern with a space, i.e., " "

    r'' marks a raw string literal, so backslashes (as in \W) reach the regex engine unchanged

    \W matches any non-word character, i.e. special characters such as *&^%$; the underscore _ counts as a word character, so \W does not match it

    + matches one or more repetitions (whereas * matches zero or more)

    | is logical OR

    _ stands for underscore
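
    To see what each part of the pattern contributes, here is a small comparison on an illustrative sample string (the variable names are made up for this sketch):

```python
import re

sample = "A-Z-R_T(,**) ok"

# \W+ alone: the underscore is a word character, so R_T stays joined.
keeps_underscore = re.sub(r"\W+", " ", sample)
print(keeps_underscore)   # A Z R_T ok

# \W+|_ : the extra alternative also replaces each underscore.
drops_underscore = re.sub(r"\W+|_", " ", sample)
print(drops_underscore)   # A Z R T ok

# [\W_]+ : a single character class that also merges mixed runs
# such as "_-" into one space.
merged_runs = re.sub(r"[\W_]+", " ", sample)
print(merged_runs)        # A Z R T ok
```

    The last two differ only when an underscore sits next to other special characters: "a_-b" becomes "a  b" (two spaces) with \W+|_ but "a b" with [\W_]+.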
