I have a huge corpus of text (line by line), and I want to remove special characters while preserving the spaces and structure of each string.
hello? there A-Z-R_T(,**)
A more elegant solution would be
import re
print(re.sub(r"\W+|_", " ", string))  # string holds the input text
>>> hello there A Z R T world welcome to python this should the next line followed by another million like this
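Here,
re is Python's regular expression module.
re.sub replaces every match of the pattern with the replacement, here a single space " ".
r'' marks the pattern as a raw string, so a backslash such as the one in \W reaches the regex engine unchanged instead of being read as a Python escape sequence.
\W matches any non-word character, i.e. special characters such as *&^%$, but not the underscore _, which counts as a word character.
+ matches one or more repetitions of the preceding element (unlike *, which matches zero or more), so a run of special characters collapses into a single space.
| is a logical OR between the two alternatives.
_ matches a literal underscore, which \W does not cover.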
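One thing to keep in mind: \W also matches whitespace, including newlines, so running the pattern over a whole multi-line string joins everything into one line. If the line structure of the corpus should survive, a minimal sketch is to apply the same pattern to each line separately. The helper names clean_line and clean_corpus are made up for this illustration, and the second sample line is invented:

```python
import re

# Same pattern as in the answer: a run of non-word characters, or one underscore.
PATTERN = re.compile(r"\W+|_")

def clean_line(line):
    """Replace special characters in a single line with spaces."""
    return PATTERN.sub(" ", line)

def clean_corpus(text):
    """Clean each line on its own so the line breaks are kept."""
    return "\n".join(clean_line(line) for line in text.splitlines())

sample = "hello? there A-Z-R_T(,**)\nnext_line!"
print(clean_corpus(sample))
```

Each replaced run leaves a single space behind, so a line that ends in special characters ends in a trailing space; call .strip() on the result if that matters.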