Python remove punctuation from a text file

Posted by 拥有回忆 on 2019-12-12 18:27:54

Question


I'm trying to remove a list of punctuation characters from my text file, but I have one problem with words joined by a hyphen. For example, if I have the word "post-trauma" I get "posttrauma", whereas I want to get "post" "trauma".

My code is:

import re

punct = ['!', '#', '"', '%', '$', '&', ')', '(', '+', '*', '-']

with open(myFile, "r") as f:
    text = f.read()

# REMOVE_LIST is the list of words to remove
remove = '|'.join(REMOVE_LIST)
regex = re.compile(r'(' + remove + r')', flags=re.IGNORECASE)
out = regex.sub("", text)

delta = " ".join(out.split())  # collapse runs of whitespace
txt = "".join(c for c in delta if c not in punct)

Is there a way to solve it?


Answer 1:


I believe you can just call the built-in str.replace method on delta, so your last line would become the following:

txt = "".join(c for c in delta.replace("-", " ") if c not in punct )

This means all the hyphens in your text will become spaces, so the words will be treated as if they were separate.
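As a minimal sketch of that one-line change, the snippet below wraps it in a function and runs it on a sample string. The function name strip_punct and the sample text are hypothetical, introduced only for illustration; the punct list is the one from the question.

```python
# The punct list from the question (the hyphen is still included).
punct = ['!', '#', '"', '%', '$', '&', ')', '(', '+', '*', '-']

def strip_punct(delta):
    """Hypothetical helper: hyphens become spaces, other punctuation is dropped."""
    # replace() runs before the filter, so no hyphen is left for the filter to delete.
    return "".join(c for c in delta.replace("-", " ") if c not in punct)

print(strip_punct("post-trauma (stress)!"))  # post trauma stress
```

Because the hyphens are converted before the filter runs, this works even though "-" is still listed in punct.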




Answer 2:


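The above method only works because the hyphens are replaced before the filter runs; the dash ("-") is still in the punct list, so any hyphen that reached the filter would be deleted from the initial string. To make the intent explicit, remove it from the list punct. The updated code looks like this: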

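import re

# "-" is no longer in this list; it is handled separately below
punct = ['!', '#', '"', '%', '$', '&', ')', '(', '+', '*']

with open(myFile, "r") as f:
    text = f.read()

# REMOVE_LIST is the list of words to remove
remove = '|'.join(REMOVE_LIST)
regex = re.compile(r'(' + remove + r')', flags=re.IGNORECASE)
out = regex.sub("", text)

delta = " ".join(out.split())  # collapse runs of whitespace
# Hyphens become spaces; every other character in punct is dropped
txt = "".join(c for c in delta.replace("-", " ") if c not in punct)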

The problem comes from the fact that you are replacing all the characters in punct with an empty string, while you want a space for the "-". Thus, you need two replacement steps: one that turns "-" into a space, and one that replaces the remaining punctuation with an empty string.
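The two replacements can also be folded into a single pass. The sketch below is an alternative to the code in this answer, not the answerer's method: it uses str.maketrans and str.translate from the standard library. The names table and clean, and the sample text, are hypothetical.

```python
# Punctuation to delete outright (same characters as punct above, without "-").
punct = '!#"%$&)(+*'

# One translation table: "-" maps to a space, every character in punct maps to nothing.
table = str.maketrans("-", " ", punct)

def clean(text):
    """Hypothetical helper: apply the table, then collapse runs of whitespace."""
    return " ".join(text.translate(table).split())

print(clean("post-trauma & co-op (test)!"))  # post trauma co op test
```

Collapsing whitespace after the translation, rather than before it, also removes the double spaces left behind when a standalone symbol such as "&" is deleted.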



Source: https://stackoverflow.com/questions/41225435/python-remove-punctuation-from-a-text-file
