Remove all punctuation from string, except if it's between digits

Submitted by 前提是你 on 2020-08-09 08:00:29

Question


I have a text that contains words and numbers. I'll give a representative example of the text:

string = "This is a 1example of the text. But, it only is 2.5 percent of all data"

I'd like to convert it to something like:

"This is a  1 example of the text But it only is  2.5  percent of all data"

So I want to remove punctuation (. or , or anything else in string.punctuation) and also put a space between digits and words when they are concatenated, while keeping floats like the 2.5 in my example.

I used the following code:

import re

item = "This is a 1example of the text. But, it only is 2.5 percent of all data"
# Put a space before each capital letter, then normalize whitespace
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# This is a start, but not there yet!
#item = ' '.join([x.strip(string.punctuation) for x in item.split() if x not in string.digits])
# Split on runs of digits (this also splits 2.5 apart)
item = ' '.join(re.split(r'(\d+)', item))
print(item)

The result is :

 >> "This is a  1 example of the text. But, it only is  2 . 5  percent of all data"

I'm almost there but can't figure out that last piece.


Answer 1:


You can use regex lookarounds like this:

(?<!\d)[.,;:](?!\d)


The idea is to gather the punctuation you want to remove in a character class and use lookarounds so that it only matches punctuation with a digit neither directly before nor directly after it.

import re

regex = r"(?<!\d)[.,;:](?!\d)"

test_str = "This is a 1example of the text. But, it only is 2.5 percent of all data"

result = re.sub(regex, "", test_str)
print(result)

Result is:

This is a 1example of the text But it only is 2.5 percent of all data
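The result above still contains "1example". As a minimal sketch (not part of the original answer), the same lookaround idea can be widened to everything in string.punctuation and followed by a second pass that separates digits from letters. The function name clean and the constant PUNCT are my own, hypothetical names.

```python
import re
import string

# Character class covering every ASCII punctuation character.
PUNCT = "[" + re.escape(string.punctuation) + "]"


def clean(text):
    # Remove punctuation that has no digit directly before or after it.
    text = re.sub(r"(?<!\d)" + PUNCT + r"(?!\d)", "", text)
    # Put a space between a digit and a letter that touch each other.
    text = re.sub(r"(\d)([A-Za-z])", r"\1 \2", text)
    text = re.sub(r"([A-Za-z])(\d)", r"\1 \2", text)
    return text


print(clean("This is a 1example of the text. But, it only is 2.5 percent of all data"))
```

Note that a full stop directly after a number, as in "costs 5.", is kept by this pattern because a digit precedes it.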



Answer 2:


Okay folks, here is an answer (the best? I don't know, but it seems to work):

import re

item = "This is a 1example 2Ex of the text.But, it only is 2.5 percent of all data?"
# If two words are concatenated and the second starts with a capital letter
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# If a word starts with a digit, like "1example"
item = ' '.join(re.split(r'(\d+)([A-Za-z]+)', item))
# Magical line that removes punctuation apart from floats
item = re.sub(r'\S+', lambda m: re.match(r'^\W*(.*\w)\W*$', m.group()).group(1), item)
item = item.replace("  ", " ")
print(item)
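One caveat with the "magical line": re.match returns None for a token that contains no word character at all (a lone "-", for instance), so calling .group(1) on it raises an AttributeError. Below is a hedged sketch of a guarded variant. The function name strip_token_punctuation is hypothetical, and dropping such tokens entirely is my assumption about the desired behaviour.

```python
import re


def strip_token_punctuation(text):
    def strip_token(match):
        token = match.group()
        inner = re.match(r"^\W*(.*\w)\W*$", token)
        # A token without any word character (e.g. "-") gives no match: drop it.
        return inner.group(1) if inner else ""

    stripped = re.sub(r"\S+", strip_token, text)
    # Collapse the whitespace left behind by dropped tokens.
    return " ".join(stripped.split())


print(strip_token_punctuation("the text. But, - it is 2.5 percent?"))
```

Punctuation inside a token, such as the dot in 2.5, is left alone because only leading and trailing non-word characters are stripped.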



Answer 3:


I am out of touch with Python, but have some insight into regexps. May I suggest using alternation ("or")? I would use this regexp: "(\d+)([a-zA-Z])|([a-zA-Z])(\d+)", and then use "\1 \2" as the replacement string.
If some corner cases plague you, you can pass the match to a procedure and deal with the cases one by one, probably by checking whether "\1\2" can be converted to a float. Tcl has such built-in functionality; Python should too.
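In Python, the "procedure" can be a function passed to re.sub. A plain "\1 \2" replacement only covers the first alternative, since a match of the second one fills groups 3 and 4 instead. The sketch below picks whichever pair of groups matched. The names BOUNDARY and split_digits_and_letters are hypothetical.

```python
import re

BOUNDARY = re.compile(r"(\d+)([a-zA-Z])|([a-zA-Z])(\d+)")


def split_digits_and_letters(text):
    def add_space(match):
        # Only one alternative matched, so two of the four groups are None.
        left, right = [g for g in match.groups() if g is not None]
        return left + " " + right

    return BOUNDARY.sub(add_space, text)


print(split_digits_and_letters("This is a 1example, abc123 and 2.5"))
```

Because matches cannot overlap, a string such as "a1b" becomes "a 1b" after one pass. Running the substitution a second time separates the rest.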




Answer 4:


I tried this and it worked very well.

a = "This is a 1example of the text. But, it only is 2.5 percent of all data"
print(a.replace(". ", " ").replace(", ", " "))

Notice that in each replace call there is a space after the punctuation mark. I just replaced punctuation followed by a space with a single space.




Answer 5:


Code:

from itertools import groupby

s1 = "This is a 1example of the text. But, it only is 2.5 percent of all data"
s2 = [''.join(g) for _, g in groupby(s1, str.isalpha)]
s3 = ' '.join(s2).replace("   ", "  ").replace("  ", " ")

# You can keep adding a replace for each punctuation mark
s4 = s3.replace(". ", " ").replace(", "," ").replace("; "," ").replace(", "," ").replace("- "," ").replace("? "," ").replace("! "," ").replace(" ("," ").replace(") "," ").replace('" '," ").replace(' "'," ").replace('... '," ").replace('/ '," ").replace(' “'," ").replace('” '," ").replace('] '," ").replace(' ['," ")

s5 = s4.replace("  ", " ")
print(s5)

Output:

This is a 1 example of the text But it only is 2.5 percent of all data

P.S.: You can look up a list of punctuation marks and keep adding them as further .replace() calls.
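As an alternative to a growing chain of .replace() calls, here is a hedged sketch that keeps the groupby step but then drops every token consisting only of characters from string.punctuation. This swaps in a different filtering technique from the one in the answer, and clean_with_groupby is a hypothetical name.

```python
from itertools import groupby
import string


def clean_with_groupby(text):
    # Split into alternating runs of letters and non-letters, as in the answer.
    runs = ["".join(g) for _, g in groupby(text, str.isalpha)]
    tokens = " ".join(runs).split()
    # Drop tokens made up entirely of ASCII punctuation; "2.5" survives
    # because it also contains digits.
    punct = set(string.punctuation)
    kept = [t for t in tokens if not all(ch in punct for ch in t)]
    return " ".join(kept)


print(clean_with_groupby("This is a 1example of the text. But, it only is 2.5 percent of all data"))
```

string.punctuation only covers ASCII, so typographic quotes such as “ and ” would still need to be added to the set by hand.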




Answer 6:


This is a regex approach:

([^ ]?)(?:[^\P{punct}.]|(?<!\d)\.(?!\d))([^ ]?)

Replace in a callback:

if length of $1 > 0 and length of $2 > 0
    replace with $1 + space + $2
else
    replace with $1 + $2

Expanded

 ( [^ ]? )                     # (1)
 (?:
      [^\P{punct}.] 
   |  
      (?<! \d )
      \.
      (?! \d )
 )
 ( [^ ]? )                     # (2)

If you don't want to add logic for the characters adjacent to the punctuation, use (?:[^\P{punct}.]|(?<!\d)\.(?!\d)) and replace with nothing.
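Python's built-in re module does not support the \P{punct} property syntax, so the sketch below stands in a character class built from string.punctuation (ASCII only, which is my assumption about what counts as punctuation here). It shows both the plain removal and the callback variant. All names are hypothetical.

```python
import re
import string

# ASCII punctuation without the dot, standing in for [^\P{punct}.]
NON_DOT = "[" + re.escape(string.punctuation.replace(".", "")) + "]"
# Any non-dot punctuation, or a dot with no digit on either side.
PUNCT = "(?:" + NON_DOT + r"|(?<!\d)\.(?!\d))"
SPACED = re.compile("([^ ]?)" + PUNCT + "([^ ]?)")


def strip_punct(text):
    # Simple variant: replace every match with nothing.
    return re.sub(PUNCT, "", text)


def strip_punct_spaced(text):
    # Callback variant: insert a space when the punctuation sat between
    # two non-space characters, otherwise just drop it.
    def repl(match):
        before, after = match.group(1), match.group(2)
        if before and after:
            return before + " " + after
        return before + after

    return SPACED.sub(repl, text)


print(strip_punct("the text. But, it is 2.5 percent?!"))
print(strip_punct_spaced("the text.But, it is 2.5 percent"))
```

Adjacent punctuation marks (for example "?!") are not fully removed by the callback variant, because the second mark is captured as the neighbouring character. The simple variant does not have that problem.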



Source: https://stackoverflow.com/questions/43142710/remove-all-punctuation-from-string-except-if-its-between-digits
