I have a string that kind of looks like this:
"stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD"
and I want to strip off all p
result = rex.sub(' ', string) # this produces a string with tons of whitespace padding
result = rex.sub('', result) # this reduces all those spaces
Because you made a typo: the second call should use rex_s, not rex. You also need to substitute a single space back in rather than an empty string, or any multiple-space gap becomes no gap at all instead of a single-space gap.
result = rex.sub(' ', string) # this produces a string with tons of whitespace padding
result = rex_s.sub(' ', result) # this reduces all those spaces
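For reference, here is a self-contained sketch of the corrected two-step version. The question does not show how rex and rex_s were compiled, so both patterns below are assumptions chosen to match the comments (one replaces every non-word character, the other collapses runs of whitespace). The name text is used instead of string so the standard library module is not shadowed.

```python
import re

# Assumed patterns: the original definitions of rex and rex_s are not shown.
rex = re.compile(r'\W')      # any single character that is not a letter, digit or underscore
rex_s = re.compile(r'\s+')   # one or more whitespace characters

text = "stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD"

padded = rex.sub(' ', text)        # punctuation becomes spaces, leaving long runs of spaces
result = rex_s.sub(' ', padded)    # each run of whitespace becomes a single space
result = result.upper()

print(result)  # STUFF MORE STUFF STUFF DD
```

With a pattern like r'\W+' in the first step the second substitution would not be needed at all, which is what the single-call answers in this thread rely on.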
s = "$$$aa1bb2 cc-dd ee_ff ggg."
re.sub(r'\W+', ' ', s).upper()
# ' AA1BB2 CC DD EE_FF GGG '
Is _ punctuation?
re.sub(r'[_\W]+', ' ', s).upper()
# ' AA1BB2 CC DD EE FF GGG '
Don't want the leading and trailing space?
re.sub(r'[_\W]+', ' ', s).strip().upper()
# 'AA1BB2 CC DD EE FF GGG'
You can use a regular expression to collapse repeated whitespace. Whitespace is matched by \s, and \s+ means one or more whitespace characters.
import re
rex = re.compile(r'\s+')
test = " x y z z"
res = rex.sub(' ', test)
print(f">{res}<")
Note that \s also matches tabs, newlines, carriage returns and other whitespace characters, so those get collapsed into a single space as well.
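To make that note concrete, here is a small sketch contrasting \s+ with a pattern that only collapses literal spaces. The sample string and the variable names are made up for illustration.

```python
import re

sample = "a \t b\r\n\n c    d"   # hypothetical input mixing spaces, a tab and line breaks

# \s+ treats tabs, carriage returns and newlines as whitespace too.
all_whitespace = re.sub(r'\s+', ' ', sample)

# ' {2,}' only touches runs of two or more literal spaces.
spaces_only = re.sub(r' {2,}', ' ', sample)

print(repr(all_whitespace))  # 'a b c d'
print(repr(spaces_only))     # 'a \t b\r\n\n c d'
```

If line breaks need to survive, the second pattern (or something like r'[ \t]+') is probably the safer choice.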
Here's a single-step approach (but the uppercasing actually uses a string method -- much simpler!):
import re

rex = re.compile(r'\W+')
result = rex.sub(' ', strarg).upper()
where strarg is the string argument (please don't use names that shadow builtins or standard library modules).
Do you have to use regular expressions? Do you feel you must do it in one line?
>>> import string
>>> s = "stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD"
>>> s2 = ''.join(c for c in s if c in string.ascii_letters + ' ')  # string.letters on Python 2
>>> ' '.join(s2.split())
'stuff morestuff stuff DD'
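Another regex-free variant, sketched here for Python 3, deletes punctuation with str.translate and then collapses whitespace with split/join. This is not from the answer above, just the same idea using string.punctuation. Unlike the letter filter, it keeps digits, and it only removes ASCII punctuation.

```python
import string

s = "stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD"

# Build a table that deletes every ASCII punctuation character.
drop_punctuation = str.maketrans('', '', string.punctuation)

no_punct = s.translate(drop_punctuation)
# split() with no arguments splits on any run of whitespace, so joining
# the pieces with one space collapses the gaps.
result = ' '.join(no_punct.split()).upper()

print(result)  # STUFF MORESTUFF STUFF DD
```

Because the punctuation is deleted rather than replaced by a space, "more-stuff" becomes one word, just as in the output shown above.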
The following works in Python 3. It only collapses runs of the same whitespace character, so a tab and a space next to each other won't collapse into a single character.
def collapse_whitespace_characters(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            # drop a whitespace character only if it repeats the previous one
            if not cur_char.isspace() or cur_char != prev_char:
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret
This one collapses each run of whitespace into the first whitespace character it sees:
def collapse_whitespace(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            # keep non-whitespace, and whitespace that starts a new run
            if not cur_char.isspace() or \
                    (cur_char.isspace() and not prev_char.isspace()):
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret
>>> collapse_whitespace_characters('we like spaces and\t\t TABS\tAND WHATEVER\xa0\xa0IS')
'we like spaces and\t TABS\tAND WHATEVER\xa0IS'
>>> collapse_whitespace('we like spaces and\t\t TABS\tAND WHATEVER\xa0\xa0IS')
'we like spaces and\tTABS\tAND WHATEVER\xa0IS'
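For comparison, both behaviours can probably be had from a single re.sub call each, using a backreference. This is a sketch with made-up function names, not a drop-in replacement tested against every input the loops above handle.

```python
import re

def collapse_same_whitespace(text):
    # (\s)\1+ matches a whitespace character followed by one or more copies of itself.
    return re.sub(r'(\s)\1+', r'\1', text)

def collapse_any_whitespace(text):
    # (\s)\s* matches a whole run of whitespace; only its first character is kept.
    return re.sub(r'(\s)\s*', r'\1', text)

sample = 'we  like\t\t tabs\xa0\xa0too'
print(repr(collapse_same_whitespace(sample)))  # 'we like\t tabs\xa0too'
print(repr(collapse_any_whitespace(sample)))   # 'we like\ttabs\xa0too'
```

On Python 3 str patterns, \s is Unicode-aware, so the non-breaking space \xa0 is treated as whitespace here just as str.isspace() treats it in the loops.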
The same idea works for punctuation:
def collapse_punctuation(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            # drop a non-alphanumeric character only if it repeats the previous one
            if cur_char.isalnum() or cur_char != prev_char:
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret
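The same "keep one copy of each repeated non-alphanumeric character" rule can also be sketched with itertools.groupby, which groups consecutive equal characters. The function name and the sample string are made up for illustration.

```python
from itertools import groupby

def collapse_repeated_punctuation(text):
    pieces = []
    for char, run in groupby(text):
        if char.isalnum():
            pieces.append(''.join(run))   # keep repeated letters and digits as they are
        else:
            pieces.append(char)           # keep one copy of a repeated non-alphanumeric character
    return ''.join(pieces)

print(collapse_repeated_punctuation('wait... what??  aaa--bb'))  # wait. what? aaa-bb
```

Note that whitespace is not alphanumeric either, so repeated spaces are collapsed along with the punctuation.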
To actually answer the question:
orig = 'stuff . // : /// more-stuff .. .. ...$%$% stuff -> DD'
collapse_whitespace(''.join([(c.upper() if c.isalnum() else ' ') for c in orig]))
As said, the regexp would be something like:
re.sub(r'\W+', ' ', orig).upper()