I'm building a data extract using Scrapy and want to normalize a raw string pulled out of an HTML document. Here's an example string:
Sapphire RX460 OC
You can use a function like the one below, which uses a regular expression to find runs of consecutive spaces and replace each run with a single space:
import re

def clean_data(data):
    # Trim the ends, then collapse runs of two or more spaces into one.
    return re.sub(" {2,}", " ", data.strip())

product_title = clean_data(product.css('h3::text').extract_first())
You can then extend the cleaning function any way you like.
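As one possible extension, here's a rough sketch that also collapses tabs and newlines and tolerates a missing node (Scrapy's extract_first() returns None when the selector matches nothing). The name clean_text and the choice to return an empty string for None are my own assumptions, not anything from the question:

```python
import re
from typing import Optional

# Precompiled pattern: one or more whitespace characters of any kind.
_WHITESPACE_RUN = re.compile(r"\s+")


def clean_text(raw: Optional[str]) -> str:
    """Collapse whitespace runs to one space and trim both ends."""
    if raw is None:  # assumption: treat "no matching node" as empty text
        return ""
    return _WHITESPACE_RUN.sub(" ", raw).strip()


print(clean_text("  Sapphire   RX460\n\tOC  "))
```

Returning "" keeps downstream code free of None checks; raise or return None instead if a missing title should be treated as an error.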
You can use:
" ".join(s.split())
where s is your string.
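To see why this works: split() with no arguments splits on any run of whitespace and drops the empty pieces at both ends, so no separate strip() is needed. A small sketch, where normalize_spaces and the sample strings are made up for illustration:

```python
def normalize_spaces(s: str) -> str:
    # split() without a separator treats any whitespace run as one
    # delimiter and discards leading/trailing empty strings.
    return " ".join(s.split())


samples = ["  Sapphire   RX460   OC ", "Sapphire\tRX460\nOC", "   "]
for sample in samples:
    print(repr(normalize_spaces(sample)))
```

Note that this also turns tabs and newlines into single spaces, which may or may not be what you want for multi-line text.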
Instead of using a regex for this, a more efficient solution is the split/join approach. Observe:
>>> import re
>>> import timeit
>>> timeit.Timer((lambda:' '.join(' Sapphire RX460 OC 2/4GB'.split()))).timeit()
0.7263979911804199
>>> def f():
...     return re.sub(" +", ' ', " Sapphire RX460 OC 2/4GB").split()
...
>>> timeit.Timer(f).timeit()
4.163465976715088
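If you want to rerun the comparison yourself, something like the script below should do. It's only a sketch: SAMPLE, the function names, and the iteration count are my own choices, and the regex is precompiled so both candidates return the same normalized string before being timed. Absolute numbers will vary by machine and Python version.

```python
import re
import timeit

SAMPLE = "  Sapphire   RX460   OC  2/4GB "  # hypothetical input string
_SPACES = re.compile(r" +")


def with_split_join(s: str = SAMPLE) -> str:
    return " ".join(s.split())


def with_regex(s: str = SAMPLE) -> str:
    return _SPACES.sub(" ", s).strip()


def compare(number: int = 100_000) -> dict:
    """Return total seconds taken by `number` calls of each candidate."""
    return {
        "split_join": timeit.timeit(with_split_join, number=number),
        "regex": timeit.timeit(with_regex, number=number),
    }


# Sanity check: both approaches must agree before timing means anything.
assert with_split_join() == with_regex()
for name, seconds in compare().items():
    print(f"{name}: {seconds:.3f}s")
```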