I get some data from a webpage and read it like this in Python:
original_doc = urllib2.urlopen(url).read()
Sometimes the document at this URL contains characters such as é or ä that are not ASCII, and I would like to remove them. How can I strip out everything that is not ASCII?
This should work. It will strip out all characters that are not ASCII:
original_doc = (original_doc.decode('unicode_escape').encode('ascii','ignore'))
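For context, here is a minimal self-contained sketch of the same decode-then-encode round trip, written for Python 3 (where urllib.request.urlopen(url).read() returns bytes) and assuming the response is UTF-8. The sample string is hypothetical and stands in for the real download; note that it decodes with 'utf-8' rather than the 'unicode_escape' codec used above, which only makes sense if the payload contains literal backslash escapes.

# Hypothetical sample standing in for urllib.request.urlopen(url).read()
original_doc = 'résumé and café'.encode('utf-8')  # raw bytes from the server

# Decode the bytes to text, then re-encode as ASCII, telling the codec
# to silently drop every character it cannot represent.
ascii_only = original_doc.decode('utf-8', 'ignore').encode('ascii', 'ignore')

print(ascii_only)  # b'rsum and caf'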
Using re, you can sub out all characters that fall in a certain hexadecimal range:
>>> import re
>>> re.sub('[\x80-\xFF]','','é and ä and ect')
' and  and ect'
You can also do the inverse and sub out anything that is NOT among the basic 128 ASCII characters:
>>> re.sub('[^\x00-\x7F]','','é and ä and ect')
' and  and ect'
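The snippets above are Python 2, where the string being filtered is a byte string. A rough sketch of the same inverse-class substitution in Python 3, where it runs on a str and the class matches any code point above U+007F:

import re

text = 'é and ä and ect'

# Keep only characters in the basic 128-character ASCII range;
# anything outside U+0000-U+007F is replaced with the empty string.
ascii_only = re.sub(r'[^\x00-\x7F]', '', text)

print(ascii_only)  # ' and  and ect'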