Remove accented characters from string - Python

小蘑菇 2021-01-28 10:09

I get some data from a webpage and read it like this in Python:

original_doc = urllib2.urlopen(url).read()

Sometimes this url has characters su

2 answers
  • 2021-01-28 10:41

    This should work. It drops every character that is not ASCII, so accented letters are removed outright rather than replaced with their unaccented forms.

        original_doc = (original_doc.decode('unicode_escape').encode('ascii','ignore'))
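The line above relies on Python 2 byte strings. A minimal Python 3 sketch of the same idea follows, assuming the page is UTF-8 encoded; the function name `strip_non_ascii` and the `encoding` default are illustrative assumptions, not part of the original answer.

```python
def strip_non_ascii(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode downloaded bytes, then drop every non-ASCII character."""
    # Assumption: the page is UTF-8; undecodable bytes become U+FFFD,
    # which is non-ASCII and is therefore dropped in the next step.
    text = raw.decode(encoding, errors="replace")
    # 'ignore' silently discards anything that cannot be encoded as ASCII.
    return text.encode("ascii", "ignore").decode("ascii")


# Stand-in for the bytes returned by reading a URL.
sample = "é and ä".encode("utf-8")
print(repr(strip_non_ascii(sample)))
```

Decoding first and encoding second keeps the result a `str`, which is usually what later text processing expects in Python 3.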
    
  • 2021-01-28 10:41

    Using re, you can remove all characters that fall in a given hexadecimal range (here \x80 through \xFF, the values just above ASCII).

    >>> re.sub('[\x80-\xFF]','','é and ä and ect')
    ' and  and ect'
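One caveat worth illustrating: on Python 3 strings, this range only covers code points U+0080 through U+00FF, so characters beyond that block are left alone. A small sketch, where the helper name `drop_latin1_high` is hypothetical:

```python
import re

# Matches only code points U+0080..U+00FF when applied to a Python 3 str.
LATIN1_HIGH = re.compile('[\x80-\xFF]')


def drop_latin1_high(text):
    """Remove characters in the U+0080..U+00FF range and nothing else."""
    return LATIN1_HIGH.sub('', text)


# 'é' (U+00E9) and 'ä' (U+00E4) are inside the range and are removed.
print(repr(drop_latin1_high('é and ä')))
# 'ł' (U+0142) lies outside the range and is kept.
print(repr(drop_latin1_high('ł and é')))
```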
    

    You can also do the inverse and remove anything that's NOT one of the 128 basic ASCII characters:

    >>> re.sub('[^\x00-\x7F]','','é and ä and ect')
    ' and  and ect'
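Both patterns delete accented letters entirely. If the goal is instead to keep the base letter ('é' becoming 'e'), one common alternative, not taken from the answers here, is Unicode decomposition with the standard-library `unicodedata` module. A minimal sketch, with `strip_accents` as a hypothetical helper name:

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed."""
    # NFKD splits a precomposed letter such as 'é' into 'e' plus U+0301.
    decomposed = unicodedata.normalize('NFKD', text)
    # combining() is non-zero for accent marks, so those are filtered out.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents('é and ä and etc'))  # e and a and etc
```

Letters that have no decomposition, such as 'ł' or 'ø', pass through unchanged, so this may need to be combined with one of the patterns above if strictly ASCII output is required.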
    