Replace special characters in python

迷失自我 2021-01-06 14:15

I have some text coming from the web as such:

Â£6.49

Obviously I would like this to be displayed as:

£6.49

I have tried the following so far:

2 Answers
  • 2021-01-06 15:12

    Edit: your objects are already unicode, so it seems to me there is no reason to use encode/decode at all.

    >>> print u'Oscar Winners Best Pictures Box Set \xc2\xa36.49'.replace(u'Â','')
    Oscar Winners Best Pictures Box Set £6.49
    

    However, it seems to me that something is wrong here. The unicode objects do not actually hold decoded text; a plain byte string with the same contents decodes cleanly as UTF-8:

    >>> print 'Oscar Winners Best Pictures Box Set \xc2\xa36.49'.decode('utf8')
    Oscar Winners Best Pictures Box Set £6.49
    

    The repr() you posted should not be that of a unicode object. That is why I was asking where you are getting the data from: something is wrong upstream.
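
A minimal sketch of how this kind of garbling typically arises, written for Python 3 (an assumption; the transcripts in this thread are Python 2). The variable names are illustrative only. The same bytes are decoded once with the wrong codec and once with the right one:

```python
# UTF-8 bytes as they might arrive from the web (illustrative value).
raw = b'Oscar Winners Best Pictures Box Set \xc2\xa36.49'

# Wrong codec: latin-1 maps every byte to one character,
# so the two-byte UTF-8 sequence becomes two separate characters.
wrong = raw.decode('latin-1')

# Right codec: the two bytes \xc2\xa3 are read as one code point, U+00A3.
right = raw.decode('utf-8')

print(repr(wrong))
print(repr(right))
```

The wrongly decoded string is one character longer than the correct one, which is the stray character seen in front of the pound sign.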

  • 2021-01-06 15:13

    If s=url['title'] makes s equal to this:

    In [48]: s=u'Oscar Winners Best Pictures Box Set \xc2\xa36.49'
    

    Then the problem is either

    1. in the code that defines url, or
    2. in the content from the web, which is malformed.

    If Case 1, we'd need to see the code that defines url.

    If Case 2, a quick-and-dirty workaround would be to encode the unicode object s with the raw-unicode-escape codec:

    In [49]: print(s)
    Oscar Winners Best Pictures Box Set Â£6.49
    
    In [50]: print(s.encode('raw-unicode-escape'))
    Oscar Winners Best Pictures Box Set £6.49
    

    See also this SO question.


    Regarding titles like s=u'Star Trek XI &#xA3;3.99': again, it would be nice to fix the problem before it gets to this stage, perhaps by looking at how url is defined. But assuming the content from the web is malformed, a workaround would be:

    In [86]: import re
    
    In [87]: print(re.sub(r'&#x([a-fA-F\d]+);',lambda m: unichr(int(m.group(1),base=16)),s))
    Star Trek XI £3.99
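
For readers on Python 3 (an assumption; the transcript above is Python 2), a rough equivalent might look like the sketch below. `unichr` no longer exists there, so `chr` takes its place, and the standard library's `html.unescape` covers the same case along with named entities. The helper name `decode_hex_entities` is hypothetical.

```python
import html
import re

s = 'Star Trek XI &#xA3;3.99'

def decode_hex_entities(text):
    """Replace hexadecimal character references such as &#xA3; with the character."""
    return re.sub(r'&#x([0-9a-fA-F]+);',
                  lambda m: chr(int(m.group(1), 16)),
                  text)

# Same regex approach as above, with chr() instead of unichr().
print(decode_hex_entities(s))

# Standard-library alternative that also handles decimal and named entities.
print(html.unescape(s))
```

Both calls should give the title with a plain pound sign; `html.unescape` is probably the safer choice when the input may contain other kinds of entities.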
    

    A little bit of explanation:

    Note that

    In [51]: x=u'£'
    In [53]: x.encode('utf-8')
    Out[53]: '\xc2\xa3'
    

    So the unicode object u'£', encoded with the utf-8 codec, becomes the string object '\xc2\xa3'.

    Somehow, url['title'] is getting defined to be the unicode object u'\xc2\xa3'. (The u makes a big difference!)

    Thus we have u'\xc2\xa3' when we desire '\xc2\xa3'. Encoding the unicode object u'\xc2\xa3' with the raw-unicode-escape codec transforms it to '\xc2\xa3'.
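
The same repair can be sketched in Python 3 (an assumption; the thread itself uses Python 2). The helper name `fix_mojibake` is hypothetical, and it uses the latin-1 codec in place of raw-unicode-escape: for code points below 256 both produce the same single bytes, but latin-1 raises an error on anything higher instead of emitting a `\uXXXX` escape, which makes failures easier to detect.

```python
def fix_mojibake(text):
    """Try to undo UTF-8 bytes that were wrongly decoded one byte per character.

    Returns the input unchanged when it does not look like that kind of damage.
    """
    try:
        # Turn each character back into its original byte, then decode properly.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character above U+00FF or bytes that are not valid UTF-8:
        # the text was probably fine already, so leave it alone.
        return text

# In Python 3 this literal is a str holding U+00C2 followed by U+00A3.
s = 'Oscar Winners Best Pictures Box Set \xc2\xa36.49'

print(fix_mojibake(s))

# The codec used in the answer yields the same bytes for this input.
print(s.encode('raw-unicode-escape') == s.encode('latin-1'))
```

This is a workaround in the same spirit as the one above; fixing the decoding step where the data is first read remains the cleaner solution.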
