Remove all characters from a string whose ordinals are out of range

眼角桃花 2021-01-14 07:34

What is a good way to remove all characters whose ordinals are not in range(128) from a string in Python?

I'm using hashlib.sha256 in Python 2.7. I...

3 Answers
  • 2021-01-14 08:17

    Instead of removing those characters, it would be better to encode the string to bytes with an encoding that can represent them, so that hashlib won't choke. UTF-8, for example:

    >>> import hashlib
    >>> data = u'\u200e'
    >>> hashlib.sha256(data.encode('utf-8')).hexdigest()
    'e76d0bc0e98b2ad56c38eebda51da277a591043c9bc3f5c5e42cd167abc7393e'
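
To illustrate why encoding is preferable to removing characters, here is a minimal sketch written for Python 3. The helper names `sha256_utf8` and `sha256_ascii_only` are hypothetical, chosen only for this example. It shows that dropping non-ASCII characters makes two different strings produce the same digest, while UTF-8 encoding keeps them distinct.

```python
import hashlib


def sha256_utf8(text):
    """Hash text after encoding it as UTF-8 (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sha256_ascii_only(text):
    """Hash text after silently dropping every non-ASCII character
    (hypothetical helper, shown only for comparison)."""
    return hashlib.sha256(text.encode("ascii", "ignore")).hexdigest()


with_pound = "You owe me \u00a3100"   # contains the pound sign
without_pound = "You owe me 100"      # plain ASCII

# Dropping non-ASCII characters makes the two strings hash alike.
print(sha256_ascii_only(with_pound) == sha256_ascii_only(without_pound))  # True

# Encoding as UTF-8 preserves the difference.
print(sha256_utf8(with_pound) == sha256_utf8(without_pound))  # False
```

Whether such collisions matter depends on what the hash is used for, but they are usually unwanted.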
    
  • 2021-01-14 08:30
    new_safe_str = some_string.encode('ascii','ignore') 
    

    I think this would work.

    Or you could use a list comprehension:

    "".join([ch for ch in orig_string if ord(ch) < 128])
    

    [edit] However, as others have said, it may be better to figure out how to deal with Unicode in general, unless you really need the text encoded as ASCII for some reason.
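
If stripping really is what you need, the two approaches in this answer can be sketched as follows for Python 3. The function names are hypothetical. ASCII covers ordinals 0 through 127, so the comparison has to be `ord(ch) < 128`. In Python 3, `encode` returns bytes, so the codec-based variant decodes back to `str` afterwards.

```python
def strip_non_ascii(text):
    """Keep only characters whose ordinal is in range(128)."""
    return "".join(ch for ch in text if ord(ch) < 128)


def strip_non_ascii_via_codec(text):
    """Same result, using the ASCII codec with errors='ignore'."""
    return text.encode("ascii", "ignore").decode("ascii")


sample = "caf\u00e9 \u200e\u00a3100"

print(strip_non_ascii(sample))            # caf 100
print(strip_non_ascii_via_codec(sample))  # caf 100

# U+0080 (ordinal 128) is the first character outside ASCII and is dropped;
# U+007F (ordinal 127) is the last one inside and is kept.
print(strip_non_ascii("a\x80b") == "ab")       # True
print(strip_non_ascii("a\x7fb") == "a\x7fb")   # True
```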

  • 2021-01-14 08:30

    This is an example of where the changes in Python 3 make an improvement, or at least generate a clearer error message.

    Python 2

    >>> import hashlib
    >>> funky_string=u"You owe me £100"
    >>> hashlib.sha256(funky_string)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 11: ordinal not in range(128)
    >>> hashlib.sha256(funky_string.encode("utf-8")).hexdigest()
    '81ebd729153b49aea50f4f510972441b350a802fea19d67da4792b025ab6e68e'
    >>> 
    

    Python 3

    >>> import hashlib
    >>> funky_string="You owe me £100"
    >>> hashlib.sha256(funky_string)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Unicode-objects must be encoded before hashing
    >>> hashlib.sha256(funky_string.encode("utf-8")).hexdigest()
    '81ebd729153b49aea50f4f510972441b350a802fea19d67da4792b025ab6e68e'
    >>> 
    

    The real problem is that sha256 takes a sequence of bytes, and Python 2 doesn't clearly separate bytes from text. I'd suggest using .encode("utf-8").
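
As a rough Python 3 sketch of the bytes-versus-text point: the helper `sha256_hex` below is hypothetical and not part of hashlib. It accepts either bytes or `str` and encodes text explicitly before hashing. The last comparison shows that the digest depends on the chosen encoding, so the encoding should be picked once and used consistently.

```python
import hashlib


def sha256_hex(data, encoding="utf-8"):
    """Return the SHA-256 hex digest of bytes or text (hypothetical helper).

    Text is encoded explicitly; bytes are hashed as they are.
    """
    if isinstance(data, str):
        data = data.encode(encoding)
    return hashlib.sha256(data).hexdigest()


funky_string = "You owe me \u00a3100"

# hashlib itself refuses text in Python 3.
try:
    hashlib.sha256(funky_string)
except TypeError as exc:
    print("hashlib rejected str:", exc)

# Text and its UTF-8 bytes give the same digest through the helper.
print(sha256_hex(funky_string) == sha256_hex(funky_string.encode("utf-8")))  # True

# A different encoding yields different bytes, hence a different digest.
print(sha256_hex(funky_string, "utf-8") == sha256_hex(funky_string, "latin-1"))  # False
```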
