How to properly iterate over Unicode characters in Python

走了就别回头了 2021-01-25 05:26

I would like to iterate over a string and output all emojis.

I'm trying to iterate over the characters and check them against an emoji list.

However, python se

2 Answers
  • 2021-01-25 05:44

    Try this:

    import re
    re.findall(r'[^\w\s,]', my_list[0])
    

    The regex r'[^\w\s,]' matches any character that is not a word character, whitespace, or a comma, so it also picks up other punctuation and symbols, not only emojis.
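To see what that pattern actually returns, here is a minimal sketch, assuming Python 3.3 or later. The thread never defines `my_list`, so the sample value below is hypothetical.

```python
import re

# Hypothetical sample input; the thread does not show what my_list contains.
my_list = ["Hello, world \U0001f60d! Nice \u2764"]

# Collect every character that is not a word character, whitespace, or a comma.
matches = re.findall(r'[^\w\s,]', my_list[0])

# The exclamation mark is returned alongside the two emoji characters.
print(matches)
```

Because the exclamation mark comes back too, the result would probably still need filtering against an emoji list.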

  • 2021-01-25 06:05

    Python versions before 3.3 use UTF-16LE (narrow build) or UTF-32LE (wide build) internally for storing Unicode and, through a leaky abstraction, expose this detail to the user. UTF-16LE represents Unicode characters above U+FFFF as surrogate pairs, that is, two 16-bit code units, so on a narrow build such a character shows up as two elements of the string. To fix the issue, either use a wide Python build or switch to Python 3.3 or later.
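As a rough illustration of the surrogate-pair arithmetic described above, the sketch below splits a code point above U+FFFF into its two UTF-16 code units. The helper name `to_surrogate_pair` is made up for this example; it is not part of Python.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    offset = code_point - 0x10000        # 20-bit value to distribute
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F60D)
print(hex(high), hex(low))  # 0xd83d 0xde0d

# Cross-check against Python's own UTF-16LE encoder (little-endian byte order).
encoded = '\U0001f60d'.encode('utf-16-le')
print(encoded)
```

The two values fall in the ranges U+D800 to U+DBFF and U+DC00 to U+DFFF, which is why a regex for narrow builds can match on those two character classes.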

    One way of dealing with a narrow build is to match the surrogate pairs:

    Python 2.7 (narrow build):

    >>> import re
    >>> s = u'Test \U0001f60d'
    >>> len(s)
    7
    >>> re.findall(u'(?:[\ud800-\udbff][\udc00-\udfff])|.',s)
    [u'T', u'e', u's', u't', u' ', u'\U0001f60d']
    

    Python 3.6:

    >>> s = 'Test \U0001f60d'
    >>> len(s)
    6
    >>> list(s)
    ['T', 'e', 's', 't', ' ', '😍']
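Going back to the original question, a minimal sketch of checking each character against an emoji list could look like this on Python 3.3 or later. The `EMOJI` set and the `find_emojis` helper are hypothetical; a real list would be much longer.

```python
# Hypothetical, tiny emoji list used only for illustration.
EMOJI = {'\U0001f60d', '\U0001f600', '\u2764'}

def find_emojis(text, emoji_set=EMOJI):
    """Return the characters of text that appear in emoji_set, in order."""
    # On Python 3.3+ each element of a str is a full code point,
    # so characters above U+FFFF are not split into surrogates.
    return [ch for ch in text if ch in emoji_set]

found = find_emojis('Test \U0001f60d and \U0001f600')
print(found)
```

Note that this works per code point. Emojis built from several code points (for example, skin-tone modifiers or zero-width-joiner sequences) would not be matched as a single unit by this approach.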