How to properly iterate over unicode characters in Python

走了就别回头了

I would like to iterate over a string and output all emojis.

I'm trying to iterate over the characters, and check them against an emoji list.

However, python se

2 Answers

被撕碎了的回忆

2021-01-25 05:44
Try this:
```
import re
re.findall(r'[^\w\s,]', my_list[0])
```
The regex r'[^\w\s,]' matches any single character that is not a word character, whitespace, or a comma. Note that this also matches punctuation other than commas, not only emoji.
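As a rough illustration of what that pattern picks up, here is a minimal sketch for Python 3. The contents of `my_list` are not shown in the question, so the sample string below is a hypothetical stand-in.

```python
import re

# Hypothetical sample data; the real `my_list` from the question is not shown.
my_list = ["Hello, world! \U0001F60D ok \U0001F44D"]

# Collect every character that is not a word character, whitespace, or a comma.
matches = re.findall(r"[^\w\s,]", my_list[0])

# The exclamation mark is returned along with the two emoji.
print(matches)
```

Because the character class is defined by exclusion, the result still has to be filtered against an emoji list if only emoji are wanted.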
醉酒成梦

2021-01-25 06:05
Before version 3.3, Python stored Unicode strings internally as UTF-16 (narrow build) or UTF-32 (wide build) and, through a leaky abstraction, exposed that detail to the user. UTF-16 represents each code point above U+FFFF as a surrogate pair, that is, two 16-bit code units, so on a narrow build such a character shows up as two elements of the string. To fix the issue, either use a wide Python build or switch to Python 3.3 or later.
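To see what a surrogate pair looks like, the following sketch (written for Python 3.3 or later, not for a narrow build) encodes one emoji to UTF-16 and compares the result with the pair computed by hand. The variable names are only illustrative.

```python
import struct

ch = "\U0001F60D"  # a code point above U+FFFF

# Encode to UTF-16 (little-endian, no BOM) and read it back as two 16-bit units.
high, low = struct.unpack("<2H", ch.encode("utf-16-le"))
print(hex(high), hex(low))  # 0xd83d 0xde0d

# The same pair computed by hand from the code point.
offset = ord(ch) - 0x10000
calc_high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
calc_low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
print((calc_high, calc_low) == (high, low))  # True

# On Python 3.3+ the string itself is still a single element.
print(len(ch))  # 1
```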

One way of dealing with a narrow build is to match the surrogate pairs:

Python 2.7 (narrow build):
```
>>> import re
>>> s = u'Test \U0001f60d'
>>> len(s)
7
>>> re.findall(u'(?:[\ud800-\udbff][\udc00-\udfff])|.',s)
[u'T', u'e', u's', u't', u' ', u'\U0001f60d']
```
Python 3.6:
```
>>> s = 'Test \U0001f60d'
>>> len(s)
6
>>> list(s)
['T', 'e', 's', 't', ' ', '😍']
```
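Putting this together for the original goal, a minimal sketch for Python 3.3 or later might look like the following. The `EMOJI` set and the `find_emojis` helper are hypothetical names for illustration, and a real emoji list would be much larger.

```python
# Hypothetical emoji set; a real list would contain many more entries.
EMOJI = {"\U0001F60D", "\U0001F44D", "\U0001F602"}


def find_emojis(text, emoji_set=EMOJI):
    """Return the characters of `text` that appear in `emoji_set`, in order."""
    # On Python 3.3+ each element of a str is one full code point,
    # so plain iteration is enough for single-code-point emoji.
    return [ch for ch in text if ch in emoji_set]


s = "Test \U0001F60D and \U0001F44D"
found = find_emojis(s)
print(found)
```

This only handles emoji that consist of a single code point. Sequences built from several code points (for example, with skin-tone modifiers or zero-width joiners) would be split into their parts by character-level iteration.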