ElementTree and unicode


I have this char in an xml file:


  
      fumè

I t


6 Answers

闹比i

2020-12-06 00:51

open() returns a file object, not a string. Use open('file.xml').read() instead.
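
A minimal sketch of the difference, assuming Python 3 (where the byte/text split is `bytes`/`str`). The temporary file and its contents are made up for illustration:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document, written to disk as UTF-8 bytes for the demo.
xml_bytes = "<data><products><color>fumè</color></products></data>".encode("utf-8")

with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_bytes)
    path = tmp.name

try:
    # open() returns a file object; fromstring() needs its contents.
    # Binary mode hands ElementTree the raw bytes so it can decode them itself.
    with open(path, "rb") as f:
        root = ET.fromstring(f.read())
    color = root.find("products/color").text
finally:
    os.remove(path)

print(ascii(color))  # ascii() escapes non-ASCII so it prints on any terminal
```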

傲寒

2020-12-06 00:52
Have you tried using the parse function instead of opening the file yourself? (Opening it would, by the way, require a .read() on the file object for .fromstring() to work.)
```
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
# etc...
```
囚心锁ツ

2020-12-06 00:57
You do not need to decode XML for ElementTree to work. XML carries its own encoding information (defaulting to UTF-8), and ElementTree does the work for you, returning unicode:
```
>>> from xml.etree import ElementTree
>>> data = '''\
... <data>
...   <products>
...       <color>fumè</color>
...   </products>
... </data>
... '''
>>> x = ElementTree.fromstring(data)
>>> x[0][0].text
u'fum\xe8'
```
If your data is in a file or file-like object, just pass the filename or the file object directly to the ElementTree.parse() function:
```
x = ElementTree.parse('file.xml')
```
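
As a rough illustration of the file-object case, here is a sketch assuming Python 3. The `io.BytesIO` buffer is a hypothetical in-memory stand-in for an open file:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a file opened in binary mode; the bytes are UTF-8.
xml_file = io.BytesIO(
    "<data><products><color>fumè</color></products></data>".encode("utf-8")
)

tree = ET.parse(xml_file)   # parse() accepts a filename or a file object
root = tree.getroot()
color = root[0][0].text     # first child of the first child of <data>

print(ascii(color))         # ascii() escapes non-ASCII so it prints on any terminal
```

No explicit decode step appears anywhere: the parser reads bytes and returns text.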
一个人的身影

2020-12-06 01:01

You might have stumbled upon this problem while using Requests (HTTP for Humans). response.text decodes the response by default; use response.content to get the undecoded data, so that ElementTree can decode it itself. Just remember to use the correct encoding.

More info: http://docs.python-requests.org/en/latest/user/quickstart/#response-content
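
To illustrate why the raw bytes are the safer input, here is a sketch in Python 3 that simulates the two attributes with plain values instead of making a real HTTP request. The variables `content` and `text` merely mirror `response.content` and `response.text`, and the ISO-8859-1 mis-decoding is an assumed failure scenario (for example, a server that does not declare a charset), not something every response will show:

```python
import xml.etree.ElementTree as ET

# Stand-in for response.content: the undecoded body, here UTF-8 bytes.
content = "<data><color>fumè</color></data>".encode("utf-8")

# Stand-in for response.text under the ASSUMED failure case where the
# body was decoded with the wrong charset (ISO-8859-1 instead of UTF-8).
text = content.decode("iso-8859-1")

from_bytes = ET.fromstring(content).find("color").text
from_text = ET.fromstring(text).find("color").text

print(ascii(from_bytes))  # the intended character survives
print(ascii(from_text))   # mojibake: two characters where one was meant
```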

渐次进展

2020-12-06 01:05
You need to decode UTF-8 strings into a unicode object. So
```
string_data.encode('utf-8')
```
should be
```
string_data.decode('utf-8')
```
assuming string_data is actually a UTF-8 string.

To summarize: to get a UTF-8 string from a unicode object, you encode the unicode object (using the UTF-8 encoding); to turn a string into a unicode object, you decode the string using its encoding.

For more details on the concepts I suggest reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (not Python specific).
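
A short round trip makes the direction of each call concrete. This sketch assumes Python 3, where the unicode object is `str` and the UTF-8 string is `bytes` (in Python 2 the corresponding types are `unicode` and `str`):

```python
text = "fum\u00e8"                  # unicode text 'fumè', 4 characters

encoded = text.encode("utf-8")      # unicode -> UTF-8 bytes
decoded = encoded.decode("utf-8")   # UTF-8 bytes -> unicode

print(encoded)                      # b'fum\xc3\xa8'
print(len(text), len(encoded))      # 4 5
print(decoded == text)              # True
```

The single character è takes two bytes in UTF-8, which is why the encoded form is one unit longer.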
迷失自我

2020-12-06 01:14

Most likely your file is not UTF-8. The è character could come from some other encoding, Latin-1 for example.
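
One way to check this hypothesis is sketched below, assuming Python 3. The sample bytes are made up: the same element is encoded as Latin-1 and parsed with and without an encoding declaration.

```python
import xml.etree.ElementTree as ET

# The same element encoded as Latin-1: 'è' becomes the single byte 0xE8.
latin1_body = "<color>fumè</color>".encode("latin-1")

# Without a declaration the parser assumes UTF-8, and 0xE8 is not valid there.
try:
    ET.fromstring(latin1_body)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

# With the encoding declared in the XML prolog, the same bytes parse cleanly.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body
color = ET.fromstring(declared).text

print(parsed_without_declaration)
print(ascii(color))  # ascii() escapes non-ASCII so it prints on any terminal
```

If the file really is Latin-1, adding the matching declaration to the XML (or re-saving the file as UTF-8) lets ElementTree decode it without any manual decode call.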
