Loading UTF-8 file in Python 3 using numpy.genfromtxt

后端未结

I have a CSV file that I downloaded from the WHO site (http://apps.who.int/gho/data/view.main.52160 , Downloads, "multipurpose table in CSV format"). I try to load the file in

1 Answer

刺人心

2020-11-29 13:45
In Python3 I can do:
```
In [224]: txt = "Côte d'Ivoire"
In [225]: x = np.zeros((2,),dtype='U20')
In [226]: x[0] = txt
In [227]: x
Out[227]: 
array(["Côte d'Ivoire", ''],   dtype='<U20')
```
This means I could probably open a UTF-8 file in regular text mode (not byte mode), call readlines, and assign the resulting strings to the elements of an array like x.
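A minimal sketch of that idea, assuming a small sample file written on the fly (the file contents and the second country name are made up for illustration, not taken from the WHO download):

```python
import os
import tempfile

import numpy as np

# Hypothetical sample data; "\u00f4" is the precomposed character "ô".
sample = ["C\u00f4te d'Ivoire", "Cura\u00e7ao"]

# Write the sample to a temporary UTF-8 file.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as f:
    f.write("\n".join(sample) + "\n")
    path = f.name

# Open in text mode so Python decodes UTF-8 into str objects.
with open(path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f.readlines()]
os.remove(path)

# Assign each decoded line to an element of a unicode ('U') array.
x = np.zeros((len(lines),), dtype="U20")
for i, line in enumerate(lines):
    x[i] = line

print(x)
```

This only covers one string per line; splitting columns and converting numbers would still have to be done by hand.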

But genfromtxt insists on operating on byte strings and treats them as ASCII, which is a 7-bit encoding and cannot represent the non-ASCII characters that UTF-8 stores as multi-byte sequences (7 bits v 8). So I need to apply decode at some point to get a 'U' array.
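To make the byte-level difference concrete, here is a short illustration using only the standard library. It shows that the single character "ô" becomes two bytes in UTF-8, both outside the 7-bit ASCII range, so an ASCII decode fails:

```python
txt = "C\u00f4te d'Ivoire"  # "\u00f4" is the precomposed "ô"
raw = txt.encode("utf-8")

n_chars = len(txt)   # number of Unicode code points
n_bytes = len(raw)   # number of bytes after UTF-8 encoding

# Bytes with the high bit set cannot be ASCII.
non_ascii = [b for b in raw if b > 0x7F]

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(n_chars, n_bytes, [hex(b) for b in non_ascii], ascii_ok)
```

The one-byte difference between the two lengths is why a 'U13' array holds the decoded string while the encoded form needs 14 bytes.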

I can load it into a 'S' array with genfromtxt:
```
In [258]: txt="Côte d'Ivoire"
In [259]: a=np.genfromtxt([txt.encode()],delimiter=',',dtype='S20')
In [260]: a
Out[260]: 
array(b"C\xc3\xb4te d'Ivoire",  dtype='|S20')
```
and apply decode to individual elements:
```
In [261]: print(a.item().decode())
Côte d'Ivoire
```
Or use np.char.decode to apply it to each element of an array:
```
In [263]: np.char.decode(a)
Out[263]: 
array("Côte d'Ivoire", dtype='<U13')
In [264]: print(_)
Côte d'Ivoire
```
genfromtxt lets me specify converters:
```
In [297]: np.genfromtxt([txt.encode()],delimiter=',',dtype='U20',
    converters={0:lambda x: x.decode()})
Out[297]: 
array("Côte d'Ivoire", dtype='<U20')
```
If the csv has a mix of strings and numbers, this converters approach will be easier to use than the np.char.decode. Just specify the converter for each string column.
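As an alternative to per-column converters, NumPy 1.14 and later accept an `encoding` argument in genfromtxt, so the decoding step can be left to the reader. The sketch below assumes such a NumPy version; the rows, field names, and numeric values are hypothetical and only stand in for a mixed string/number CSV:

```python
import numpy as np

# Hypothetical two-column data: a country name and a number.
rows = [
    "C\u00f4te d'Ivoire,22.7",
    "Cura\u00e7ao,0.16",
]

# encoding="utf-8" (NumPy >= 1.14) makes genfromtxt work with str
# instead of bytes, so 'U' fields need no explicit decode step.
data = np.genfromtxt(
    rows,
    delimiter=",",
    dtype=[("country", "U20"), ("value", "f8")],
    encoding="utf-8",
)

print(data["country"])
print(data["value"])
```

For a real file, the list of rows would be replaced by the file name. Note that when `encoding` is set this way, any converters receive str rather than bytes, so a converter that calls `.decode()` would need adjusting.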

(See my earlier edits for Python2 tries).