Loading UTF-8 file in Python 3 using numpy.genfromtxt

野性不改 2020-11-29 13:17

I have a CSV file that I downloaded from the WHO site (http://apps.who.int/gho/data/view.main.52160, Downloads, "multipurpose table in CSV format"). I try to load the file in

1 Answer
  • 2020-11-29 13:45

    In Python 3 I can do:

    In [224]: txt = "Côte d'Ivoire"
    In [225]: x = np.zeros((2,),dtype='U20')
    In [226]: x[0] = txt
    In [227]: x
    Out[227]: 
    array(["Côte d'Ivoire", ''],   dtype='<U20')
    

    This means I could probably open a UTF-8 file in regular text mode (not byte mode), read it with readlines, and assign the lines to the elements of an array like x.
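
    A minimal sketch of that idea, assuming a small UTF-8 text file with one name per line. The sample text, the temporary file, and the names `lines` and `names` are made up for illustration and do not come from the WHO download:

```python
import os
import tempfile

import numpy as np

# Made-up sample standing in for the real file: one name per line.
sample = "Côte d'Ivoire\nCuraçao\n"

# Write it out as UTF-8 so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Regular text mode with an explicit encoding: every line is a str.
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f.readlines()]
finally:
    os.remove(path)

# Assign the lines to the elements of a unicode ('U') array, like x above.
names = np.zeros((len(lines),), dtype="U20")
for i, line in enumerate(lines):
    names[i] = line

print(names)
```

    This only covers a file of plain strings; splitting on a delimiter and converting numeric columns would still have to be done by hand.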

    But genfromtxt insists on operating on byte strings (ASCII), which can't represent the larger UTF-8 character set (7 bits vs. 8). So I need to apply decode at some point to get a U array.
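
    To make the bytes versus text distinction concrete, here is a small standard-library sketch. The variable names `raw`, `ascii_ok`, and `roundtrip` are only illustrative:

```python
txt = "Côte d'Ivoire"

# UTF-8 stores 'ô' as two bytes, so the byte string is longer than the text.
raw = txt.encode("utf-8")
print(len(txt), len(raw))

# Plain ASCII has no code for 'ô', so encoding to ASCII fails.
try:
    txt.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)

# decode() turns the UTF-8 bytes back into the original str.
roundtrip = raw.decode("utf-8")
print(roundtrip == txt)
```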

    I can load it into an 'S' array with genfromtxt:

    In [258]: txt="Côte d'Ivoire"
    In [259]: a=np.genfromtxt([txt.encode()],delimiter=',',dtype='S20')
    In [260]: a
    Out[260]: 
    array(b"C\xc3\xb4te d'Ivoire",  dtype='|S20')
    

    and apply decode to individual elements:

    In [261]: print(a.item().decode())
    Côte d'Ivoire
    
    In [325]: print(_)
    Côte d'Ivoire
    

    Or use np.char.decode to apply it to each element of an array:

    In [263]: np.char.decode(a)
    Out[263]: 
    array("Côte d'Ivoire", dtype='<U13')
    In [264]: print(_)
    Côte d'Ivoire
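
    The same call works on an array with more than one element. A short sketch, where the two names and the variables `raw` and `decoded` are made up for illustration:

```python
import numpy as np

# A bytes ('S') array like the one genfromtxt returns, built by hand here.
raw = np.array(
    ["Côte d'Ivoire".encode("utf-8"), "Curaçao".encode("utf-8")],
    dtype="S20",
)

# Decode every element at once; the result is a unicode ('U') array.
decoded = np.char.decode(raw, "utf-8")

print(decoded)
print(decoded.dtype.kind)
```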
    

    genfromtxt lets me specify converters:

    In [297]: np.genfromtxt([txt.encode()],delimiter=',',dtype='U20',
        converters={0:lambda x: x.decode()})
    Out[297]: 
    array("Côte d'Ivoire", dtype='<U20')
    

    If the CSV has a mix of strings and numbers, the converters approach will be easier to use than np.char.decode. Just specify a converter for each string column.
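
    A rough sketch of the mixed case. The rows, the field names `country`, `year`, `value`, and the helper `decode_utf8` are invented for illustration. It also assumes NumPy 1.14 or later, where genfromtxt accepts an `encoding` argument; passing `encoding="bytes"` asks for the bytes-based behaviour used in the examples above, so the converter receives raw bytes:

```python
import numpy as np

# Invented rows (name, year, value), UTF-8 encoded as they would be on disk.
rows = [
    "Côte d'Ivoire,2012,1.5".encode("utf-8"),
    "Curaçao,2013,2.5".encode("utf-8"),
]


def decode_utf8(field):
    # Hypothetical helper: turn the raw bytes of one field into a str.
    return field.decode("utf-8")


data = np.genfromtxt(
    rows,
    delimiter=",",
    # One 'U' column for the text, ordinary numeric types for the rest.
    dtype=[("country", "U20"), ("year", "i4"), ("value", "f8")],
    # Only the string column (index 0) needs a converter.
    converters={0: decode_utf8},
    # Assumption: NumPy >= 1.14; converters are handed bytes in this mode.
    encoding="bytes",
)

print(data["country"])
print(data["year"])
```

    The numeric columns need no converter because genfromtxt parses them from the dtype.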

    (See my earlier edits for Python 2 attempts.)
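
    For comparison, a sketch that hands the decoding to genfromtxt itself. This assumes NumPy 1.14 or later, where genfromtxt takes an `encoding` argument; the sample rows and the temporary file are invented, and `f0`, `f1`, `f2` are the default field names genfromtxt assigns when `dtype=None` and no names are given:

```python
import os
import tempfile

import numpy as np

# Invented rows standing in for the real CSV.
sample = "Côte d'Ivoire,2012,1.5\nCuraçao,2013,2.5\n"

with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".csv",
                                 delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Assumption: NumPy >= 1.14. The file is decoded as UTF-8 while reading,
    # and dtype=None lets genfromtxt infer a type for each column.
    table = np.genfromtxt(path, delimiter=",", dtype=None, encoding="utf-8")
finally:
    os.remove(path)

print(table["f0"])
print(table.dtype)
```

    With this form no converter is needed, since the string column already arrives as a 'U' field.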
