Reading unicode elements into numpy array

Asked by 逝去的感伤 on 2020-12-06 17:36

Consider a text file called "new.txt" containing the following elements:

μm
∂r
∆λ

In Python 2.7, I can read the file by typing:
2 Answers
  • 2020-12-06 18:02

    If you want to use loadtxt, you can either first load the raw byte array and then decode:

    data = np.loadtxt('new.txt', dtype='S8')
    unicode_data = data.view(np.chararray).decode('utf-8')
    

    or specify a converter for decoding:

    data = np.loadtxt('new.txt', converters={0: lambda x: unicode(x, 'utf-8')}, dtype='U2')
    

    However, using fromiter as in Sven's answer is probably going to be more efficient than loadtxt.
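
    On Python 3 with a reasonably recent NumPy, neither the byte-array detour nor the converter is needed, since loadtxt itself takes an encoding argument. A minimal sketch, recreating the file from the question:

```python
import numpy as np
from pathlib import Path

# Recreate the example file from the question (UTF-8, one token per line).
Path("new.txt").write_text("μm\n∂r\n∆λ\n", encoding="utf-8")

# loadtxt decodes the file itself when given an encoding, so the
# result is a plain unicode array -- no converter needed.
arr = np.loadtxt("new.txt", dtype="U2", encoding="utf-8")
print(arr)  # ['μm' '∂r' '∆λ']
```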

  • 2020-12-06 18:03

    In memory, unicode strings are represented as UCS-2 or UCS-4, depending on how your Python interpreter was compiled. Your file is encoded in UTF-8, so you need to recode it before you can map it to the NumPy array. loadtxt() can't do the recoding for you -- after all, NumPy is mainly targeted at numerical arrays.

    Assuming every line has the same number of characters (and a UCS-4 interpreter build, so that the in-memory layout matches NumPy's "<U" dtype), you could also use the more efficient variant

    import codecs
    import numpy

    s = codecs.open("new.txt", encoding="utf-8").read()
    arr = numpy.frombuffer(s, dtype="<U3")
    

    This will include the newline characters in the strings. To exclude them, use

    arr = numpy.frombuffer(s.replace("\n", ""), dtype="<U2")
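
    On Python 3, frombuffer no longer accepts str, but the same zero-copy idea survives if you encode to UTF-32-LE first, which matches the little-endian UCS-4 layout of NumPy's "<U" dtype. A sketch under that assumption:

```python
import numpy as np
from pathlib import Path

# Recreate the example file from the question (UTF-8, one token per line).
Path("new.txt").write_text("μm\n∂r\n∆λ\n", encoding="utf-8")

s = Path("new.txt").read_text(encoding="utf-8")

# UTF-32-LE produces exactly the 4-bytes-per-code-point layout that
# NumPy uses for "<U" arrays, so the buffer can be reinterpreted
# directly: 3 code points per item (2 characters plus the newline).
arr = np.frombuffer(s.encode("utf-32-le"), dtype="<U3")

# Dropping the newlines first gives 2-character items instead.
arr2 = np.frombuffer(s.replace("\n", "").encode("utf-32-le"), dtype="<U2")
```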
    

    Edit: If the lines of your file have different lengths and you would like to avoid the intermediate list, you can use

    arr = numpy.fromiter(codecs.open("new.txt", encoding="utf-8"), dtype="<U2")
    

    I'm not sure if this will internally create some temporary list, though.
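
    If you'd rather not rely on fromiter's handling of string dtypes (support has varied across NumPy versions), an explicit list comprehension over the file is a straightforward, if less memory-frugal, alternative:

```python
import numpy as np
from pathlib import Path

# Recreate the example file from the question (UTF-8, one token per line).
Path("new.txt").write_text("μm\n∂r\n∆λ\n", encoding="utf-8")

# Strip newlines and collect the lines into an intermediate list,
# then let np.array build the fixed-width unicode array from it.
with open("new.txt", encoding="utf-8") as f:
    arr = np.array([line.rstrip("\n") for line in f], dtype="<U2")
```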
