Weird behaviour initializing a numpy array of string data

挽巷 2020-12-05 22:38

I am having some seemingly trivial trouble with numpy when the array contains string data. I have the following code:

my_array = numpy.empty([1, 2], dty         


        
5 answers
  • 2020-12-05 23:21

    NumPy requires string arrays to have a fixed maximum length. When you create an empty array with dtype=str, that maximum defaults to 1. You can check with my_array.dtype: it shows "|S1" (a one-byte string) on Python 2 and "<U1" (a one-character Unicode string) on Python 3. Anything you assign into the array afterwards is silently truncated to fit.

    You can pass an explicit datatype with your maximum length by doing, e.g.:

    my_array = numpy.empty([1, 2], dtype="S10")
    

    The "S10" dtype creates an array of byte strings up to 10 bytes long. You have to decide how long is long enough to hold all the data you want to store.
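
A minimal sketch of the truncation described above, assuming Python 3 (where dtype=str yields a one-character Unicode dtype rather than "|S1"). The names default_array and sized_array are made up for illustration.

```python
import numpy

# dtype=str with no explicit length gives a one-character string dtype.
default_array = numpy.empty([1, 2], dtype=str)
default_array[0, 0] = "CAT"          # silently truncated to "C"

# An explicit width leaves room for longer values.
sized_array = numpy.empty([1, 2], dtype="S10")
sized_array[0, 0] = "CAT"            # stored as the byte string b"CAT"
sized_array[0, 1] = "WATERMELONS"    # 11 characters, cut to the first 10

print(default_array[0, 0])
print(sized_array[0, 0], sized_array[0, 1])
```

Note that no warning or error is raised in either case, so an undersized dtype only shows up when you read the values back.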

  • 2020-12-05 23:22

    I got a "codec error" when I tried to store a non-ASCII character with dtype="S10".

    You also get an array of byte strings rather than ordinary text strings, which confused me.

    I think it is better to use:

    my_array = numpy.empty([1, 2], dtype="<U10")

    Here "<U10" means "Unicode string of at most 10 characters, little-endian byte order".
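
The difference between the two dtypes can be sketched as follows, assuming Python 3. The sample value "café" and the names byte_array, unicode_array and ascii_failed are illustrative only.

```python
import numpy

# "S10" holds bytes; a Python 3 str is encoded as ASCII on assignment.
byte_array = numpy.empty([1, 2], dtype="S10")
try:
    byte_array[0, 0] = "café"        # "é" is outside ASCII
    ascii_failed = False
except UnicodeError:
    ascii_failed = True

# "<U10" holds Unicode text, so the same value is stored unchanged.
unicode_array = numpy.empty([1, 2], dtype="<U10")
unicode_array[0, 0] = "café"

print(ascii_failed, unicode_array[0, 0])
```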

  • 2020-12-05 23:26

    A NumPy string array is limited by its fixed length (1 by default). If you don't know in advance how long your strings will be, you can use dtype=object, which lets each element hold a string of arbitrary length:

    my_array = numpy.empty([1, 2], dtype=object)
    

    I understand there may be efficiency drawbacks to this approach, but I don't have a good reference to support that.
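
A short sketch of the object-dtype approach, showing that values of different lengths are stored without truncation. The name object_array and the sample strings are placeholders.

```python
import numpy

# With dtype=object each cell holds a reference to a Python object,
# so there is no fixed string width. Cells start out as None.
object_array = numpy.empty([1, 2], dtype=object)
object_array[0, 0] = "CAT"
object_array[0, 1] = "A MUCH LONGER STRING THAN BEFORE"

lengths = [len(value) for value in object_array[0]]
print(object_array.dtype, lengths)
```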

  • 2020-12-05 23:27

    Another alternative is to initialize as follows:

    my_array = numpy.array([["CAT", "APPLE"], ["", ""]], dtype=str)
    

    In other words, first you write a regular nested list with the values you want, then you convert it into a NumPy array. However, this fixes the maximum string length to that of the longest string at initialization (here 5, from "APPLE"). So if you were to add

    my_array[1,0] = 'PINEAPPLE'
    

    then the string stored would be 'PINEA'.
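
The following sketch reproduces that truncation and shows one possible workaround, assuming you are willing to copy the array: astype with a wider dtype. The names truncated and wider are illustrative, and the width 9 is chosen only to fit "PINEAPPLE".

```python
import numpy

my_array = numpy.array([["CAT", "APPLE"], ["", ""]], dtype=str)

# The width is inferred from the longest initial value ("APPLE", 5 chars).
my_array[1, 0] = "PINEAPPLE"
truncated = my_array[1, 0]           # "PINEA"

# astype returns a copy with a wider string dtype.
wider = my_array.astype("U9")
wider[1, 0] = "PINEAPPLE"

print(my_array.dtype, truncated, wider[1, 0])
```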

  • 2020-12-05 23:31

    If you are building the values in a for loop, what works best is to collect them with a list comprehension and convert the result in one step, so that NumPy can size the string dtype to fit the longest element.

    data = ['CAT', 'APPLE', 'CARROT']
    my_array = numpy.array([name for name in data])
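
As a hedged illustration of that idea, the sketch below transforms the values inside the comprehension and then checks the width NumPy inferred. The lowercase input and the call to upper() are assumptions added for the example.

```python
import numpy

data = ["cat", "apple", "carrot"]

# Build every value first, then convert once; the string width is
# inferred from the longest element ("CARROT", 6 characters).
my_array = numpy.array([name.upper() for name in data])

print(my_array.dtype, my_array)
```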
    