How can I slice each element of a numpy array of strings?

面向向阳花 2020-12-01 16:25

NumPy has some very useful string operations, which vectorize the usual Python string operations.

Compared to these operations and to pandas.str, the num

4 Answers
  • 2020-12-01 17:02

    Here's a vectorized approach -

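    def slicer_vectorized(a,start,end):
        # View each string as a row of single characters, then slice the columns
        b = a.view((str,1)).reshape(len(a),-1)[:,start:end]
        # tobytes()/frombuffer replace the deprecated tostring()/fromstring
        return np.frombuffer(b.tobytes(),dtype=(str,end-start))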
    

    Sample run (recorded on Python 2, hence the 'S' dtypes; on Python 3 the same calls return '<U2' and '<U3') -

    In [68]: a = np.array(['hello', 'how', 'are', 'you'])
    
    In [69]: slicer_vectorized(a,1,3)
    Out[69]: 
    array(['el', 'ow', 're', 'ou'], 
          dtype='|S2')
    
    In [70]: slicer_vectorized(a,0,3)
    Out[70]: 
    array(['hel', 'how', 'are', 'you'], 
          dtype='|S3')
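
    The view trick above can be written so that it handles both unicode ('U') and bytes ('S') arrays. The following is only a sketch under stated assumptions: the name slice_fixed_width is made up for illustration, the input is assumed to be a fixed-width string array in native byte order, and the slice bounds are resolved against the padded width of the dtype rather than the length of each individual string.

```python
import numpy as np

def slice_fixed_width(a, start, end):
    """Slice every element of a fixed-width string array (dtype kind 'U' or 'S').

    Hypothetical helper, not part of NumPy.
    """
    a = np.ascontiguousarray(a)
    kind = a.dtype.kind
    if kind not in "US":
        raise TypeError("expected a fixed-width 'U' or 'S' array")
    # itemsize is in bytes; 'U' stores 4 bytes per character, 'S' stores 1.
    width = a.dtype.itemsize // (4 if kind == "U" else 1)
    # Clip the bounds to the padded width, as Python slicing would.
    start, end, _ = slice(start, end).indices(width)
    n = end - start
    if n <= 0:
        return np.zeros(a.shape, dtype=f"{kind}1")  # all empty strings
    # One row per string, one column per character position.
    chars = a.view(f"{kind}1").reshape(a.size, width)[:, start:end]
    # Copy the column slice into contiguous memory, then re-view it as strings.
    return np.ascontiguousarray(chars).view(f"{kind}{n}").reshape(a.shape)

print(slice_fixed_width(np.array(['hello', 'how', 'are', 'you']), 1, 3))
# ['el' 'ow' 're' 'ou']
```

    Because the bounds refer to the padded width, a negative start counts from the end of the longest slot, not from the end of each string, so 'how'[-2:] is not reproduced by this approach.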
    

    Runtime test -

    I tested all the approaches posted by other authors that I could run on my end, along with the vectorized approach from earlier in this post.

    Here are the timings -

    In [53]: # Setup input array
        ...: a = np.array(['hello', 'how', 'are', 'you'])
        ...: a = np.repeat(a,10000)
        ...: 
    
    # @Alberto Garcia-Raboso's answer
    In [54]: %timeit slicer(1, 3)(a)
    10 loops, best of 3: 23.5 ms per loop
    
    # @hpaulj's answer
    In [55]: %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
    100 loops, best of 3: 11.6 ms per loop
    
    # Using a list comprehension
    In [56]: %timeit np.array([i[1:3] for i in a])
    100 loops, best of 3: 12.1 ms per loop
    
    # From this post
    In [57]: %timeit slicer_vectorized(a,1,3)
    1000 loops, best of 3: 787 µs per loop
    
  • 2020-12-01 17:09

    Most, if not all, of the functions in np.char apply existing str methods to each element of the array. They are a little faster than direct iteration (or vectorize), but not drastically so.

    There isn't a string slicer; at least not by that sort of name. Closest is indexing with a slice:

    In [274]: 'astring'[1:3]
    Out[274]: 'st'
    In [275]: 'astring'.__getitem__
    Out[275]: <method-wrapper '__getitem__' of str object at 0xb3866c20>
    In [276]: 'astring'.__getitem__(slice(1,4))
    Out[276]: 'str'
    

    An iterative approach can use frompyfunc (which is also used by vectorize):

    In [277]: a = np.array(['hello', 'how', 'are', 'you'])
    In [278]: np.frompyfunc(lambda x:x[1:3],1,1)(a)
    Out[278]: array(['el', 'ow', 're', 'ou'], dtype=object)
    In [279]: np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
    Out[279]: 
    array(['el', 'ow', 're', 'ou'], 
          dtype='<U2')
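
    The slice object and frompyfunc can be combined without a lambda. A minimal sketch, assuming operator.itemgetter from the standard library as the callable (the variable names are only illustrative):

```python
import operator
import numpy as np

# itemgetter(slice(1, 3)) is a callable that evaluates obj[1:3]
take_1_to_3 = operator.itemgetter(slice(1, 3))
print(take_1_to_3('astring'))  # st

a = np.array(['hello', 'how', 'are', 'you'])
# frompyfunc still calls the getter once per element and returns an object array
sliced = np.frompyfunc(take_1_to_3, 1, 1)(a)
print(sliced.astype('U2'))  # ['el' 'ow' 're' 'ou']
```

    This is still a Python-level loop over the elements, so it should perform about the same as the lambda version.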
    

    I could view it as a single-character array and slice that:

    In [289]: a.view('U1').reshape(4,-1)[:,1:3]
    Out[289]: 
    array([['e', 'l'],
           ['o', 'w'],
           ['r', 'e'],
           ['o', 'u']], 
          dtype='<U1')
    

    Converting it back to 'U2' works once the slice has been copied:

    In [290]: a.view('U1').reshape(4,-1)[:,1:3].copy().view('U2')
    Out[290]: 
    array([['el'],
           ['ow'],
           ['re'],
           ['ou']], 
          dtype='<U2')
    

    The initial view step shows the data buffer as Py3 characters (these would be bytes in the S or Py2 string case):

    In [284]: a.view('U1')
    Out[284]: 
    array(['h', 'e', 'l', 'l', 'o', 'h', 'o', 'w', '', '', 'a', 'r', 'e', '',
           '', 'y', 'o', 'u', '', ''], 
          dtype='<U1')
    

    Picking the 1:3 columns amounts to picking a.view('U1')[[1,2,6,7,11,12,16,17]] and then reshaping and viewing. Without getting into details, I'm not surprised that it requires a copy.
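
    The index arithmetic behind that statement can be checked directly. A small sketch, assuming the same four-word array with a padded width of 5 characters (the variable names are illustrative):

```python
import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])
chars = a.view('U1')                 # 20 single characters, 5 per word
cols = chars.reshape(4, -1)[:, 1:3]  # columns 1 and 2 of every row

# Flat positions of columns 1 and 2 in each 5-character row
flat_idx = (np.arange(4)[:, None] * 5 + np.arange(1, 3)).ravel()
print(flat_idx.tolist())  # [1, 2, 6, 7, 11, 12, 16, 17]

# Fancy indexing with those positions selects the same characters
same = np.array_equal(chars[flat_idx].reshape(4, 2), cols)

# The column slice is a strided view, not one contiguous block of memory;
# copying packs the selected characters next to each other.
packed = cols.copy()
joined = packed.view('U2').ravel()
print(joined)  # ['el' 'ow' 're' 'ou']
```

    The selected characters sit 2 out of every 5 slots apart in the original buffer, which is why they are packed into new memory before being reinterpreted as 2-character strings.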

  • 2020-12-01 17:12

    To solve this, so far I've been transforming the numpy array to a pandas Series and back. It is not a pretty solution, but it works and it works relatively fast.

    import numpy
    import pandas

    a = numpy.array(['hello', 'how', 'are', 'you'])
    pandas.Series(a).str[1:3].values
    # => array(['el', 'ow', 're', 'ou'], dtype=object)
    
  • 2020-12-01 17:15

    Interesting omission... I guess you can always write your own:

    import numpy as np
    
    def slicer(start=None, stop=None, step=1):
        return np.vectorize(lambda x: x[start:stop:step], otypes=[str])
    
    a = np.array(['hello', 'how', 'are', 'you'])
    print(slicer(1, 3)(a))    # => ['el' 'ow' 're' 'ou']
    

    EDIT: Here are some benchmarks using the text of Ulysses by James Joyce. It seems the clear winner is @hpaulj's last strategy. @Divakar gets into the race improving on @hpaulj's last strategy.

    import numpy as np
    import requests
    
    ulysses = requests.get('http://www.gutenberg.org/files/4300/4300-0.txt').text
    a = np.array(ulysses.split())
    
    # Ufunc
    def slicer(start=None, stop=None, step=1):
        return np.vectorize(lambda x: x[start:stop:step], otypes=[str])
    
    %timeit slicer(1, 3)(a)
    # => 1 loop, best of 3: 221 ms per loop
    
    # Non-mutating loop
    def loop1(a):
        out = np.empty(len(a), dtype=object)
        for i, word in enumerate(a):
            out[i] = word[1:3]
        return out
    
    %timeit loop1(a)
    # => 1 loop, best of 3: 262 ms per loop
    
    # Mutating loop
    def loop2(a):
        for i in range(len(a)):
            a[i] = a[i][1:3]
    
    b = a.copy()
    %timeit -n 1 -r 1 loop2(b)
    # => 1 loop, best of 1: 285 ms per loop
    
    # From @hpaulj's answer
    %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
    # => 10 loops, best of 3: 141 ms per loop
    
    %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
    # => 1 loop, best of 3: 170 ms per loop
    
    %timeit a.view('U1').reshape(len(a),-1)[:,1:3].astype(object).sum(axis=1)
    # => 10 loops, best of 3: 60.7 ms per loop
    
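    def slicer_vectorized(a,start,end):
        # a holds Python 3 (unicode) strings, so view it as 'U1' characters, not 'S1' bytes
        b = a.view('U1').reshape(len(a),-1)[:,start:end]
        # tobytes()/frombuffer replace the deprecated tostring()/fromstring
        return np.frombuffer(b.tobytes(),dtype='U'+str(end-start))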
    
    %timeit slicer_vectorized(a,1,3)
    # => The slowest run took 5.34 times longer than the fastest.
    #    This could mean that an intermediate result is being cached.
    #    10 loops, best of 3: 16.8 ms per loop
    