Data loading using arrays in Python

感情败类 2021-01-21 20:17

I have data in the following format in a .txt file:

UserId   WordID
  1       20
  1       30
  1       40
  2       25
  2       16
  3       56
  3       44
  3       12

How can I load this into Python so that, for each UserId, all of its WordID values end up together in one array, e.g. [[20, 30, 40], [25, 16], [56, 44, 12]]?

3 Answers
  • 2021-01-21 20:37

    While you might be more interested in doing it in pandas, depending on your purpose, the NumPy way would be:

    import numpy as np

    userid, wordid = np.loadtxt('/data/file.txt', skiprows=1, unpack=True)
    # example use: collect the WordIDs that belong to each UserId
    mylist = []
    for uid in np.unique(userid):
        mylist.append(wordid[userid == uid])
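
    If a lookup per user is handier than a plain list, a minimal follow-up sketch (using the same userid and wordid arrays from the np.loadtxt call above) is a dict keyed by UserId:

    # build a dict mapping each UserId to its array of WordIDs
    # (np.loadtxt returns floats by default, hence the int() cast on the key)
    word_lists = {int(uid): wordid[userid == uid] for uid in np.unique(userid)}
    # word_lists[1] -> array([20., 30., 40.])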
    
  • 2021-01-21 20:42

    I think you can use groupby with apply and tolist, then take the values:

    # assuming df is the DataFrame read from the .txt file
    print(df.groupby('UserId')['WordID'].apply(lambda x: x.tolist()).values)
    [[20, 30, 40] [25, 16] [56, 44, 12]]
    

    Or apply list (thank you, B.M.):

    print(df.groupby('UserId')['WordID'].apply(list).values)
    [[20, 30, 40] [25, 16] [56, 44, 12]]
    

    Timings:

    df = pd.concat([df]*1000).reset_index(drop=True)
    
    In [358]: %timeit df.groupby('UserId')['WordID'].apply(list).values
    1000 loops, best of 3: 1.22 ms per loop
    
    In [359]: %timeit df.groupby('UserId')['WordID'].apply(lambda x: x.tolist()).values
    1000 loops, best of 3: 1.23 ms per loop
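
    As a side note, the same grouping can also be written with agg instead of apply; a small sketch (assuming the same df), giving the same per-user lists:

    # agg(list) collects each group's WordID values into a list
    print(df.groupby('UserId')['WordID'].agg(list).values)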
    
  • 2021-01-21 20:46

    If you are concerned about performance, NumPy is, as is often the case, faster:

    import numpy as np
    import pandas as pd

    df = pd.read_csv('file.txt', sep=r'\s+')   # whitespace-separated columns

    def numpyway():
        u, v = df.values.T
        ind = np.argsort(u, kind='mergesort')  # stable sort to preserve order
        # split after every position where the sorted UserId changes
        return np.split(v[ind], np.where(np.diff(u[ind]) != 0)[0] + 1)
    
    
    In [12]: %timeit numpyway() # on 8000 lines
    10000 loops, best of 3: 250 µs per loop
    

    If 'UserId' is already sorted, it is about three times faster still.
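
    A minimal sketch of that already-sorted case, assuming the rows in the file are already grouped by UserId so the stable argsort can be dropped:

    def numpyway_sorted():
        # assumes the UserId column is already sorted / grouped
        u, v = df.values.T
        return np.split(v, np.where(np.diff(u) != 0)[0] + 1)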
