I have data in the following format in a .txt file:
UserId WordID
1 20
1 30
1 40
2 25
2 16
3 56
3 44
3 12
While, depending on your purpose, you might prefer to do this in pandas,
the numpy way would be:
import numpy as np

userid, wordid = np.loadtxt('/data/file.txt', skiprows=1, unpack=True)

# example use: collect each user's WordIDs into a list of arrays
mylist = []
for uid in np.unique(userid):
    mylist.append(wordid[userid == uid])
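As a self-contained sketch of the same idea (in-memory arrays standing in for the file, since `/data/file.txt` is just a placeholder path):

```python
import numpy as np

# the same data as in the question, in place of np.loadtxt's output
userid = np.array([1, 1, 1, 2, 2, 3, 3, 3])
wordid = np.array([20, 30, 40, 25, 16, 56, 44, 12])

# np.unique returns the sorted distinct UserIds; boolean masking picks
# out each user's WordIDs
mylist = []
for uid in np.unique(userid):
    mylist.append(wordid[userid == uid])

print([a.tolist() for a in mylist])  # [[20, 30, 40], [25, 16], [56, 44, 12]]
```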
I think you can use groupby with apply(tolist) and take values:
print(df.groupby('UserId')['WordID'].apply(lambda x: x.tolist()).values)
[[20, 30, 40] [25, 16] [56, 44, 12]]
Or apply list (thank you, B.M.):
print(df.groupby('UserId')['WordID'].apply(list).values)
[[20, 30, 40] [25, 16] [56, 44, 12]]
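On newer pandas versions you can also pass list to agg; a minimal runnable sketch, building the frame inline rather than reading the file:

```python
import pandas as pd

df = pd.DataFrame({'UserId': [1, 1, 1, 2, 2, 3, 3, 3],
                   'WordID': [20, 30, 40, 25, 16, 56, 44, 12]})

# groupby sorts the group keys by default, so the lists come out in
# UserId order
grouped = df.groupby('UserId')['WordID'].agg(list)
print(grouped.tolist())   # [[20, 30, 40], [25, 16], [56, 44, 12]]
print(grouped.to_dict())  # {1: [20, 30, 40], 2: [25, 16], 3: [56, 44, 12]}
```

The Series form (`grouped`) keeps the UserId as the index, which is handy if you need to look up a particular user's words later.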
Timings:
df = pd.concat([df]*1000).reset_index(drop=True)
In [358]: %timeit df.groupby('UserId')['WordID'].apply(list).values
1000 loops, best of 3: 1.22 ms per loop
In [359]: %timeit df.groupby('UserId')['WordID'].apply(lambda x: x.tolist()).values
1000 loops, best of 3: 1.23 ms per loop
If you are concerned with performance, numpy is, as is often the case, faster:
import numpy as np
import pandas as pd

df = pd.read_csv('file.txt', sep=r'\s+')

def numpyway():
    u, v = df.values.T
    ind = np.argsort(u, kind='mergesort')  # stable sort to preserve order
    # split v wherever the sorted UserId changes
    return np.split(v[ind], np.where(np.diff(u[ind]))[0] + 1)
In [12]: %timeit numpyway() # on 8000 lines
10000 loops, best of 3: 250 µs per loop
If 'UserId' is already sorted, it is about three times faster still.
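For the sorted case, the argsort can be skipped entirely; a sketch with the question's data inlined:

```python
import numpy as np

# UserId is already sorted, so no argsort is needed
u = np.array([1, 1, 1, 2, 2, 3, 3, 3])
v = np.array([20, 30, 40, 25, 16, 56, 44, 12])

# np.diff is nonzero exactly where the UserId changes; split there
groups = np.split(v, np.where(np.diff(u))[0] + 1)
print([g.tolist() for g in groups])  # [[20, 30, 40], [25, 16], [56, 44, 12]]
```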