I have a very large dataset (roughly 200000x400); however, I have it filtered so that only a few hundred values remain, the rest being NaN. I would like to create a list of the indexes of those remaining values.
Assuming that your column names are of int dtype:
In [73]: df
Out[73]:
     0    1     2
0  NaN  NaN  1.20
1  NaN  NaN   NaN
2  NaN  1.1   NaN
3  NaN  NaN   NaN
4  1.4  NaN  1.01
In [74]: df.columns.dtype
Out[74]: dtype('int64')
In [75]: df.stack().reset_index().drop(0, axis=1).apply(tuple, axis=1).tolist()
Out[75]: [(0, 2), (2, 1), (4, 0), (4, 2)]
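As a side note, stack() already drops the NaN cells and labels the survivors with a (row, column) MultiIndex, so the tuples can be read straight off that index without the reset_index/apply round trip. A minimal self-contained sketch (the extra dropna() is an assumption to stay correct on newer pandas, where stack() no longer drops NaN by default):

import numpy as np
import pandas as pd

# Rebuild the example frame with int column labels.
df = pd.DataFrame(
    [[np.nan, np.nan, 1.20],
     [np.nan, np.nan, np.nan],
     [np.nan, 1.1,    np.nan],
     [np.nan, np.nan, np.nan],
     [1.4,    np.nan, 1.01]]
)

# The MultiIndex of the stacked Series is already the (row, col) list;
# dropna() keeps this correct on pandas versions where stack() retains NaN.
coords = df.stack().dropna().index.tolist()
print(coords)  # [(0, 2), (2, 1), (4, 0), (4, 2)]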
If your column names are of object dtype:
In [81]: df.columns.dtype
Out[81]: dtype('O')
In [83]: df.stack().reset_index().astype(int).drop(0, axis=1).apply(tuple, axis=1).tolist()
Out[83]: [(0, 2), (2, 1), (4, 0), (4, 2)]
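Note that the astype(int) step only works because the object labels here happen to be digit strings ('0', '1', '2'). If the column names are arbitrary strings and you still want positional column indices, one option (a sketch, not from the original answer; the small frame below is an invented example) is to translate labels to positions with Index.get_indexer:

import numpy as np
import pandas as pd

# Invented example frame with non-numeric column labels.
df = pd.DataFrame({"a": [np.nan, 1.4], "b": [1.2, np.nan]})

stacked = df.stack().dropna()             # non-NaN cells, (row, label) index
rows = stacked.index.get_level_values(0)  # row positions
cols = df.columns.get_indexer(stacked.index.get_level_values(1))  # label -> position
coords = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(coords)  # [(0, 1), (1, 0)]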
Timing for a 50K-row DataFrame:
In [89]: df = pd.concat([df] * 10**4, ignore_index=True)
In [90]: df.shape
Out[90]: (50000, 3)
In [91]: %timeit list(map(tuple, np.argwhere(~np.isnan(df.values))))
10 loops, best of 3: 144 ms per loop
In [92]: %timeit df.stack().reset_index().drop(0, axis=1).apply(tuple, axis=1).tolist()
1 loop, best of 3: 1.67 s per loop
Conclusion: Nickil Maveli's solution is roughly 12 times faster for this test DataFrame.
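For completeness, a self-contained script to reproduce the comparison outside IPython. Timings will vary with machine and library versions; to_numpy() is used as the modern spelling of .values, and on newer pandas (where stack() keeps NaN) a .dropna() would need to be added after stack():

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, np.nan, 1.20],
     [np.nan, np.nan, np.nan],
     [np.nan, 1.1,    np.nan],
     [np.nan, np.nan, np.nan],
     [1.4,    np.nan, 1.01]]
)
big = pd.concat([df] * 10**4, ignore_index=True)  # 50000 x 3

def argwhere_way():
    # Positional (row, col) pairs of the non-NaN cells via NumPy.
    return list(map(tuple, np.argwhere(~np.isnan(big.to_numpy()))))

def stack_way():
    # The stack/reset_index/apply chain timed above.
    return big.stack().reset_index().drop(0, axis=1).apply(tuple, axis=1).tolist()

print("argwhere:", min(timeit.repeat(argwhere_way, number=10, repeat=3)), "s per 10 runs")
print("stack:   ", min(timeit.repeat(stack_way, number=10, repeat=3)), "s per 10 runs")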