I have a very large dataset (roughly 200000x400). After filtering, only a few hundred values remain and the rest are NaN. I would like to create a list of the (row, column) indexes of those remaining non-NaN values.
Convert the dataframe to its equivalent NumPy array representation and build a boolean mask of the NaNs with numpy.isnan. Negate the mask (so that True marks the non-null cells) and pass it to numpy.argwhere to get their (row, column) indices. Since the required output is a list of tuples, you can then map tuple over the rows of the resulting array.
>>> list(map(tuple, np.argwhere(~np.isnan(df.values))))
[(0, 2), (2, 1), (4, 0), (4, 2)]
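For reference, a self-contained sketch of the same approach; the toy DataFrame below is an assumed stand-in that mirrors the frame shown further down:

import numpy as np
import pandas as pd

# Small frame with a handful of non-NaN cells (assumed stand-in for the filtered data)
df = pd.DataFrame([[np.nan, np.nan, 1.20],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 1.1,    np.nan],
                   [np.nan, np.nan, np.nan],
                   [1.4,    np.nan, 1.01]])

# Boolean mask of non-NaN cells, then the (row, col) positions of the True entries
positions = np.argwhere(~np.isnan(df.values))

# argwhere returns a 2D array; turn each row into a tuple
result = list(map(tuple, positions))
print(result)  # [(0, 2), (2, 1), (4, 0), (4, 2)]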
Assuming that your column names are of int dtype:
In [73]: df
Out[73]:
     0    1     2
0  NaN  NaN  1.20
1  NaN  NaN   NaN
2  NaN  1.1   NaN
3  NaN  NaN   NaN
4  1.4  NaN  1.01
In [74]: df.columns.dtype
Out[74]: dtype('int64')
In [75]: df.stack().reset_index().drop(0, axis=1).apply(tuple, axis=1).tolist()
Out[75]: [(0, 2), (2, 1), (4, 0), (4, 2)]
If your column names are of object dtype:
In [81]: df.columns.dtype
Out[81]: dtype('O')
In [83]: df.stack().reset_index().astype(int).drop(0, axis=1).apply(tuple, axis=1).tolist()
Out[83]: [(0, 2), (2, 1), (4, 0), (4, 2)]
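As a side note, since stack() already drops the NaNs, the MultiIndex of the stacked Series holds the (row, column) pairs directly; a shorter equivalent (a sketch, not part of the original answer) would be:

import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan, 1.20],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 1.1,    np.nan],
                   [np.nan, np.nan, np.nan],
                   [1.4,    np.nan, 1.01]])

# stack() drops NaNs by default; its MultiIndex is already (row label, column label)
pairs = df.stack().index.tolist()
print(pairs)  # [(0, 2), (2, 1), (4, 0), (4, 2)]

# With object-dtype column names the second element is a string, e.g. (0, '2'),
# so cast it if integer pairs are needed:
# pairs = [(r, int(c)) for r, c in df.stack().index]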
Timing for a 50K-row DF:
In [89]: df = pd.concat([df] * 10**4, ignore_index=True)
In [90]: df.shape
Out[90]: (50000, 3)
In [91]: %timeit list(map(tuple, np.argwhere(~np.isnan(df.values))))
10 loops, best of 3: 144 ms per loop
In [92]: %timeit df.stack().reset_index().drop(0, axis=1).apply(tuple, axis=1).tolist()
1 loop, best of 3: 1.67 s per loop
Conclusion: Nickil Maveli's solution is roughly 12 times faster for this test DF.
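For anyone wanting to reproduce the timings outside IPython, here is a rough sketch using the timeit module; the absolute numbers will of course vary with machine and pandas/NumPy versions:

import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan, 1.20],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 1.1,    np.nan],
                   [np.nan, np.nan, np.nan],
                   [1.4,    np.nan, 1.01]])
df = pd.concat([df] * 10**4, ignore_index=True)   # 50,000 rows, as above

numpy_way = lambda: list(map(tuple, np.argwhere(~np.isnan(df.values))))
stack_way = lambda: (df.stack().reset_index()
                       .drop(0, axis=1)
                       .apply(tuple, axis=1).tolist())

print('argwhere:', timeit.timeit(numpy_way, number=10) / 10, 's per loop')
print('stack:   ', timeit.timeit(stack_way, number=3) / 3, 's per loop')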