I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.
I am using pandas 0.11-dev and Python.
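A minimal sketch of the kind of selection that triggers this (the file name `data.h5` and table key `df` are placeholders, and the where-string syntax shown is the modern form; older pandas versions used `Term` objects instead):

```python
import numpy as np
import pandas as pd

# placeholder file name and table key
with pd.HDFStore("data.h5") as store:
    nrows = store.get_storer("df").nrows          # total rows in the table
    rows = np.random.randint(0, nrows, size=50)   # 50 random row numbers

    # a single select with all 50 row numbers OR'd together
    # -> RAM usage explodes
    sample = store.select("df", where=f"index in {sorted(rows.tolist())}")
```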
This is a known issue; see the reference here: https://github.com/pydata/pandas/pull/2755
Essentially the query is turned into a numexpr expression for evaluation. There is an issue where I can't pass a lot of `or` conditions to numexpr (it's dependent on the total length of the generated expression).
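A rough illustration of the mechanism (this is not the exact code PyTables generates, just the shape of it): a row-list query is expanded into chained equality terms, and it is the length of that generated string that hits the limit.

```python
import numexpr as ne
import numpy as np

index = np.arange(1_000_000)

# "index in [3, 17, 42]" is expanded into chained equality terms:
mask = ne.evaluate("(index == 3) | (index == 17) | (index == 42)")

# With ~50 random rows the generated expression grows to ~50 such terms;
# its total length is what numexpr chokes on.
long_expr = " | ".join(f"(index == {i})" for i in np.random.randint(0, 1_000_000, 50))
print(len(long_expr))
```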
So I just limit the expression that we pass to numexpr: if it exceeds a certain number of `or` conditions, the query is done as a filter rather than an in-kernel selection. Basically this means the table is read and then reindexed.
This is on my enhancements list: https://github.com/pydata/pandas/issues/2391 (17).
As a workaround, just split your query up into multiple smaller ones and concat the results. This should be much faster, and use a constant amount of memory.
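A sketch of that workaround, assuming a table stored under the key `df` in `data.h5` (both placeholders) and the modern where-string syntax:

```python
import numpy as np
import pandas as pd

CHUNK = 10  # keep each generated expression well under the limit

with pd.HDFStore("data.h5") as store:
    nrows = store.get_storer("df").nrows
    rows = np.sort(np.random.randint(0, nrows, size=50))

    # issue several small queries instead of one 50-term expression,
    # then stitch the pieces back together
    pieces = [
        store.select("df", where=f"index in {chunk.tolist()}")
        for chunk in np.array_split(rows, len(rows) // CHUNK)
    ]
    sample = pd.concat(pieces)
```

Sorting the row numbers first also keeps each chunk's reads roughly sequential on disk, which helps I/O.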