HDFStore: table.select and RAM usage

粉色の甜心 2021-01-01 01:29

I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.

I am using pandas 0-11-dev, pyth

1 Answer
  • 2021-01-01 02:02

    This is a known issue, see the reference here: https://github.com/pydata/pandas/pull/2755

    Essentially, the query is turned into a numexpr expression for evaluation. The issue is that I can't pass a lot of "or" conditions to numexpr (it depends on the total length of the generated expression).

    So I just limit the expression that we pass to numexpr. If it exceeds a certain number of "or" conditions, the query is done as a filter rather than as an in-kernel selection. Basically, this means the table is read into memory and then reindexed.

    This is on my enhancements list: https://github.com/pydata/pandas/issues/2391 (17).

    As a workaround, split your query into several smaller ones and concat the results. This should be much faster and use a constant amount of memory.
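
A minimal sketch of that workaround, under some assumptions: the helper names `chunked` and `select_in_chunks`, the `key` argument and the default `chunk_size` of 30 are all hypothetical, and the chunk size only needs to stay below whatever limit your pandas version applies before it falls back to a filter. The where string `"index == chunk"` relies on recent pandas versions resolving local variables inside where strings; the 0.11-era API built the condition with Term objects instead, so adapt that line to your version.

```python
import pandas as pd


def chunked(values, size):
    """Yield consecutive slices of `values`, each with at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def select_in_chunks(store, key, ids, chunk_size=30):
    """Select rows whose index is in `ids`, one small query per chunk.

    `store` is expected to be an open pandas HDFStore holding a table-format
    object under `key`. chunk_size=30 is an assumed safe value, not a
    documented constant.
    """
    ids = list(ids)
    if not ids:
        raise ValueError("ids must contain at least one value")

    pieces = []
    for chunk in chunked(ids, chunk_size):
        # `chunk` is referenced by name inside the where string; pandas
        # looks it up in this function's local scope when parsing the query.
        pieces.append(store.select(key, where="index == chunk"))

    # Each piece holds only the matching rows, so the full table is never
    # materialized at once.
    return pd.concat(pieces)
```

With an open store this could be called as `select_in_chunks(store, "df", random_ids)`, where `"df"` and `random_ids` stand in for your own key and list of index values. Rows come back grouped by chunk, so sort the result afterwards if the order matters.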
