Using predicates to filter rows from pyarrow.parquet.ParquetDataset

醉话见心 2021-02-09 07:39

I have a parquet dataset stored on s3, and I would like to query specific rows from the dataset. I was able to do that using petastorm but now I want to do that usi

4 Answers

旧巷少年郎 (original poster)

2021-02-09 07:48

Currently, the filters functionality is only implemented at the file level, not yet at the row level.

So if you have a dataset as a collection of multiple, partitioned parquet files in a nested hierarchy (the type of partitioned datasets described here: https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files), you can use the filters argument to only read a subset of the files.
However, you can't yet use it to read only a subset of the row groups within a single file (see https://issues.apache.org/jira/browse/ARROW-1796).

It would be nice, though, to get an error message when you specify such an invalid filter. I opened an issue for that: https://issues.apache.org/jira/browse/ARROW-5572
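
To make the file-level behaviour concrete, here is a minimal pure-Python sketch of partition pruning over hive-style paths. It is not pyarrow's actual implementation: the helpers `partition_keys` and `prune_files`, the example paths, and the `price` column are all hypothetical, and only the `=` and `!=` operators are handled. It assumes filters are given as (column, op, value) tuples and that a filter on a non-partition column is silently ignored, which matches the missing error message described above.

```python
from pathlib import PurePosixPath


def partition_keys(path):
    """Parse hive-style 'key=value' directory segments from a file path."""
    keys = {}
    # Skip the last part, which is the file name itself.
    for segment in PurePosixPath(path).parts[:-1]:
        if "=" in segment:
            key, value = segment.split("=", 1)
            keys[key] = value
    return keys


def prune_files(paths, filters):
    """Keep files whose partition keys satisfy every (column, op, value) filter.

    A filter on a column that is not a partition key cannot be checked from
    the path alone, so it is skipped and the file is kept (assumption).
    """
    ops = {"=": lambda a, b: a == b, "!=": lambda a, b: a != b}
    kept = []
    for path in paths:
        keys = partition_keys(path)
        if all(ops[op](keys[col], str(val))
               for col, op, val in filters if col in keys):
            kept.append(path)
    return kept


# Hypothetical partitioned layout: dataset/year=.../month=.../file
files = [
    "dataset/year=2019/month=1/part-0.parquet",
    "dataset/year=2019/month=2/part-0.parquet",
    "dataset/year=2020/month=1/part-0.parquet",
]

# Filtering on a partition key skips whole files.
by_year = prune_files(files, [("year", "=", 2019)])

# Filtering on a regular column prunes nothing: every file is still read.
by_value = prune_files(files, [("price", "=", 10)])
```

In this sketch `by_year` keeps only the two `year=2019` files, while `by_value` keeps all three, so any row-level selection on `price` would still have to be done after reading the data.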
