I have a Parquet dataset stored on S3, and I would like to query specific rows from the dataset. I was able to do that using petastorm, but now I want to do that using only pyarrow.
For anyone getting here from Google: you can now filter on rows in PyArrow when reading a Parquet file, regardless of whether you read it via pandas or pyarrow.parquet.
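For example, something along these lines should work (the bucket path and the event_date column are just placeholders; reading from S3 also requires s3fs/fsspec to be installed):

```python
import pandas as pd
import pyarrow.parquet as pq

# Keep only rows matching the predicate while reading.
# "s3://my-bucket/my-dataset/" and "event_date" are placeholder names.
table = pq.read_table(
    "s3://my-bucket/my-dataset/",
    filters=[("event_date", "=", "2021-01-01")],
)
df = table.to_pandas()

# The same filter passed through pandas (pyarrow engine):
df = pd.read_parquet(
    "s3://my-bucket/my-dataset/",
    engine="pyarrow",
    filters=[("event_date", "=", "2021-01-01")],
)
```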
From the documentation:
filters (List[Tuple] or List[List[Tuple]] or None (default)) – Rows which do not match the filter predicate will be removed from scanned data. Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows. If use_legacy_dataset is True, filters can only reference partition keys and only a hive-style directory structure is supported. When setting use_legacy_dataset to False, also within-file level filtering and different partitioning schemes are supported.
Predicates are expressed in disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the most outer list combines these filters as a disjunction (OR).
Predicates may also be passed as List[Tuple]. This form is interpreted as a single conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.
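To illustrate the DNF form, here is a small sketch with placeholder column names x and y:

```python
import pyarrow.parquet as pq

# List[List[Tuple]]: the outer list is OR-ed, each inner list is AND-ed.
# Keeps rows where (x == 0 AND y > 5) OR (x == 1).
dnf_filters = [
    [("x", "=", 0), ("y", ">", 5)],
    [("x", "=", 1)],
]
table = pq.read_table("s3://my-bucket/my-dataset/", filters=dnf_filters)

# The shorter List[Tuple] form is a single conjunction (AND only):
# keeps rows where x == 0 AND y > 5.
table = pq.read_table(
    "s3://my-bucket/my-dataset/",
    filters=[("x", "=", 0), ("y", ">", 5)],
)
```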