I have a Parquet dataset stored on S3, and I would like to query specific rows from it. I was able to do that using petastorm,
but now I want to do that using only pyarrow.
For Python 3.6+, AWS has a library called aws-data-wrangler (imported as awswrangler) that helps with the integration between Pandas, S3, and Parquet, and it allows you to filter on partitioned S3 keys.
To install it:
pip install awswrangler
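Since the read API changed at 1.0.0, it can help to pin the version range that matches whichever example below you use, for instance:

pip install "awswrangler>=1.0.0"

or, to stay on the old API:

pip install "awswrangler<1.0.0"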
To reduce the amount of data you read, you can filter rows based on the partition columns of your Parquet dataset stored on S3.
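For context, a dataset partitioned on event_name is typically laid out as Hive-style key prefixes on S3, something like this (hypothetical keys, reusing the example path below):

s3://my-bucket/my/path/to/parquet-file.parquet/event_name=SomeEvent/part-00000.parquet
s3://my-bucket/my/path/to/parquet-file.parquet/event_name=OtherEvent/part-00000.parquet

Filtering on event_name then lets the reader skip entire prefixes instead of downloading every file.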
To filter the rows where the partition column event_name has the value "SomeEvent":
For awswrangler < 1.0.0:
import awswrangler as wr

# Reads only the matching partitions instead of the full dataset
df = wr.pandas.read_parquet(
    path="s3://my-bucket/my/path/to/parquet-file.parquet",
    columns=["event_name"],
    filters=[('event_name', '=', 'SomeEvent')]
)
For awswrangler >= 1.0.0:
import awswrangler as wr

# Same pyarrow-style filter, but the function moved to the s3 module
df = wr.s3.read_parquet(
    path="s3://my-bucket/my/path/to/parquet-file.parquet",
    columns=["event_name"],
    filters=[('event_name', '=', 'SomeEvent')]
)
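The filters argument follows pyarrow's disjunctive-normal-form convention: tuples inside an inner list are ANDed together, and the outer lists are ORed. A hypothetical sketch combining two partition columns (year is an assumed extra partition column, not part of the original example):

filters = [
    [('event_name', '=', 'SomeEvent'), ('year', '=', '2020')],  # AND within this list
    [('event_name', '=', 'OtherEvent')],                        # OR with the list above
]

Note that depending on your awswrangler version you may also need to pass dataset=True so the path is treated as a partitioned dataset and the partition columns are parsed from the key names; check the docs for your installed version, since the filtering API has changed over time (later releases use a partition_filter callable instead of filters).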