Using predicates to filter rows from pyarrow.parquet.ParquetDataset

Backend · Open · 4 answers · 1700 views
醉话见心 · asked 2021-02-09 07:39

I have a parquet dataset stored on S3, and I would like to query specific rows from it. I was able to do that using petastorm, but now I want to do it using only pyarrow.

4 answers
  •  温柔的废话
    2021-02-09 07:52

    For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between Pandas, S3, and Parquet, and it allows you to filter on partitioned S3 keys.

    To install:

    pip install awswrangler
    

    To reduce the amount of data you read, you can filter rows based on the partitioned columns of your parquet dataset stored on S3. To filter the rows where the partition column event_name has the value "SomeEvent":

    For awswrangler < 1.0.0:

    import awswrangler as wr
    
    df = wr.pandas.read_parquet(
             path="s3://my-bucket/my/path/to/parquet-file.parquet",
             columns=["event_name"], 
             filters=[('event_name', '=', 'SomeEvent')]
    )
    

    For awswrangler >= 1.0.0:

    import awswrangler as wr
    
    df = wr.s3.read_parquet(
             path="s3://my-bucket/my/path/to/parquet-file.parquet",
             columns=["event_name"], 
             filters=[('event_name', '=', 'SomeEvent')]
    )
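
    Since the question asks for pyarrow itself, the same partition filter can be expressed directly with pyarrow.parquet.ParquetDataset and its filters argument. A minimal local sketch, where the temporary directory and sample DataFrame are hypothetical stand-ins for the S3 dataset:

    ```python
    import tempfile

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Stand-in for the s3://... dataset: a small dataset on local disk,
    # partitioned by event_name (one directory per value)
    root = tempfile.mkdtemp()
    df = pd.DataFrame({"event_name": ["SomeEvent", "OtherEvent"],
                       "value": [1, 2]})
    pq.write_to_dataset(pa.Table.from_pandas(df), root_path=root,
                        partition_cols=["event_name"])

    # filters prunes the matching partitions before row data is read
    dataset = pq.ParquetDataset(root,
                                filters=[("event_name", "=", "SomeEvent")])
    result = dataset.read().to_pandas()
    print(result)
    ```

    For S3 you would pass the s3:// URI instead of a local path (pyarrow resolves it through its S3 filesystem support). Note that on older pyarrow versions filters only prunes whole partition directories; newer versions also push the predicate down to row groups within files.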
    
