I have a Parquet dataset stored on S3, and I would like to query specific rows from it. I was able to do that using petastorm,
but now I want to do that using only pyarrow.
For Python 3.6+, AWS has a library called aws-data-wrangler (imported as awswrangler) that helps with the integration between Pandas, S3, and Parquet, and it allows you to filter on partitioned S3 keys.
To install it:
pip install awswrangler
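Since the read API changed at 1.0.0, it can help to pin the version range that matches whichever example below you use, for instance:

pip install "awswrangler>=1.0.0"

or, to stay on the old API:

pip install "awswrangler<1.0.0"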
To reduce the amount of data you read, you can filter rows based on the partition columns of your Parquet dataset stored on S3.
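For context, a dataset partitioned on event_name is typically laid out as Hive-style key prefixes on S3, something like this (hypothetical keys, reusing the example path below):

s3://my-bucket/my/path/to/parquet-file.parquet/event_name=SomeEvent/part-00000.parquet
s3://my-bucket/my/path/to/parquet-file.parquet/event_name=OtherEvent/part-00000.parquet

Filtering on event_name then lets the reader skip entire prefixes instead of downloading every file.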
To filter the rows where the partition column event_name has the value "SomeEvent":
For awswrangler < 1.0.0:
import awswrangler as wr

# Reads only the matching partitions instead of the full dataset
df = wr.pandas.read_parquet(
    path="s3://my-bucket/my/path/to/parquet-file.parquet",
    columns=["event_name"],
    filters=[('event_name', '=', 'SomeEvent')]
)
For awswrangler >= 1.0.0:
import awswrangler as wr

# Same pyarrow-style filter, but the function moved to the s3 module
df = wr.s3.read_parquet(
    path="s3://my-bucket/my/path/to/parquet-file.parquet",
    columns=["event_name"],
    filters=[('event_name', '=', 'SomeEvent')]
)
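The filters argument follows pyarrow's disjunctive-normal-form convention: tuples inside an inner list are ANDed together, and the outer lists are ORed. A hypothetical sketch combining two partition columns (year is an assumed extra partition column, not part of the original example):

filters = [
    [('event_name', '=', 'SomeEvent'), ('year', '=', '2020')],  # AND within this list
    [('event_name', '=', 'OtherEvent')],                        # OR with the list above
]

Note that depending on your awswrangler version you may also need to pass dataset=True so the path is treated as a partitioned dataset and the partition columns are parsed from the key names; check the docs for your installed version, since the filtering API has changed over time (later releases use a partition_filter callable instead of filters).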