How to read partitioned parquet files from S3 using pyarrow in python

时光说笑 2020-12-07 21:03

I am looking for ways to read data from multiple partitioned directories on S3 using Python.

data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parq

5 Answers
  • 2020-12-07 21:30

    For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between Pandas, S3, and Parquet.

    To install, run:

    pip install awswrangler
    

    To read partitioned parquet from S3 using awswrangler 1.x.x and above:

    import awswrangler as wr
    df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/", dataset=True)
    

    By setting dataset=True, awswrangler expects partitioned parquet files and will read all the individual parquet files from your partitions below the S3 key you specify in the path.
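
    If your awswrangler release supports it, you can also prune partitions at read time with the partition_filter argument of wr.s3.read_parquet. The snippet below is a hedged sketch, not part of the original answer; it assumes the serial_number/cur_date layout from the question and that your awswrangler version accepts partition_filter:

    import awswrangler as wr

    # Hedged sketch: partition_filter receives each partition's values as a
    # dict of strings and should return True for partitions you want to read.
    df = wr.s3.read_parquet(
        path="s3://my_bucket/path/to/data_folder/",
        dataset=True,
        partition_filter=lambda p: p["serial_number"] == "1",
    )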

  • 2020-12-07 21:31

    I managed to get this working with the latest release of fastparquet & s3fs. Below is the code:

    import s3fs
    import fastparquet as fp

    fs = s3fs.S3FileSystem()

    # mybucket/data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
    s3_path = "mybucket/data_folder/*/*/*.parquet"
    all_paths_from_s3 = fs.glob(path=s3_path)

    # use s3fs as the filesystem
    myopen = fs.open
    fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=myopen)

    # convert to a pandas dataframe
    df = fp_obj.to_pandas()
    

    Credits to Martin for pointing me in the right direction via our conversation.

    NB: This would be slower than using pyarrow, based on the benchmark. I will update my answer once s3fs support is implemented in pyarrow via ARROW-1213.

    I did a quick benchmark on individual iterations with pyarrow and on the list of files sent as a glob to fastparquet. fastparquet with s3fs is faster than pyarrow plus my hackish code, but I reckon pyarrow + s3fs will be faster once support is implemented.

    The code and benchmarks are below:

    >>> # assumes pq (pyarrow.parquet), pd (pandas), fs (s3fs), list_parquet_files,
    >>> # date_partition, dma_partition and list_ are already defined in the session
    >>> def test_pq():
    ...     for current_file in list_parquet_files:
    ...         f = fs.open(current_file)
    ...         df = pq.read_table(f).to_pandas()
    ...         # extract the serial_number & cur_date values so we can add them to the dataframe
    ...         # probably not the best way to split :)
    ...         elements_list = current_file.split('/')
    ...         for item in elements_list:
    ...             if item.find(date_partition) != -1:
    ...                 current_date = item.split('=')[1]
    ...             elif item.find(dma_partition) != -1:
    ...                 current_dma = item.split('=')[1]
    ...         df['serial_number'] = current_dma
    ...         df['cur_date'] = current_date
    ...         list_.append(df)
    ...     frame = pd.concat(list_)
    ...
    >>> timeit.timeit('test_pq()',number =10,globals=globals())
    12.078817503992468
    
    >>> def test_fp():
    ...     fp_obj = fp.ParquetFile(all_paths_from_s3,open_with=myopen)
    ...     df = fp_obj.to_pandas()
    
    >>> timeit.timeit('test_fp()',number =10,globals=globals())
    2.961556333000317
    

    Update 2019

    PRs and issues such as ARROW-2038 and fastparquet PR #182 have since been resolved.

    Read parquet files using Pyarrow

    # pip install pyarrow
    # pip install s3fs
    
    >>> import s3fs
    >>> import pyarrow.parquet as pq
    >>> fs = s3fs.S3FileSystem()
    
    >>> bucket = 'your-bucket-name'
    >>> path = 'directory_name'  # if it's a directory, omit the trailing /
    >>> bucket_uri = f's3://{bucket}/{path}'
    >>> bucket_uri
    's3://your-bucket-name/directory_name'
    
    >>> dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
    >>> table = dataset.read()
    >>> df = table.to_pandas() 
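
    A side note not in the original answer: ParquetDataset also takes a filters argument, which can prune hive-style partitions such as serial_number/cur_date before any row groups are downloaded. A minimal, hedged sketch; behaviour and the expected value type depend on your pyarrow version and on how the partition dtype is inferred:

    >>> # hedged sketch: only read partitions where serial_number == 1
    >>> dataset = pq.ParquetDataset(
    ...     bucket_uri,
    ...     filesystem=fs,
    ...     filters=[('serial_number', '=', '1')]
    ... )
    >>> df = dataset.read().to_pandas()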
    

    Read parquet files using fastparquet

    # pip install s3fs
    # pip install fastparquet
    
    >>> import s3fs
    >>> import fastparquet as fp
    
    >>> fs = s3fs.S3FileSystem()
    >>> myopen = fs.open
    
    >>> bucket = 'your-bucket-name'
    >>> path = 'directory_name'
    >>> root_dir_path = f'{bucket}/{path}'
    # the first two wildcards represent the 1st and 2nd partition columns of your data, and so forth
    >>> s3_path = f"{root_dir_path}/*/*/*.parquet"
    >>> all_paths_from_s3 = fs.glob(path=s3_path)
    
    >>> fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=myopen, root=root_dir_path)
    >>> df = fp_obj.to_pandas()
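
    A small, hedged addition that is not in the original answer: ParquetFile.to_pandas also accepts a columns argument, so you can pull back only the columns you need. The names below are placeholders:

    >>> # hedged sketch: read only selected columns (placeholder names)
    >>> df = fp_obj.to_pandas(columns=['col_a', 'col_b'])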
    

    Quick benchmarks

    This is probably not the best way to benchmark it; please read the blog post for a thorough benchmark.

    #pyarrow
    >>> import timeit
    >>> def test_pq():
    ...     dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
    ...     table = dataset.read()
    ...     df = table.to_pandas()
    ...
    >>> timeit.timeit('test_pq()',number =10,globals=globals())
    1.2677053569998407
    
    #fastparquet
    >>> def test_fp():
    ...     fp_obj = fp.ParquetFile(all_paths_from_s3,open_with=myopen, root=root_dir_path)
    ...     df = fp_obj.to_pandas()
    
    >>> timeit.timeit('test_fp()',number =10,globals=globals())
    2.931876824000028
    

    Further reading regarding Pyarrow's speed

    References:

    • fastparquet
    • s3fs
    • pyarrow
    • pyarrow code, based on the discussion and the documentation
    • fastparquet code, based on the discussions in PR-182 and the documentation
  • 2020-12-07 21:35

    This issue was resolved in this pull request in 2017.

    For those who want to read parquet from S3 using only pyarrow, here is an example:

    import s3fs
    import pyarrow.parquet as pq
    from pyarrow.filesystem import S3FSWrapper
    
    fs = s3fs.S3FileSystem()
    bucket = "your-bucket"
    path = "your-path"
    
    # Python 3.6 or later
    p_dataset = pq.ParquetDataset(
        f"s3://{bucket}/{path}",
        filesystem=fs
    )
    df = p_dataset.read().to_pandas()
    
    # Pre-python 3.6
    p_dataset = pq.ParquetDataset(
        "s3://{0}/{1}".format(bucket, path),
        filesystem=fs
    )
    df = p_dataset.read().to_pandas()
    
  • 2020-12-07 21:38

    For those of you who want to read in only parts of a partitioned parquet file, pyarrow accepts a list of keys as well as a partial directory path to read in all parts of the partition. This method is especially useful for organizations that have partitioned their parquet datasets in a meaningful way, for example by year or country, allowing users to specify which parts of the file they need. This will reduce costs in the long run, as AWS charges per byte when reading in datasets.

    # Read in user specified partitions of a partitioned parquet file 
    
    import s3fs
    import pyarrow.parquet as pq
    s3 = s3fs.S3FileSystem()
    
    keys = ['keyname/blah_blah/part-00000-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet',
            'keyname/blah_blah/part-00001-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet',
            'keyname/blah_blah/part-00002-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet',
            'keyname/blah_blah/part-00003-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet']
    
    bucket = 'bucket_yada_yada_yada'
    
    # Add s3 prefix and bucket name to all keys in list
    parq_list=[]
    for key in keys:
        parq_list.append('s3://'+bucket+'/'+key)
    
    # Create your dataframe
    df = pq.ParquetDataset(parq_list, filesystem=s3).read_pandas(columns=['Var1','Var2','Var3']).to_pandas()
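
    If you would rather not hard-code the part-file names, a hedged variant of the same idea is to build the key list with s3fs's glob; the bucket, prefix, and column names below simply reuse the placeholders from the example above:

    import s3fs
    import pyarrow.parquet as pq

    s3 = s3fs.S3FileSystem()
    bucket = 'bucket_yada_yada_yada'

    # s3.glob returns paths like 'bucket/keyname/...' without the 's3://' scheme,
    # so it is added back before handing the list to ParquetDataset
    parq_list = ['s3://' + p for p in s3.glob(f'{bucket}/keyname/blah_blah/*.snappy.parquet')]

    df = pq.ParquetDataset(parq_list, filesystem=s3).read_pandas(columns=['Var1', 'Var2', 'Var3']).to_pandas()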
    
  • 2020-12-07 21:38

    Let's discuss in https://issues.apache.org/jira/browse/ARROW-1213 and https://issues.apache.org/jira/browse/ARROW-1119. We must add some code to allow pyarrow to recognize the s3fs filesystem and add a shim / compatibility class to conform S3FS's slightly different filesystem API to pyarrow's.
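
    Purely to illustrate what such a shim means (the class below is hypothetical and is not pyarrow's actual wrapper): it is a thin adapter that maps the handful of filesystem calls pyarrow needs onto s3fs methods.

    # Illustrative sketch only; names are hypothetical, not pyarrow's real API.
    class S3FSShim:
        def __init__(self, s3fs_filesystem):
            self.fs = s3fs_filesystem

        def open(self, path, mode='rb'):
            # delegate file opening straight to s3fs
            return self.fs.open(path, mode)

        def ls(self, path):
            # list objects under a prefix
            return self.fs.ls(path)

        def isdir(self, path):
            # S3 has no real directories; treat any non-empty prefix as one
            return bool(self.fs.ls(path))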
