How to read partitioned Parquet files from S3 using pyarrow in Python

时光说笑 2020-12-07 21:03

I am looking for ways to read data from multiple partitioned directories on S3 using Python, for example:

data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parq
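
Ideally I want something like the sketch below, which would read every partition under the prefix into one dataframe (the bucket name is made up, and it assumes AWS credentials are already configured):

    # Sketch of the goal: read all partitions under an S3 prefix into one dataframe.
    # 'my-bucket' is a placeholder bucket name.
    import s3fs
    import pyarrow.parquet as pq

    s3 = s3fs.S3FileSystem()
    dataset = pq.ParquetDataset('s3://my-bucket/data_folder/', filesystem=s3)
    df = dataset.read().to_pandas()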

5 Answers
  •  囚心锁ツ
    2020-12-07 21:38

    For those of you who want to read in only parts of a partitioned parquet file, pyarrow accepts either a list of keys or just the partial directory path to read in all parts of a partition. This is especially useful for organizations that have partitioned their parquet datasets in a meaningful way, for example by year or country, so users can specify exactly which parts of the file they need. It also reduces costs in the long run, since AWS bills for the amount of data you read out of S3.

    # Read in user-specified partitions of a partitioned parquet file
    import s3fs
    import pyarrow.parquet as pq

    s3 = s3fs.S3FileSystem()

    keys = ['keyname/blah_blah/part-00000-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet',
            'keyname/blah_blah/part-00001-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet',
            'keyname/blah_blah/part-00002-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet',
            'keyname/blah_blah/part-00003-cc2c2113-3985-46ac-9b50-987e9463390e-c000.snappy.parquet']

    bucket = 'bucket_yada_yada_yada'

    # Prefix every key with the s3 scheme and bucket name
    parq_list = ['s3://' + bucket + '/' + key for key in keys]

    # Create your dataframe, reading only the columns you need
    df = (pq.ParquetDataset(parq_list, filesystem=s3)
            .read_pandas(columns=['Var1', 'Var2', 'Var3'])
            .to_pandas())
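    As noted above, pyarrow also accepts just the partial directory path, so instead of listing every key you can point ParquetDataset at a partition's directory. A sketch of that variant, reusing the made-up bucket and key prefix from the example:

    # Read every part file under one partition directory rather than
    # enumerating the individual keys. Paths are placeholders.
    import s3fs
    import pyarrow.parquet as pq

    s3 = s3fs.S3FileSystem()
    partition_path = 's3://bucket_yada_yada_yada/keyname/blah_blah/'
    df = pq.ParquetDataset(partition_path, filesystem=s3).read_pandas().to_pandas()

    Recent pyarrow versions also accept a filters argument on ParquetDataset (e.g. filters=[('year', '=', '2012')]) to select partitions by key rather than by path.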
