How to get list of all columns from a parquet file using s3 select?

后端 未结 1 739
不思量自难忘°
不思量自难忘° 2021-01-14 23:48

I have a parquet file stored in S3 bucket. I want to get the list of all columns of the parquet file. I am using s3 select but it just give me list of all rows

相关标签:
1条回答
  • 2021-01-15 00:12

    I have the same problem but unfortunately, my Google-Kung-Fu was not strong enough this time.

    I found the following workaround which I don't really like but it works for me:

    r = s3.select_object_content(Bucket='...your bucket...', 
                                 Key='...your key...', 
                                 ExpressionType='SQL', 
                                 Expression="select s.* from S3Object s limit 1",
                                 InputSerialization={'Parquet': {}}, 
                                  OutputSerialization={'JSON': {}})
    row = json.loads([rec['Records']['Payload'].decode('utf-8') for rec in r['Payload'] if 'Records' in rec][0])
    
    print("Columns: ", list(row.keys()))
    

    I.e. what the code does is to request the first line of the data, to extract the payload and to load the returned JSON object. The obtained JSON object has the following structure {"Column name": "value", ....}, such that one only has to extract the keys of the JSON object (last line).

    One additional problem is that this does not return the types of columns. This is something that I could not solve yet.

    UPDATE: I observed that in some situations, some column names were not correct reported. Instead of the real name something like _18, _19 were returned. No idea how to deal with it.

    0 讨论(0)
提交回复
热议问题