Nested data in Parquet with Python

清酒与你 2021-02-07 09:17

I have a file with one JSON object per line. Here is a sample:

{
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {
              


        
3 Answers
  •  清酒与你
    2021-02-07 09:42

    This is not exactly the right answer, but it may help.

    We could try to convert your dictionary to a pandas DataFrame and then write it to a .parquet file:

    import pandas as pd
    from fastparquet import write, ParquetFile
    
    d = {
        "product": {
            "id": "abcdef",
            "price": 19.99,
            "specs": {
                "voltage": "110v",
                "color": "white"
            }
        },
        "user": "Daniel Severo"
    }
    
    df_test = pd.DataFrame(d)
    write('file_test.parquet', df_test)
    

    This raises an error:

    ValueError: Can't infer object conversion type: 0                                   abcdef
    1                                    19.99
    2    {'voltage': '110v', 'color': 'white'}
    Name: product, dtype: object
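The message suggests that fastparquet could not settle on a single storage type for the product column, which has dtype object and holds a str, a float, and a dict. A minimal check of that mix, using only the standard library (the dictionary is copied from the example above, and `value_types` is just an illustrative name):

```python
# Same record as in the example above.
d = {
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {"voltage": "110v", "color": "white"},
    },
    "user": "Daniel Severo",
}

# pd.DataFrame(d) turns the keys of d["product"] into rows, so the
# "product" column receives these three values, one per row.
value_types = {key: type(value).__name__ for key, value in d["product"].items()}
print(value_types)  # {'id': 'str', 'price': 'float', 'specs': 'dict'}
```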
    

    So an easy solution is to wrap each value in the product column in a list:

    df_test['product'] = df_test['product'].apply(lambda x: [x])
    
    # this should now work
    write('file_test.parquet', df_test)
    
    # and now compare the file with the initial DataFrame
    ParquetFile('file_test.parquet').to_pandas().explode('product')
        index            product                                 user
    0   id               abcdef                             Daniel Severo
    1   price             19.99                             Daniel Severo
    2   specs   {'voltage': '110v', 'color': 'white'}       Daniel Severo
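If flat columns are acceptable, another option is to flatten each record into dotted keys before building the DataFrame, so that every column holds one scalar type. The sketch below uses only the standard library. `flatten` and `read_json_lines` are hypothetical helper names, and the sample line is assumed to be the record from above serialized as one JSON object per line, as the question describes.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def read_json_lines(lines):
    """Parse one JSON object per non-empty line and flatten each one."""
    return [flatten(json.loads(line)) for line in lines if line.strip()]


# Assumed input: the example record written as a single JSON line.
sample_lines = [
    '{"product": {"id": "abcdef", "price": 19.99, '
    '"specs": {"voltage": "110v", "color": "white"}}, "user": "Daniel Severo"}',
]

rows = read_json_lines(sample_lines)
print(rows[0])
```

The resulting list of dicts can be passed to pd.DataFrame(rows), and that frame should be writable with write('file_test.parquet', df) as above, since no column contains a dict any more. pandas also offers pd.json_normalize for the same kind of flattening.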
    
