Nested data in Parquet with Python

清酒与你 2021-02-07 09:17

I have a file that contains one JSON object per line. Here is a sample:

{
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {
              


        
3 answers
  •  终归单人心
    2021-02-07 09:36

    Implementing the conversions on both the read and write paths for arbitrary Parquet nested data is quite complicated to get right: it means implementing the shredding and reassembly algorithm, along with the associated conversions to Python data structures. We have this on the roadmap in Arrow / parquet-cpp (see https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow), but it has not been completed yet (only simple structs and lists/arrays are supported for now). This functionality is important because other systems that use Parquet, such as Impala, Hive, Presto, Drill, and Spark, have native support for nested types in their SQL dialects, so we need to be able to read and write these structures faithfully from Python.
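
    To give a rough idea of what "shredding and reassembly" means, here is a deliberately simplified pure-Python sketch. It is not how parquet-cpp or fastparquet do it: it handles only nested structs (dicts), treats lists as opaque leaf values, and cannot tell a missing field from an explicit null, which is exactly what Parquet's repetition and definition levels exist to solve. The function names `shred` and `assemble` are made up for illustration, and the `specs` contents are invented because the sample in the question is cut off.

```python
from typing import Any, Dict, Iterator, List, Tuple

Path = Tuple[str, ...]  # e.g. ("product", "specs", "color")


def _walk(node: Dict[str, Any], prefix: Path) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, value) for every non-dict leaf under a nested dict."""
    for key, value in node.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from _walk(value, path)
        else:
            yield path, value


def shred(records: List[Dict[str, Any]]) -> Dict[Path, List[Any]]:
    """Turn row-oriented nested records into one flat column per leaf path."""
    columns: Dict[Path, List[Any]] = {}
    for row_index, record in enumerate(records):
        leaves = dict(_walk(record, ()))
        # A path first seen in this row is back-filled with None for earlier rows.
        for path in leaves:
            columns.setdefault(path, [None] * row_index)
        # Every known column gets exactly one entry per row.
        for path, values in columns.items():
            values.append(leaves.get(path))
    return columns


def assemble(columns: Dict[Path, List[Any]]) -> List[Dict[str, Any]]:
    """Rebuild nested records from the flat columns produced by shred()."""
    n_rows = len(next(iter(columns.values()), []))
    records: List[Dict[str, Any]] = [{} for _ in range(n_rows)]
    for path, values in columns.items():
        for record, value in zip(records, values):
            if value is None:
                continue  # simplification: None is treated as "field absent"
            node = record
            for key in path[:-1]:
                node = node.setdefault(key, {})
            node[path[-1]] = value
    return records


# Hypothetical records; the "specs" fields are invented for this example.
records = [
    {"product": {"id": "abcdef", "price": 19.99, "specs": {"color": "red"}}},
    {"product": {"id": "ghijkl", "price": 5.0}},
]
columns = shred(records)
restored = assemble(columns)
```

    Here `columns` maps each leaf path to a flat list with one entry per row, and `restored` equals `records` for this input. The round trip breaks as soon as a record contains an explicit null, an empty struct, or a list of structs, which is where most of the real complexity (and most of the test cases) comes from.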

    The same shredding and reassembly logic could be implemented in fastparquet as well, but it's going to be a lot of work (and a lot of test cases to write) no matter how you slice it.

    I will likely take on the work (in parquet-cpp) personally later this year if no one beats me to it, but I would love to have some help.
