Split a large json file into multiple smaller files

臣服心动 2021-02-04 08:21

I have a large JSON file, about 5 million records and roughly 32 GB in size, that I need to load into our Snowflake Data Warehouse. I need to get this file broken up into smaller files.

4 Answers
  •  悲&欢浪女
    2021-02-04 08:38

    Snowflake handles JSON in a special way; once you understand it, the design is easy to work out.

    1. JSON/Parquet/Avro/XML are treated as semi-structured data.
    2. They are stored in Snowflake using the VARIANT data type.
    3. When loading the JSON data from a stage location, set strip_outer_array = true:

      copy into <target_table>
      from @~/<your_file>.json
      file_format = (type = 'JSON' strip_outer_array = true);

    4. Each row cannot exceed 16 MB (compressed) when loaded into Snowflake.

    5. Snowflake data loading works best when the input is split into files of roughly 10-100 MB each.
    6. Use a utility that splits the file on line (record) boundaries and keeps each piece no larger than about 100 MB; that gives you parallelism as well as accuracy for your data (see the splitting sketch at the end of this answer).

      With a roughly 32 GB data set split into 100 MB pieces, you will end up with about 320 small files.

      • Ideally all of those files would load in parallel, but running every load at the same time is not possible.
      • So choose an X-Large warehouse (16 v-cores & 32 threads).
      • 320 / 32 = approximately 10 rounds.
      • Depending on your network bandwidth, this should load the data within a few minutes; even at a few seconds per round, the whole data set finishes quickly.

      Look at the warehouse configuration and throughput details, and refer to Snowflake's best practices for loading semi-structured data.
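
      For the splitting itself, a plain line-based splitter is enough. Below is a minimal Python sketch (the names big_file.json, chunks/ and the split_file helper are illustrative assumptions, not from this answer); it assumes each JSON record sits on its own line, i.e. newline-delimited JSON. If your source is a single large array, convert it to one record per line first. GNU split's --line-bytes option does the same job from the shell.

      # split_json.py -- illustrative helper, not a Snowflake tool.
      # Splits a newline-delimited JSON file into pieces of at most ~100 MB,
      # always breaking on record (line) boundaries so no record is cut in half.
      import os

      def split_file(src_path, out_dir, max_bytes=100 * 1024 * 1024):
          os.makedirs(out_dir, exist_ok=True)
          part, written, out = 0, 0, None
          with open(src_path, "rb") as src:
              for line in src:  # one JSON record per line
                  if out is None or written + len(line) > max_bytes:
                      if out:
                          out.close()
                      part += 1
                      out = open(os.path.join(out_dir, f"part_{part:05d}.json"), "wb")
                      written = 0
                  out.write(line)
                  written += len(line)
          if out:
              out.close()
          return part

      if __name__ == "__main__":
          n = split_file("big_file.json", "chunks")
          print(f"wrote {n} files")

      Once the pieces are staged (for example with PUT file://chunks/part_*.json @~), the COPY statement from step 3 can load them all, and the warehouse processes many files in parallel.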
