Split a large JSON file into multiple smaller files

臣服心动 2021-02-04 08:21

I have a large JSON file, about 5 million records and a file size of about 32 GB, that I need to get loaded into our Snowflake Data Warehouse. I need to get this file broken up into smaller files so that it can be loaded.

4 Answers
  • 2021-02-04 08:31

    Use these commands at the Linux command prompt:

      # split into 53750 KB (~52 MB) pieces named xaa, xab, ...
      split -b 53750k <your-file>
      # concatenate the pieces back into a single file
      cat xa* > <your-file>
    

    Refer to this link: https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file
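
    Note that a raw byte split can cut a JSON record in half. If your file is newline-delimited JSON (one record per line), a line-aware split keeps every record intact. A minimal sketch, assuming GNU coreutils and a hypothetical input name your-file.json:

      # -C packs complete lines up to ~100 MB per chunk, so no record is cut
      # in the middle the way a plain byte split (-b) can be; -d gives numeric
      # suffixes: chunk_00.json, chunk_01.json, ...
      split -C 100m -d --additional-suffix=.json your-file.json chunk_

    The ~100 MB cap also matches the file size range Snowflake recommends for parallel loading (see the last answer below).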

  • 2021-02-04 08:34

    Whether Python or Node is better for this task is a matter of opinion, and we are not allowed to voice opinions on Stack Overflow. You have to decide for yourself what you have more experience in and what you want to work with - Python or Node.

    If you go with Node, there are modules that do streaming JSON parsing and can help with this task, for example:

    • https://www.npmjs.com/package/JSONStream
    • https://www.npmjs.com/package/stream-json
    • https://www.npmjs.com/package/json-stream

    If you go with Python, there are streaming JSON parsers here as well:

    • https://github.com/kashifrazzaqui/json-streamer
    • https://github.com/danielyule/naya
    • http://www.enricozini.org/blog/2011/tips/python-stream-json/
  • 2021-02-04 08:38

    Consider using jq to preprocess your JSON files.

    It can split and stream your large JSON files; a sketch follows the quote below.

    jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
    

    See the official documentation and this question for more.
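
    In jq's default mode the whole 32 GB document would be read into memory; its --stream mode parses incrementally and avoids that. A rough sketch, assuming the input (a hypothetical big.json) holds one top-level JSON array:

      # --stream emits parse events instead of loading the whole file; the
      # filter reassembles each top-level array element and prints it as one
      # compact line (NDJSON), and split packs those lines into ~100 MB chunks.
      jq -cn --stream 'fromstream(1 | truncate_stream(inputs))' big.json \
        | split -C 100m -d --additional-suffix=.json - part_

    Each resulting part_NN.json then holds complete records, one per line, which fits the Snowflake loading advice in the next answer.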

    Extra: as for your first question, jq is written in C, so it should be faster than Python/Node, shouldn't it?

  • 2021-02-04 08:38

    Snowflake treats JSON in a special way, and once you understand it, the design is easy to lay out.

    1. JSON/Parquet/Avro/XML are treated as semi-structured data.
    2. They are stored in Snowflake's VARIANT data type.
    3. When loading the JSON data from the stage, set strip_outer_array = true:

      copy into <table> from @~/<file>.json file_format = (type = 'JSON' strip_outer_array = true);

    4. Each row cannot exceed 16 MB compressed when loaded into Snowflake.

    5. Snowflake data loading works best when files are split into the 10-100 MB range.

    Use a utility that splits the file on line boundaries and keeps each file no larger than 100 MB; that gives you both parallelism and accurate record boundaries. A sketch of the split-and-upload flow follows.
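
    A minimal sketch of that flow, assuming newline-delimited chunks named chunk_*.json (as produced in the earlier answers), a hypothetical local directory /tmp/chunks, a hypothetical target table my_table, and an already-configured SnowSQL connection:

      # Upload every chunk to the user stage; PUT gzips them (AUTO_COMPRESS)
      # and PARALLEL controls how many files are uploaded concurrently.
      snowsql -q "PUT file:///tmp/chunks/chunk_*.json @~ AUTO_COMPRESS=TRUE PARALLEL=32;"

      # One COPY then loads all staged chunks; because each line is already a
      # complete record (NDJSON), strip_outer_array is not needed here.
      snowsql -q "COPY INTO my_table FROM @~ PATTERN='.*chunk_.*' FILE_FORMAT=(TYPE='JSON');"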

    As per your data set size (about 32 GB), you will get roughly 320 small files of about 100 MB each.

    • Running all 320 file loads in parallel at once is not possible.
    • So choose an X-Large warehouse (16 v-cores, 32 threads).
    • 320 / 32 = roughly 10 rounds.
    • Based on your network bandwidth this should take no more than a few minutes; even at 3 seconds per round, the COPY itself finishes in well under a minute.

    Look at the warehouse configuration and throughput details, and refer to the semi-structured data loading best practices.
