Split a large JSON file into multiple smaller files

臣服心动 2021-02-04 08:21

I have a large JSON file, about 5 million records and a file size of about 32 GB, that I need to get loaded into our Snowflake Data Warehouse. I need to get this file broken up into smaller files so that it can be loaded.

4 Answers
  • 2021-02-04 08:31

    Use these commands at the Linux command prompt:

      # split into 53750 KB (~52 MB) pieces named xaa, xab, ...
      split -b 53750k <your-file>
      # concatenate the pieces back into a single file
      cat xa* > <your-file>
    

    Refer to this link: https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file
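
    Note that a raw byte split can cut a JSON record in half. If your file is newline-delimited JSON (one record per line), a line-aware split keeps every record intact. A minimal sketch, assuming GNU coreutils and a hypothetical input name your-file.json:

      # -C packs complete lines up to ~100 MB per chunk, so no record is cut
      # in the middle the way a plain byte split (-b) can be; -d gives numeric
      # suffixes: chunk_00.json, chunk_01.json, ...
      split -C 100m -d --additional-suffix=.json your-file.json chunk_

    The ~100 MB cap also matches the file size range Snowflake recommends for parallel loading (see the last answer below).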

  • 2021-02-04 08:34

    Whether Python or Node is better for this task is a matter of opinion, and we are not allowed to voice opinions on Stack Overflow. You have to decide for yourself what you have more experience in and what you want to work with - Python or Node.

    If you go with Node, there are modules that do streaming JSON parsing and can help with this task, for example:

    • https://www.npmjs.com/package/JSONStream
    • https://www.npmjs.com/package/stream-json
    • https://www.npmjs.com/package/json-stream

    If you go with Python, there are streaming JSON parsers here as well:

    • https://github.com/kashifrazzaqui/json-streamer
    • https://github.com/danielyule/naya
    • http://www.enricozini.org/blog/2011/tips/python-stream-json/
  • 2021-02-04 08:38

    Consider using jq to preprocess your JSON files.

    It can split and stream your large JSON files; a sketch follows the quote below.

    jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
    

    See the official documentation and this question for more.
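
    In jq's default mode the whole 32 GB document would be read into memory; its --stream mode parses incrementally and avoids that. A rough sketch, assuming the input (a hypothetical big.json) holds one top-level JSON array:

      # --stream emits parse events instead of loading the whole file; the
      # filter reassembles each top-level array element and prints it as one
      # compact line (NDJSON), and split packs those lines into ~100 MB chunks.
      jq -cn --stream 'fromstream(1 | truncate_stream(inputs))' big.json \
        | split -C 100m -d --additional-suffix=.json - part_

    Each resulting part_NN.json then holds complete records, one per line, which fits the Snowflake loading advice in the next answer.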

    Extra: as for your first question, jq is written in C, so it should be faster than Python/Node, shouldn't it?

  • 2021-02-04 08:38

    Snowflake treats JSON in a special way, and once you understand it, the design is easy to lay out.

    1. JSON/Parquet/Avro/XML are treated as semi-structured data.
    2. They are stored in Snowflake's VARIANT data type.
    3. When loading the JSON data from the stage, set strip_outer_array = true:

      copy into <table> from @~/<file>.json file_format = (type = 'JSON' strip_outer_array = true);

    4. Each row cannot exceed 16 MB compressed when loaded into Snowflake.

    5. Snowflake data loading works best when files are split into the 10-100 MB range.

    Use a utility that splits the file on line boundaries and keeps each file no larger than 100 MB; that gives you both parallelism and accurate record boundaries. A sketch of the split-and-upload flow follows.
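
    A minimal sketch of that flow, assuming newline-delimited chunks named chunk_*.json (as produced in the earlier answers), a hypothetical local directory /tmp/chunks, a hypothetical target table my_table, and an already-configured SnowSQL connection:

      # Upload every chunk to the user stage; PUT gzips them (AUTO_COMPRESS)
      # and PARALLEL controls how many files are uploaded concurrently.
      snowsql -q "PUT file:///tmp/chunks/chunk_*.json @~ AUTO_COMPRESS=TRUE PARALLEL=32;"

      # One COPY then loads all staged chunks; because each line is already a
      # complete record (NDJSON), strip_outer_array is not needed here.
      snowsql -q "COPY INTO my_table FROM @~ PATTERN='.*chunk_.*' FILE_FORMAT=(TYPE='JSON');"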

    As per your data set size (about 32 GB), you will get roughly 320 small files of about 100 MB each.

    • Running all 320 file loads in parallel at once is not possible.
    • So choose an X-Large warehouse (16 v-cores, 32 threads).
    • 320 / 32 = roughly 10 rounds.
    • Based on your network bandwidth this should take no more than a few minutes; even at 3 seconds per round, the COPY itself finishes in well under a minute.

    Look at the warehouse configuration and throughput details, and refer to the semi-structured data loading best practices.
