I have a large JSON file, about 5 million records and a file size of about 32GB, that I need to get loaded into our Snowflake Data Warehouse. I need to get this file broken up into smaller files so that I can load it.
Use these commands at a Linux command prompt:
split -b 53750k <your-file>
cat xa* > <your-file>
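Here split -b 53750k cuts the file into roughly 53 MB pieces named xaa, xab, ..., and cat xa* > <your-file> stitches them back together once you are done. Note that -b splits on raw byte boundaries, so an individual piece may end in the middle of a record.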
Refer to this link: https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file
Whether Python or Node would be better for the task is a matter of opinion, and we are not allowed to voice opinions on Stack Overflow. You have to decide for yourself which one you have more experience in and which you want to work with - Python or Node.
If you go with Node, there are modules that can help with this task by doing streaming JSON parsing, for example JSONStream or stream-json.
If you go with Python, there are streaming JSON parsers as well, for example ijson.
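As an illustration, here is a minimal sketch in Python using ijson that reads the top-level array one record at a time and writes newline-delimited chunk files; the file names and the 100,000-records-per-chunk figure are assumptions you would tune for your data:

import json
import ijson  # streaming JSON parser: pip install ijson

RECORDS_PER_CHUNK = 100_000  # assumption; pick a value that keeps each output file small

with open("big.json", "rb") as src:  # placeholder name for the 32 GB file
    out = None
    # ijson.items(..., "item") yields each element of the top-level array
    # without loading the whole document into memory; use_float avoids
    # Decimal values that json.dumps cannot serialize (needs a recent ijson)
    for i, record in enumerate(ijson.items(src, "item", use_float=True)):
        if i % RECORDS_PER_CHUNK == 0:
            if out:
                out.close()
            out = open(f"chunk_{i // RECORDS_PER_CHUNK:04d}.json", "w")
        out.write(json.dumps(record) + "\n")  # one record per line (NDJSON)
    if out:
        out.close()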
Consider using jq to preprocess your JSON files - it can split and stream your large JSON files.
"jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text."
See the official documentation and this question for more.
Extra: as for your first question, jq is written in C, so it's faster than Python/Node, isn't it?
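As a sketch, assuming the file is one big top-level JSON array: jq's --stream mode rebuilds each array element without holding the whole document in memory, prints one compact object per line, and split then cuts that stream into files of 100,000 lines each (the line count and output prefix are arbitrary choices):

jq -cn --stream 'fromstream(1|truncate_stream(inputs))' <your-file>.json | split -l 100000 - records_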
Snowflake treats JSON in a special way, and once we understand how, the design is easy to draw up.
When loading the JSON data from the stage location, set strip_outer_array = true so that the outer array is removed and each element becomes its own row:
copy into <table>
from @~/<file>.json
file_format = (type = 'JSON' strip_outer_array = true);
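The split files have to be in the stage before that COPY can read them; a minimal sketch using SnowSQL's PUT command (the local path is an assumption, @~ is the user stage referenced above):

put file:///tmp/json_chunks/*.json @~ auto_compress=true;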
Each row cannot exceed 16 MB compressed when loaded into Snowflake.
Use a utility that splits the file on line boundaries, keeping each file no larger than about 100 MB; that gives you the power of parallelism as well as keeping each record intact.
As per your data set size (about 32 GB), you will get around 320 small files of about 100 MB each.
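A sketch of such a line-aware split into files of at most ~100 MB, assuming each record already sits on its own line (for example, the output of the jq or ijson step above); the input and prefix names are illustrative, and -C (keep whole lines per piece), -d, and --additional-suffix are GNU split options:

split -C 100M -d --additional-suffix=.json records.ndjson chunk_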
Look at the warehouse configuration and throughput details, and refer to the semi-structured data loading best practices.