Question
I'm setting up a Snowpipe to load data from an S3 bucket into a Snowflake schema. The S3 bucket contains files in NDJSON format. One file can contain multiple records, and I want to process all of them, even if one record is broken.
To do so, I need to add the on_error='continue'
option to the pipe creation and use the CSV file format, as stated in the official Snowflake docs here.
That way I receive raw JSON strings that I need to parse to access the data. And since Snowpipe does not support nested selects, the only way to do that is to parse the JSON for each column individually.
Resulting in this copy statement:
copy into MY_TABLE
from (select parse_json($1):id, parse_json($1):name, parse_json($1):status
from @MY_STAGE_CONNECTED_TO_S3)
on_error = 'continue'
This code needs to parse the JSON three times for each row.
I have a table with ~40 columns, and for it this query runs ~5 times slower than the more straightforward solution below, which uses the file format option to parse the JSON but unfortunately does not support the on_error='continue'
option.
copy into HQO_DEVELOPMENT.PUBLIC.DIM_TENANT
from (select $1:id, $1:name, $1:status
from @HQO_DEVELOPMENT.PUBLIC.DIM_TENANT_STAGE_NN)
file_format = (type = 'json')
What was tried
- Using a nested select like this (not supported):
copy into HQO_DEVELOPMENT.PUBLIC.DIM_TENANT from
(select $1:id, $1:name, $1:status from (
select parse_json($1) from
@HQO_DEVELOPMENT.PUBLIC.DIM_TENANT_STAGE))
on_error = 'continue'
- Setting the JSON file type on the stage and omitting it on the pipe (see the sketch after this list): this doesn't help the issue
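For reference, that second attempt amounts to roughly the following sketch (the exact statements are illustrative, reusing the object names above; the pipe name is made up):
-- illustrative sketch: JSON file format attached to the stage,
-- no file_format clause on the pipe itself
alter stage HQO_DEVELOPMENT.PUBLIC.DIM_TENANT_STAGE
  set file_format = (type = 'json');
create pipe DIM_TENANT_PIPE as
  copy into HQO_DEVELOPMENT.PUBLIC.DIM_TENANT
  from (select $1:id, $1:name, $1:status
        from @HQO_DEVELOPMENT.PUBLIC.DIM_TENANT_STAGE)
  on_error = 'continue';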
Is there a way to keep the benefit of on_error='continue'
without parsing the JSON for every column?
Answer 1:
The Snowflake documentation says: Both CSV and semi-structured file types are supported; however, even when loading semi-structured data (e.g. JSON), you should set CSV as the file format type (default value). You can use the corresponding file format (e.g. JSON), but any error in the transformation will stop the COPY operation, even if you set the ON_ERROR option to continue or skip the file.
https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-table.html
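In other words, the documented pattern is the CSV file format combined with PARSE_JSON in the projection. As a complete pipe, that would look roughly like this sketch (object names are illustrative; FIELD_DELIMITER = NONE keeps each NDJSON line intact in $1):
-- illustrative sketch of the docs-recommended CSV + PARSE_JSON pattern
CREATE PIPE mypipe_csv AS
COPY INTO mytable
FROM (SELECT PARSE_JSON($1):id, PARSE_JSON($1):name, PARSE_JSON($1):location
      FROM @mystage_csv)
FILE_FORMAT = (TYPE = 'csv' FIELD_DELIMITER = NONE)
ON_ERROR = CONTINUE;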
On the other hand, I see that the on_error option works for NDJSON files, at least when you set the file type at the stage level. For testing,
I created the following NDJSON file:
{ "id":1, "name":"Gokhan", "location":"Netherlands" }
{ "id":2, "name":"Hans", location:Germany -- broken json #1 }
{ "id":3, "name":"Joe", "location":"UK" }
{ broken json #2 }
{ "id":4, "name":"Mike", "location":"US" }
I created a file format object, and created a stage using this file format (you can alter your existing stage and set a file format):
CREATE FILE FORMAT myformat TYPE = json;
CREATE STAGE mystage FILE_FORMAT = (FORMAT_NAME = myformat);
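The upload step can be done with SnowSQL's PUT command, roughly like this (the local file path is illustrative):
-- illustrative: upload the sample file to the internal stage from SnowSQL
PUT file:///tmp/sample.ndjson @mystage;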
After uploading my sample NDJSON file to this stage, I created a Snowpipe to load it:
CREATE PIPE mypipe AS
COPY INTO mytable
FROM (SELECT $1:id, $1:name, $1:location FROM @mystage)
ON_ERROR = CONTINUE;
When I refreshed the pipe, it successfully loaded the three "valid" records from the file.
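To trigger the load and verify the result, something like this works (the COPY_HISTORY call assumes the load ran within the last hour; table and pipe names match the example above):
-- ask the pipe to pick up already-staged files
ALTER PIPE mypipe REFRESH;
-- check how many rows loaded vs. parsed, and the first error per file
SELECT file_name, row_count, row_parsed, first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'mytable',
    START_TIME => DATEADD(hour, -1, CURRENT_TIMESTAMP())));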
Source: https://stackoverflow.com/questions/60156843/how-to-parse-json-efficiently-in-snowpipe-with-on-error-continue