Question
I am using this code to load large XML data from a stage into Snowflake. I have to use STRIP_OUTER_ELEMENT = TRUE, otherwise this error occurs:
Error parsing XML: document is too large, max size 16777216 bytes
COPY INTO SAMPLE_DB.SAMPLE_SCH.T_TABLE (CATALOG_XML)
FROM @META_DB.CONFIG.STAGESNOWFLAKE/catalogmain.xml
FILE_FORMAT=(TYPE=XML STRIP_OUTER_ELEMENT = TRUE)
ON_ERROR='CONTINUE';
on this XML, which is very large:
<catalog xmlns="http://www.demandware.com/xml/impex/catalog/2006-10-31" catalog-id="catalog1">
<product product-id="prod1">
..
..
..
</product>
</catalog>
And I get this result as CATALOG_XML (the outer catalog tag and its attributes are missing):
<product product-id="prod1">
..
..
..
</product>
I need to get catalog-id from this XML. Is there any way to get it?
Answer 1:
With STRIP_OUTER_ELEMENT = FALSE, the whole XML document is read into a single value. Whether that value is cast as VARCHAR or VARIANT, and whether it is read in a COPY INTO statement or in a direct file query, Snowflake's 16 MiB size limit for those data types cannot be bypassed.
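For example, a direct file query against the staged file runs into the same limit, because $1 is the entire document as one VARIANT value. A minimal sketch; the named file format my_xml_format is a hypothetical one you would create first:
CREATE FILE FORMAT IF NOT EXISTS META_DB.CONFIG.my_xml_format TYPE = XML;
-- $1 is the whole XML document as a single VARIANT, so a file larger
-- than 16 MiB fails here just as it does in COPY INTO.
SELECT $1
FROM @META_DB.CONFIG.STAGESNOWFLAKE/catalogmain.xml
(FILE_FORMAT => 'META_DB.CONFIG.my_xml_format');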
You can try splitting your XML file into manageably sized pieces, preserving the parent tags, and uploading smaller files that together carry the same data.
For example, the following JavaScript program for Node.js breaks the file into a fixed number of product
elements per output file, while preserving the outer catalog element in each one. This produces a directory of smaller files that can be loaded and queried without running into the 16 MiB limit (see the SQL sketch after the shell session below).
~> mkdir /tmp/xsplt /tmp/split_files && cd /tmp/xsplt
~> npm init && npm install xmlsplit
~> cat > splitter.js << 'EOF'   # quote EOF so the shell does not expand ${...} or backticks inside the script
var xs = require('xmlsplit')
var fs = require('fs')

// Emit 500 product elements per output document, each still wrapped
// in the original outer catalog element.
var xmlsplit = new xs(500)

var inputStream = fs.createReadStream("/tmp/large-input.xml")
var counter = 1

// Pipe the input through the splitter instance (not the constructor)
// and write each emitted chunk out as its own XML file.
inputStream.pipe(xmlsplit).on('data', function(data) {
  var xmlDocument = data.toString()
  fs.writeFile(
    `/tmp/split_files/output-part-${counter}.xml`,
    xmlDocument,
    (err) => { if (err) { console.log(err) } })
  counter += 1
})
EOF
~> node splitter.js
# Original input line count
~> wc -l /tmp/large-input.xml
500002
# Lines per split file
~> wc -l /tmp/split_files/output-part-1.xml
502
# No. of smaller files
~> ls /tmp/split_files/ | wc -l
1000
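The split files can then be loaded with STRIP_OUTER_ELEMENT = FALSE, since each file now fits the size limit and still carries the catalog element, and catalog-id can be read off that outer element. A minimal sketch, assuming the split files have been PUT to a split_files/ sub-path of the same stage (that sub-path is an assumption; adjust it to wherever you upload them):
COPY INTO SAMPLE_DB.SAMPLE_SCH.T_TABLE (CATALOG_XML)
FROM @META_DB.CONFIG.STAGESNOWFLAKE/split_files/
FILE_FORMAT=(TYPE=XML STRIP_OUTER_ELEMENT = FALSE)
ON_ERROR='CONTINUE';

-- catalog-id is an attribute of the outer element, readable via GET with an '@' prefix;
-- XMLGET pulls out a named child element.
SELECT GET(CATALOG_XML, '@catalog-id')::STRING AS catalog_id,
       XMLGET(CATALOG_XML, 'product') AS first_product
FROM SAMPLE_DB.SAMPLE_SCH.T_TABLE;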
Other alternatives are to export the data as CSV instead of XML, or to perform a local file-format conversion before loading the data into Snowflake.
P.S. For help with loading and querying the split files, check out the earliest revision of this answer.
Source: https://stackoverflow.com/questions/62037868/how-to-get-xml-header-tag-data-in-snowflake-for-a-large-xml-while-using-strip