how to get xml header tag data in snowflake for a large xml , while using STRIP_OUTER_ELEMENT = TRUE

Submitted by 拟墨画扇 on 2020-12-15 11:48:38

Question


I am using the following code to load a large XML file from a stage into Snowflake. I have to use STRIP_OUTER_ELEMENT = TRUE, otherwise this error occurs:

Error parsing XML: document is too large, max size 16777216 bytes

COPY INTO SAMPLE_DB.SAMPLE_SCH.T_TABLE (CATALOG_XML)
FROM @META_DB.CONFIG.STAGESNOWFLAKE/catalogmain.xml
FILE_FORMAT=(TYPE=XML STRIP_OUTER_ELEMENT = TRUE) 
ON_ERROR='CONTINUE';

on this XML, which is very large:

<catalog xmlns="http://www.demandware.com/xml/impex/catalog/2006-10-31" catalog-id="catalog1">
    <product product-id="prod1">
     ..
     ..
     ..
    </product>
</catalog>

and I get this result as CATALOG_XML (the catalog tag data is missing):

<product product-id="prod1">
 ..
 ..
 ..
</product>

I need to get catalog-id from this XML. Is there any way to get it?


Answer 1:


With STRIP_OUTER_ELEMENT = FALSE, the whole XML document is read into a single value. Whether that value is cast as VARCHAR or VARIANT, and whether it comes from a COPY INTO statement or a direct file query, Snowflake's 16 MiB size limit for those data types cannot be bypassed.

You can try splitting your XML file into manageably sized pieces that preserve the parent tags, then uploading the smaller files carrying the same data.

For example, the following Node.js program breaks the file into a fixed number of product elements per output file, while preserving the outer catalog element in each one. This produces a directory of smaller files that can be loaded and queried without hitting the 16 MiB limit.

~> mkdir /tmp/xsplt /tmp/split_files && cd /tmp/xsplt
~> npm init -y && npm install xmlsplit

~> cat > splitter.js << 'EOF'
const xs = require('xmlsplit')
const fs = require('fs')

// Split into 500 product elements per output document
const xmlsplit = new xs(500)
const inputStream = fs.createReadStream("/tmp/large-input.xml")

let counter = 1
inputStream.pipe(xmlsplit).on('data', function(data) {
    const xmlDocument = data.toString()
    fs.writeFile(
        `/tmp/split_files/output-part-${counter}.xml`,
        xmlDocument,
        (err) => { if (err) { console.log(err) } })
    counter += 1
})
EOF

~> node splitter.js

# Original input line count
~> wc -l /tmp/large-input.xml
500002
# Lines per split file
~> wc -l /tmp/split_files/output-part-1.xml
502
# No. of smaller files
~> ls /tmp/split_files/ | wc -l 
1000
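Because each split file keeps the catalog root element, the catalog-id attribute the question asks about is still present in every piece. A minimal sketch of pulling the attribute out in Node.js (using an inline sample string in place of a real split file, and a simple regex rather than a full XML parser, which would be safer for arbitrary input):

```javascript
// Each split file's root still looks like the original catalog element.
// The sample below stands in for
// fs.readFileSync('/tmp/split_files/output-part-1.xml', 'utf8').
const sample =
    '<catalog xmlns="http://www.demandware.com/xml/impex/catalog/2006-10-31" ' +
    'catalog-id="catalog1"><product product-id="prod1"></product></catalog>'

// Naive attribute extraction: find the first catalog-id="..." occurrence.
function catalogId(xml) {
    const match = xml.match(/catalog-id="([^"]+)"/)
    return match ? match[1] : null
}

console.log(catalogId(sample))  // prints "catalog1"
```

The same value can also be read on the Snowflake side after loading, since each split document retains the catalog element as its root.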

Other alternatives are to export the data as CSV instead of XML, or to perform a local file-format conversion before loading the data into Snowflake.
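As a sketch of that local-conversion route (the column layout here is an assumption for illustration, not taken from the real catalog schema), each product can be flattened into a CSV row that carries the parent catalog-id forward, so the header data is not lost:

```javascript
// Hypothetical flattening: one CSV row per product, with the parent
// catalog-id repeated on every row.
function toCsvRows(catalogId, productIds) {
    const header = 'catalog_id,product_id'
    const rows = productIds.map(id => `${catalogId},${id}`)
    return [header, ...rows].join('\n')
}

console.log(toCsvRows('catalog1', ['prod1', 'prod2']))
// catalog_id,product_id
// catalog1,prod1
// catalog1,prod2
```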

P.S. For help with loading and querying the split files, check out the earliest revision of this answer.



Source: https://stackoverflow.com/questions/62037868/how-to-get-xml-header-tag-data-in-snowflake-for-a-large-xml-while-using-strip
