open-uri and sax parsing for a giant xml document

Posted by 被刻印的时光 ゝ on 2019-12-07 23:33:22

Question


I need to download and process a large external XML file (300 MB+), then walk through the document and save elements to the database.

I am already doing this without problems on a production server, using Saxerator to be gentle on memory. It works great. Here is my issue:

I need to use open-uri (though there could be alternative solutions?) to grab the file to parse. The problem is that open-uri has to load the whole file before any parsing starts, which defeats the entire purpose of using a SAX parser to save memory. Are there any workarounds? Can I read directly from the external XML document? I cannot load the entire file or it crashes my server, and since the document is updated every 30 minutes, I can't just keep a saved copy of it on my server (though this is what I am doing currently to make sure everything is working).

P.S. I am doing this in Ruby.


Answer 1:


You may want to try Net::HTTP's streaming interface instead of open-uri. It hands you the response body in chunks as they arrive, so Saxerator (via the underlying Nokogiri::SAX::Parser) can be given an IO object rather than the entire file.
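
One way this might look in practice is sketched below, using only the standard library: a background thread copies the chunks from Net::HTTP into a pipe, and the read end of the pipe is the IO object passed to the parser. The helper name with_streamed_body is made up for this illustration, and the URL, the :item tag and the save call in the usage comment are placeholders, not anything from the question. It also assumes that Saxerator accepts any IO, as its README describes.

```ruby
require "net/http"
require "uri"

# Hypothetical helper (not part of any library): streams the HTTP response
# body for +url+ into a pipe and yields the read end, so a SAX parser can
# consume it like any other IO without the whole body being held in memory.
def with_streamed_body(url)
  uri = URI(url)
  reader, writer = IO.pipe
  reader.binmode
  writer.binmode

  producer = Thread.new do
    begin
      Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
        http.request_get(uri.request_uri) do |response|
          response.value # raises unless the status is 2xx
          # read_body with a block yields the body piece by piece.
          response.read_body { |chunk| writer.write(chunk) }
        end
      end
    rescue Errno::EPIPE
      # The consumer closed its end before reading everything; stop quietly.
    ensure
      writer.close # signals end-of-file to the reader
    end
  end

  result = yield reader
  reader.close   # unblocks the producer if the consumer stopped early
  producer.join  # re-raises any other error from the download thread
  result
ensure
  reader&.close
end

# Usage sketch (assumes the saxerator gem; URL, :item and save are placeholders):
#   with_streamed_body("https://example.com/feed.xml") do |io|
#     Saxerator.parser(io).for_tag(:item).each { |item| save(item) }
#   end
```

The pipe gives natural back-pressure: when the parser falls behind, the download thread blocks on write, so memory use stays bounded by the pipe buffer rather than the file size. If the download fails midway, the parser will most likely see a truncated document first, so real code would want more careful error reporting than this sketch has.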




Answer 2:


I took a few minutes to write this up and then realized you tagged this question with ruby. My solution is in Java so I apologize for that. I'm still including it here since it could be useful to you or someone down the road.

This is how I have always processed large external XML files:

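// Needs imports from javax.xml.parsers, org.xml.sax, org.xml.sax.helpers, java.io and java.net.
XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
xmlReader.setFeature("http://xml.org/sax/features/namespaces", true);

// Subclass XMLFilterImpl and override startElement/characters/endElement to handle elements as they arrive.
XMLFilter filter = new XMLFilterImpl();
filter.setParent(xmlReader);

// Parse directly from the connection's stream, so the document is never loaded in full.
InputStream in = new URL("<url to external document here>").openConnection().getInputStream();
filter.parse(new InputSource(new BufferedReader(new InputStreamReader(in, "UTF-8"))));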


Source: https://stackoverflow.com/questions/21634792/open-uri-and-sax-parsing-for-a-giant-xml-document
