Parsing extremely large XML files in php

落爺英雄遲暮 提交于 2019-12-28 04:15:16

问题


I need to parse XML files of 40GB in size, and then normalize, and insert to a MySQL database. How much of the file I need to store in the database is not clear, neither do I know the XML structure.

Which parser should I use, and how would you go about doing this?


回答1:


In PHP, you can read in extreme large XML files with the XMLReaderDocs:

$reader = new XMLReader();
$reader->open($xmlfile);

Extreme large XML files should be stored in a compressed format on disk. At least this makes sense as XML files have a high compression ratio. For example gzipped like large.xml.gz.

PHP supports that quite well with XMLReader via the compression wrappersDocs:

$xmlfile = 'compress.zlib://path/to/large.xml.gz';

$reader = new XMLReader();
$reader->open($xmlfile);

The XMLReader allows you to operate on the current element "only". That means it's forward-only. If you need to keep parser state, you need to build it your own.

I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on XMLReader like iterating through elements or child-elements only. You find this outlined in Parse XML with PHP and XMLReader.

See as well:

  • PHP open gzipped XML



回答2:


It would be nice to know what you actually intend to do with the XML. The way you parse it depends very much on the processing you need to carry out, as well as the size.

If this is a one-off task, then I've started in the past by discovering the XML structure before doing anything else. My DTDGenerator (see saxon.sf.net) was written for this purpose a long time ago and still does the job, there are other tools available now but I don't know whether they do streamed processing which is a prerequisite here.

You can write an application that processes the data using either a pull or push streamed parser (SAX or StAX). How easy this is depends on how much processing you have to do and how much state you have to maintain, which you haven't told us. Alternatively you could try streamed XSLT processing, which is available in Saxon-EE.



来源:https://stackoverflow.com/questions/15363901/parsing-extremely-large-xml-files-in-php

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!