Parsing extremely large XML files in php

前端 未结 2 943
花落未央
花落未央 2020-12-03 15:33

I need to parse XML files of 40GB in size, and then normalize, and insert to a MySQL database. How much of the file I need to store in the database is not clear, neither do

相关标签:
2条回答
  • 2020-12-03 16:01

    It would be nice to know what you actually intend to do with the XML. The way you parse it depends very much on the processing you need to carry out, as well as the size.

    If this is a one-off task, then I've started in the past by discovering the XML structure before doing anything else. My DTDGenerator (see saxon.sf.net) was written for this purpose a long time ago and still does the job, there are other tools available now but I don't know whether they do streamed processing which is a prerequisite here.

    You can write an application that processes the data using either a pull or push streamed parser (SAX or StAX). How easy this is depends on how much processing you have to do and how much state you have to maintain, which you haven't told us. Alternatively you could try streamed XSLT processing, which is available in Saxon-EE.

    0 讨论(0)
  • 2020-12-03 16:20

    In PHP, you can read in extreme large XML files with the XMLReaderDocs:

    $reader = new XMLReader();
    $reader->open($xmlfile);
    

    Extreme large XML files should be stored in a compressed format on disk. At least this makes sense as XML files have a high compression ratio. For example gzipped like large.xml.gz.

    PHP supports that quite well with XMLReader via the compression wrappersDocs:

    $xmlfile = 'compress.zlib://path/to/large.xml.gz';
    
    $reader = new XMLReader();
    $reader->open($xmlfile);
    

    The XMLReader allows you to operate on the current element "only". That means it's forward-only. If you need to keep parser state, you need to build it your own.

    I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on XMLReader like iterating through elements or child-elements only. You find this outlined in Parse XML with PHP and XMLReader.

    See as well:

    • PHP open gzipped XML
    0 讨论(0)
提交回复
热议问题