I'm dealing with this kind of XML sequence file. Can anyone suggest how to parse it?
That file contains a sequence of XML documents concatenated to each other. You can register a PHP stream wrapper that transparently splits the file for you; then you can process each document individually, even in a streaming fashion. Example:
stream_wrapper_register('xmlseq', 'XMLSequenceStream');
$path = "xmlseq://zip://ipg140107.zip#ipg140107.xml";
while (XMLSequenceStream::notAtEndOfSequence($path)) {
    $reader = new XMLReader();
    $reader->open($path);
    // just consume the whole document
    // (next() is an instance method, so call it with ->, not ::)
    while ($reader->next()) {
        XMLReaderNode::dump($reader);
    }
}
XMLSequenceStream::clean();
That stream wrapper is part of the XMLReaderIterator library. It also works with SimpleXMLElement or DOMDocument, although XMLReader is a better fit for larger files.
For the file I used in my example (http://storage.googleapis.com/patents/grant_full_text/2014/ipg140107.zip from https://www.google.com/googlebooks/uspto-patents-grants-text.html), this is the overall element structure across all document trees in that sequence, with the number of occurrences of each element in parentheses:
\-us-patent-grant (473)
|-us-bibliographic-data-grant (473)
| |-publication-reference (473)
| | \-document-id (473)
| | |-country (473)
| | |-doc-number (473)
| | |-kind (473)
| | \-date (473)
| |-application-reference (473)
| | \-document-id (473)
| | |-country (473)
| | |-doc-number (473)
| | \-date (473)
| |-us-application-series-code (473)
| |-us-term-of-grant (470)
| | |-length-of-grant (450)
| | |-disclaimer (18)
| | | \-text (18)
| | \-us-term-extension (20)
| |-classification-locarno (450)
| | |-edition (450)
| | \-main-classification (450)
| |-classification-national (473)
| | |-country (473)
| | |-main-classification (473)
| | \-further-classification (143)
| |-invention-title (473)
| | \-i (12)
| |-us-references-cited (458)
| | \-us-citation (11000)
| | |-patcit (10265)
| | | \-document-id (10265)
| | | |-country (10265)
| | | |-doc-number (10265)
| | | |-kind (9884)
| | | |-name (9811)
| | | \-date (10264)
| | |-category (10999)
| | |-classification-national (6309)
| | | |-country (6309)
| | | \-main-classification (6309)
| | |-nplcit (735)
| | | \-othercit (735)
| | | |-sub (281)
| | | |-i (7)
| | | \-sup (1)
| | \-classification-cpc-text (1)
| |-number-of-claims (472)
| |-us-exemplary-claim (472)
| |-us-field-of-classification-search (472)
| | \-classification-national (8991)
| | |-country (8991)
| | |-main-classification (8991)
| | \-additional-info (1205)
| |-figures (472)
| | |-number-of-drawing-sheets (472)
| | \-number-of-figures (472)
| |-us-parties (472)
| | |-us-applicants (472)
| | | \-us-applicant (765)
| | | |-addressbook (765)
| | | | |-last-name (573)
| | | | |-first-name (573)
| | | | |-address (765)
| | | | | |-city (765)
| | | | | |-country (765)
| | | | | \-state (423)
| | | | \-orgname (192)
| | | \-residence (765)
| | | \-country (765)
| | |-inventors (472)
| | | \-inventor (969)
| | | \-addressbook (969)
| | | |-last-name (969)
| | | |-first-name (969)
| | | \-address (969)
| | | |-city (969)
| | | |-country (969)
| | | \-state (519)
| | \-agents (429)
| | \-agent (500)
| | \-addressbook (500)
| | |-orgname (361)
| | |-address (500)
| | | \-country (500)
| | |-last-name (139)
| | \-first-name (139)
| |-assignees (385)
| | \-assignee (391)
| | |-addressbook (390)
| | | |-orgname (386)
| | | |-role (390)
| | | |-address (390)
| | | | |-city (355)
| | | | |-country (390)
| | | | \-state (192)
| | | |-last-name (4)
| | | \-first-name (4)
| | |-orgname (1)
| | \-role (1)
| |-examiners (472)
| | |-primary-examiner (472)
| | | |-last-name (472)
| | | |-first-name (472)
| | | \-department (472)
| | \-assistant-examiner (65)
| | |-last-name (65)
| | \-first-name (65)
| |-us-related-documents (65)
| | |-continuation-in-part (16)
| | | \-relation (16)
| | | |-parent-doc (16)
| | | | |-document-id (16)
| | | | | |-country (16)
| | | | | |-doc-number (16)
| | | | | \-date (16)
| | | | |-parent-status (11)
| | | | \-parent-grant-document (5)
| | | | \-document-id (5)
| | | | |-country (5)
| | | | |-doc-number (5)
| | | | \-date (2)
| | | \-child-doc (16)
| | | \-document-id (16)
| | | |-country (16)
| | | \-doc-number (16)
| | |-continuation (21)
| | | \-relation (21)
| | | |-parent-doc (21)
| | | | |-document-id (21)
| | | | | |-country (21)
| | | | | |-doc-number (21)
| | | | | \-date (21)
| | | | |-parent-status (16)
| | | | \-parent-grant-document (5)
| | | | \-document-id (5)
| | | | |-country (5)
| | | | |-doc-number (5)
| | | | \-date (2)
| | | \-child-doc (21)
| | | \-document-id (21)
| | | |-country (21)
| | | \-doc-number (21)
| | |-division (32)
| | | \-relation (32)
| | | |-parent-doc (32)
| | | | |-document-id (32)
| | | | | |-country (32)
| | | | | |-doc-number (32)
| | | | | \-date (32)
| | | | |-parent-grant-document (24)
| | | | | \-document-id (24)
| | | | | |-country (24)
| | | | | |-doc-number (24)
| | | | | \-date (1)
| | | | \-parent-status (8)
| | | \-child-doc (32)
| | | \-document-id (32)
| | | |-country (32)
| | | \-doc-number (32)
| | \-related-publication (9)
| | \-document-id (9)
| | |-country (9)
| | |-doc-number (9)
| | |-kind (9)
| | \-date (9)
| |-priority-claims (140)
| | \-priority-claim (182)
| | |-country (182)
| | |-doc-number (182)
| | \-date (182)
| |-us-sir-flag (1)
| |-classifications-ipcr (23)
| | \-classification-ipcr (24)
| | |-ipc-version-indicator (24)
| | | \-date (24)
| | |-classification-level (24)
| | |-section (24)
| | |-class (24)
| | |-subclass (24)
| | |-main-group (24)
| | |-subgroup (24)
| | |-symbol-position (24)
| | |-classification-value (24)
| | |-action-date (24)
| | | \-date (24)
| | |-generating-office (24)
| | | \-country (24)
| | |-classification-status (24)
| | \-classification-data-source (24)
| |-us-botanic (21)
| | |-latin-name (21)
| | \-variety (21)
| \-classifications-cpc (1)
| \-main-cpc (1)
| \-classification-cpc (1)
| |-cpc-version-indicator (1)
| | \-date (1)
| |-section (1)
| |-class (1)
| |-subclass (1)
| |-main-group (1)
| |-subgroup (1)
| |-symbol-position (1)
| |-classification-value (1)
| |-action-date (1)
| | \-date (1)
| |-generating-office (1)
| | \-country (1)
| |-classification-status (1)
| |-classification-data-source (1)
| \-scheme-origination-code (1)
|-drawings (472)
| \-figure (3033)
| \-img (3033)
|-description (472)
| |-description-of-drawings (472)
| | |-p (3955)
| | | |-figref (4478)
| | | |-b (86)
| | | \-i (6)
| | \-heading (22)
| |-heading (162)
| \-p (340)
| |-figref (15)
| |-b (250)
| |-i (146)
| |-ul (96)
| | \-li (444)
| | |-ul (215)
| | | \-li (273)
| | | |-ul (199)
| | | | \-li (1192)
| | | | |-i (1219)
| | | | |-b (1)
| | | | |-sup (10)
| | | | \-sub (2)
| | | \-i (11)
| | |-sup (2)
| | \-i (26)
| |-tables (15)
| | \-table (15)
| | \-tgroup (49)
| | |-colspec (175)
| | |-thead (15)
| | | \-row (27)
| | | \-entry (51)
| | \-tbody (49)
| | \-row (291)
| | \-entry (997)
| | \-sup (28)
| \-sup (2)
|-us-claim-statement (472)
|-claims (472)
| \-claim (476)
| \-claim-text (476)
| |-figref (1)
| |-claim-text (5)
| |-claim-ref (4)
| \-i (15)
\-abstract (22)
\-p (22)
|-i (27)
\-ul (2)
\-li (2)
\-ul (2)
\-li (11)
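Counts like the ones in the tree above can be gathered with plain XMLReader. The following is only a minimal sketch, not the code that produced the listing: `countElementPaths()` is a hypothetical helper name, it handles a single well-formed document passed as a string, and it keys the counts by a slash-separated element path instead of drawing a tree.

```php
<?php
// Hypothetical helper (not part of XMLReaderIterator): counts how often each
// element path occurs in ONE well-formed XML document given as a string.
function countElementPaths(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $stack  = [];  // element names from the root down to the current element
    $counts = [];  // "root/child/..." => number of occurrences

    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue; // skip text, comments, end tags, ...
        }
        // Keep only the ancestors of the current element, then append it.
        $stack   = array_slice($stack, 0, $reader->depth);
        $stack[] = $reader->name;

        $path = implode('/', $stack);
        $counts[$path] = ($counts[$path] ?? 0) + 1;
    }
    $reader->close();

    return $counts;
}

// Small usage example with made-up input.
$counts = countElementPaths('<a><b><c/></b><b/><d><c/></d></a>');
print_r($counts);
```

For a whole sequence, the per-document arrays would have to be summed up key by key.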
That's not a valid XML file. It looks like two files in one, but even then it is invalid. Assuming those are two separate files, you could try "tidying" each of them first. With $xml being a string containing the XML contents:
$xml = tidy_repair_string($xml, array(
    'output-xml' => true,
    'input-xml' => true
));
Then you could use SimpleXML on it:
$xml = new SimpleXMLElement($xml);
I know where this XML file comes from, and I find it quite strange that Google would provide invalid XML (unless they are simply hosting a file they got from somewhere else). The suggestion in this question worked for me when parsing it: How to parse an xml file with multiple xml declaration using PHP? (A concatenation of several XML files)
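The general idea of handling such a concatenation without any library can be sketched as follows. This is not the code from the linked question, just a minimal illustration: `splitXmlSequence()` is a hypothetical helper, and it assumes that every document in the file begins with an XML declaration at the start of a line.

```php
<?php
// Hypothetical helper: yields each XML document of a concatenated file as a string.
// Assumption: every document starts with an "<?xml" declaration at the start of a line.
function splitXmlSequence($handle): Generator
{
    $buffer = '';
    while (($line = fgets($handle)) !== false) {
        // A new declaration marks the start of the next document.
        if (strncmp($line, '<?xml', 5) === 0 && trim($buffer) !== '') {
            yield $buffer;
            $buffer = '';
        }
        $buffer .= $line;
    }
    if (trim($buffer) !== '') {
        yield $buffer; // last document in the file
    }
}

// Demo with an in-memory stream and made-up data;
// for a real file, pass the handle returned by fopen() on its path.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "<?xml version=\"1.0\"?>\n<doc><id>1</id></doc>\n"
              . "<?xml version=\"1.0\"?>\n<doc><id>2</id></doc>\n");
rewind($handle);

$ids = [];
foreach (splitXmlSequence($handle) as $xml) {
    $doc   = new SimpleXMLElement($xml);
    $ids[] = (string) $doc->id;
}
fclose($handle);
```

Only one document is held in memory at a time, which matters for files of this size, but each single document is still parsed as a whole by SimpleXMLElement.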