I am trying to parse the Stack Overflow dump file (Posts.xml, 17 GB). It consists of <row> elements, one per post, with attributes such as PostTypeId and ParentId.
Using PHP's XMLReader seems to be the right thing to do.
Reason: because of your statement:
I have to 'group' each question with its answers. Basically: find a question (PostTypeId=1), find its answers via the ParentId of other rows, and store them in the DB.
What I understand is that you would like to build a database of questions and answers. Therefore, there is no reason to do the "grouping" on the XML level. Put all the relevant information into the database and do the grouping on the DB level - with DB commands (SQL ...).
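For illustration, here is a minimal sketch of such DB-level grouping, assuming hypothetical questions and answers tables (with answers.parent_id pointing at questions.id) and an existing PDO connection in $pdo - all names here are placeholders, not part of your setup:

$sql = "
    SELECT q.id   AS question_id,
           a.id   AS answer_id,
           a.body AS answer_body
      FROM questions q
      LEFT JOIN answers a ON a.parent_id = q.id
     ORDER BY q.id, a.id
";

foreach ($pdo->query($sql) as $row) {
    // one line per question/answer pair; aggregate in PHP or with GROUP_CONCAT()
    printf("Q%s -> A%s\n", $row['question_id'], $row['answer_id'] ?: '-');
}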
What you have to do is use something like the "target parser" method, e.g. as in "High-performance XML parsing in Python with lxml" (even though it is for Python, it's a good start). This should be possible with XMLReader.
You also wrote:
I considered XMLReader, but as I see it, with XMLReader the program would read through the file a whole lot of times (find a question, look for its answers, repeat many times) and hence it is not viable. Am I wrong?
Yes, you are wrong. With XMLReader you decide yourself how often you want to traverse the file (you normally do it once). In your case I see no reason why you should not be able to insert the data 1:1 for each <row>
element. You can decide per attribute which database table you would like to insert into.
I normally suggest a set of iterators that makes traversing with XMLReader easier. It is called XMLReaderIterator and allows you to foreach over the XMLReader, so the code is often easier to read and write:
// requires the XMLReaderIterator library, which provides XMLElementIterator and XMLReaderNode
$xmlFile = 'Posts.xml';

$reader = new XMLReader();
$reader->open($xmlFile);

/* @var $posts XMLReaderNode[] - iterate over all <posts><row> elements */
$posts = new XMLElementIterator($reader, 'row');

foreach ($posts as $post)
{
    // a row with a ParentId attribute is an answer, otherwise it is a question
    $isAnswerInsteadOfQuestion = (bool)$post->getAttribute('ParentId');

    $importer = $isAnswerInsteadOfQuestion
        ? $importerAnswers
        : $importerQuestions;

    $importer->importRowNode($post);
}
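The two importers used above ($importerQuestions and $importerAnswers) are not defined in the snippet; the following is only a rough sketch of what such an importer could look like, assuming a PDO connection in $pdo and hypothetical questions/answers tables - table and column names are placeholders:

class RowImporter
{
    private $insert;

    public function __construct(PDO $pdo, $table)
    {
        // hypothetical schema: id, parent_id, body
        $this->insert = $pdo->prepare(
            "INSERT INTO {$table} (id, parent_id, body) VALUES (?, ?, ?)"
        );
    }

    public function importRowNode(XMLReaderNode $row)
    {
        $this->insert->execute(array(
            $row->getAttribute('Id'),
            $row->getAttribute('ParentId'), // empty for questions
            $row->getAttribute('Body'),
        ));
    }
}

$importerQuestions = new RowImporter($pdo, 'questions');
$importerAnswers   = new RowImporter($pdo, 'answers');

Both importers would of course have to be created before the foreach loop above runs.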
If you are concerned about the order (e.g. you might fear that some answers' parent questions are not yet available when the answers are imported), I would take care of that inside the importer layer, not inside the traversal.
Depending on whether that happens never, rarely, or often, I would use a different strategy. E.g. if it never happens, I would insert directly into database tables with foreign key constraints activated. If it happens often, I would create one insert transaction for the whole import in which the key constraints are lifted and re-activated at the end.
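A rough sketch of that second strategy, assuming MySQL (SET FOREIGN_KEY_CHECKS is MySQL-specific; other databases offer deferrable constraints instead) and again a PDO connection in $pdo:

$pdo->exec('SET FOREIGN_KEY_CHECKS = 0'); // lift the key constraints
$pdo->beginTransaction();

try {
    foreach ($posts as $post) {
        // ... import each row as shown above ...
    }
    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
} finally {
    $pdo->exec('SET FOREIGN_KEY_CHECKS = 1'); // re-activate the constraints
}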
Because the way you are processing this large file isn't sequential but requires direct access, I think the only viable option is to load the data into an XML database.