I have a 15 GB XML file that I would like to split. It has approximately 300 million lines. None of its top-level nodes are interdependent. Is there any way to do this?
I think you'll have to split it manually unless you're interested in doing it programmatically. Here's a sample that does that, though it doesn't mention the maximum size of XML file it can handle. If you do it manually, the first problem you'll hit is simply how to open a file that large.
I would recommend a very simple text editor - something like Vim. When handling such large files, it is always useful to turn off all forms of syntax highlighting and/or folding.
Other options worth considering:
EditPadPro - I've never tried it with anything this size, but if it's anything like other JGSoft products, it should handle it with ease. Remember to turn off syntax highlighting.
VEdit - I've used this with 1 GB files and it handled them as if they were nothing.
EmEditor
Not an XML tool, but UltraEdit could probably help. I've used it with 2 GB files and it didn't mind at all; just make sure you turn off the auto-backup feature.
Perhaps this question is still relevant, and I believe this can help somebody. There is an XML editor, XiMpLe, which includes a tool for splitting big files; only the fragment size is required. It also has the reverse functionality to join XML files back together(!). It's free for non-commercial use, and the commercial license isn't expensive either. No installation is required. It worked very well for me (I had a 5 GB file).
In what way do you need to split it? It's pretty easy to write code using XmlReader.ReadSubtree. It returns a new XmlReader instance positioned on the current element and all of its child elements. So: move to the first child of the root, call ReadSubtree, write out all of those nodes, call Read() on the original reader, and loop until done.
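The same streaming idea works outside .NET as well. As a hedged sketch (assuming the file has a single root element whose direct children are independent records, no XML namespaces, and no attributes on the root that need preserving; the file name "big.xml" and the piece size are placeholders), Python's xml.etree.ElementTree.iterparse can pull one record at a time and write pieces with a flat memory footprint:

```python
import xml.etree.ElementTree as ET

def split_xml(path, records_per_file=1000000):
    """Stream `path` and write its direct child elements into files
    piece1.xml, piece2.xml, ... with up to `records_per_file` each."""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)              # first event is the "start" of the root
    depth, count, file_count, out = 0, 0, 0, None
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 0:                   # not the end of a direct child of root
            continue
        if out is None:                  # start a new piece with a copy of the root tag
            file_count += 1
            out = open("piece%d.xml" % file_count, "wb")
            out.write(b"<" + root.tag.encode() + b">")
        out.write(ET.tostring(elem))     # serialize one complete record
        count += 1
        root.clear()                     # drop processed children to keep memory flat
        if count == records_per_file:
            out.write(b"</" + root.tag.encode() + b">")
            out.close()
            out, count = None, 0
    if out is not None:                  # close a final, partially filled piece
        out.write(b"</" + root.tag.encode() + b">")
        out.close()
    return file_count
```

The depth counter is what distinguishes "end of a record" from "end of some nested element", which mirrors the ReadSubtree approach: each direct child of the root is treated as one indivisible unit.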
Here is a low-memory-footprint script that does it in the free firstobject XML editor (foxe) using CMarkup file mode. I'm not sure what you mean by no interdependent top nodes, or by tag checking, but assuming that under the root element you have millions of top-level elements (object properties or rows) that each need to be kept together as a unit, and you want, say, 1 million per output file, you could do this:
split_xml_15GB()
{
    int nObjectCount = 0, nFileCount = 0;
    CMarkup xmlInput, xmlOutput;
    xmlInput.Open( "15GB.xml", MDF_READFILE );
    xmlInput.FindElem(); // root
    str sRootTag = xmlInput.GetTagName();
    xmlInput.IntoElem();
    while ( xmlInput.FindElem() )
    {
        if ( nObjectCount == 0 )
        {
            ++nFileCount;
            xmlOutput.Open( "piece" + nFileCount + ".xml", MDF_WRITEFILE );
            xmlOutput.AddElem( sRootTag );
            xmlOutput.IntoElem();
        }
        xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
        ++nObjectCount;
        if ( nObjectCount == 1000000 )
        {
            xmlOutput.Close();
            nObjectCount = 0;
        }
    }
    if ( nObjectCount )
        xmlOutput.Close();
    xmlInput.Close();
    return nFileCount;
}
I posted a YouTube video and an article about this here:
http://www.firstobject.com/xml-splitter-script-video.htm
QXMLEdit has a dedicated function for that: I used it successfully on a Wikipedia dump. The ~2.7 GiB file became a bunch of ~1,400,000 files (one per page). It even allows you to dispatch them into subfolders.