I have a 15 GB XML file that I want to split. It has approximately 300 million lines in it, and none of its top-level nodes are interdependent. Is there any way to split it into smaller files?
The open source library comma has several tools to find data in very large XML files and to split those files into smaller ones.
https://github.com/acfr/comma/wiki/XML-Utilities
The tools were built with the expat SAX parser, so they do not fill memory with a DOM tree the way xmlstarlet and saxon do.
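For comparison, the same streaming idea can be sketched in plain Python with xml.etree.ElementTree.iterparse, which also avoids building a full DOM. This is only a rough sketch, not how the comma tools work internally: the record tag name, the output file names, and the assumption that records sit directly under the root are placeholders you would adapt.

import xml.etree.ElementTree as ET

def split_records(source_path, record_tag, chunk_size=50000):
    # Stream the document; only the current record is kept in memory.
    context = ET.iterparse(source_path, events=("start", "end"))
    _, root = next(context)   # the root element arrives with the first "start" event
    chunk, chunk_count = [], 0
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            chunk.append(ET.tostring(elem, encoding="unicode"))
            root.clear()      # drop finished records (assumes they are direct children of the root)
            if len(chunk) == chunk_size:
                chunk_count += 1
                write_chunk(chunk, chunk_count)
                chunk = []
    if chunk:                 # leftover records after the loop
        write_chunk(chunk, chunk_count + 1)

def write_chunk(records, index):
    # Wrap each chunk in a synthetic root so every output file is well-formed XML.
    with open("chunk_%d.xml" % index, "w", encoding="utf-8") as out:
        out.write("<?xml version='1.0' encoding='UTF-8'?>\n<root>\n")
        out.writelines(records)
        out.write("</root>\n")

# Example: split_records("huge.xml", "record")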
I used this for splitting the Yahoo Q&A dataset. It counts closing tags of the record element and starts a new output file, wrapped in a synthetic <endTag> root, every 50,000 records:
count = 0
file_count = 1
# Every chunk gets an XML declaration and a synthetic <endTag> root so it is
# well formed on its own. (The source file's own prolog and root tags are
# still copied through with the lines and may need stripping afterwards.)
header = "<?xml version='1.0' encoding='UTF-8'?>\n<endTag>"
current_file = header

with open('filepath') as f:
    for line in f:
        current_file = current_file + line
        if "</your tag to split>" in line:
            count = count + 1
            if count == 50000:
                # 50,000 records collected: close the root and flush the chunk.
                current_file = current_file + "</endTag>"
                with open('filepath/Split/file_' + str(file_count) + '.xml', 'w') as split:
                    split.write(current_file)
                file_count = file_count + 1
                current_file = header
                count = 0

# Write whatever records are left over after the last full chunk.
current_file = current_file + "</endTag>"
with open('filepath/Split/file_' + str(file_count) + '.xml', 'w') as split:
    split.write(current_file)
I used the XmlSplit Wizard tool. It works nicely, and you can specify the split method: by element, rows, number of files, or file size. The only problem is that I had to buy it for $99, since the trial version won't split all of the data, only an odd number of the divided files. I was able to split a 70 GB file!
XmlSplit - A Command-line Tool That Splits Large XML Files
xml_split - split huge XML documents into smaller chunks
Split that XML by bhayanakmaut (no source code, and I could not get this one working)
A similar question: How do I split a large xml file?