How to efficiently parse large bz2 xml file in C

徘徊边缘 submitted on 2019-12-12 04:54:08

Question


What I want to do:

  • download OSM (OpenStreetMap) data at regular intervals (or update it using diffs)
  • parse that data, which is bzip2-compressed XML, and store the parts relevant to me in my database, as memory- and CPU-efficiently as possible (runtime is not that much of an issue)

What I have:

  • xxx.osm.bz2 file (bzip2 compressed xml), compressed 29GB, uncompressed about 400GB
  • software is running on debian linux, no vm or anything involved

Specific questions, to elaborate what my problem is:

  • I've found bzip2 file stream libraries for C++, but not how to deal with this in C (sequentially decompressing the data and parsing it at the same time). How should I go about this?
  • libxml2 and all the other C-usable XML libraries I've found parse the whole XML and let you work on it afterwards, but I don't want to hold several GB of XML in memory just to filter it down sequentially. Am I wrong about libxml2, and does it actually have such functionality? Or is there a different library I can use?
  • Maybe there is even a higher-level library for this that is already specialized for OSM data? I couldn't find anything like that, and the tools they provide don't really help (I don't plan on filtering the data first with Osmosium or the like and then filtering it again with my code; that would be extremely inefficient, I think)

I hope I have been able to state my question clearly, and I would be very thankful if someone could at least point me in the right direction(s).

Thank you very much.


Update: Right after posting this I found out that libxml2 actually provides xmlTextReader from version 2.5.0 onwards, which partly addresses my question - but only partly, as I still don't know how to combine that with sequential bz2 file reading (and am open to totally different solutions still of course).


Update 2: The solution has to work from a permanently running process and should be (as stated in point 2) memory- and CPU-efficient, so, among other things, the data should not be copied dozens of times (in memory or on disk).


Answer 1:


You don't need to do bzip2 decompression in your program; just read uncompressed XML from stdin and parse it with libxml2 (or equivalent). Then call your program like this, and enjoy the beauty of Unix pipes:

bzip2 -d < planet.osm.bzip2 | yourtool
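
Since the question asks for a permanently running process (Update 2), the same pipe can also be opened from inside the program rather than from the shell. The following is a minimal sketch of that idea, assuming POSIX `popen` is available and `bzip2` is on the PATH. The names `chunk_handler`, `stream_chunks`, `stream_command` and `count_open_angle` are hypothetical helpers made up for illustration, and counting `<` bytes is only a stand-in for real XML parsing.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen/pclose */
#include <stdio.h>
#include <stddef.h>

/* Hypothetical callback type: receives each chunk of decompressed bytes. */
typedef void (*chunk_handler)(const char *data, size_t len, void *user);

/* Read `in` until EOF in fixed-size chunks and pass each chunk to `handle`.
 * Returns the total number of bytes read, or -1 on a read error. */
long long stream_chunks(FILE *in, chunk_handler handle, void *user)
{
    char buf[64 * 1024];  /* the only buffer held in memory at any time */
    size_t n;
    long long total = 0;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        handle(buf, n, user);
        total += (long long)n;
    }
    return ferror(in) ? -1 : total;
}

/* Run `cmd` through the shell and stream its standard output.
 * Returns the total number of bytes read, or -1 if the command could not
 * be started, a read failed, or the command exited with a non-zero status. */
long long stream_command(const char *cmd, chunk_handler handle, void *user)
{
    FILE *p = popen(cmd, "r");
    if (p == NULL)
        return -1;

    long long total = stream_chunks(p, handle, user);
    int status = pclose(p);  /* waits for the child process to finish */

    return status == 0 ? total : -1;
}

/* Illustrative handler: counts '<' bytes; `user` must point to a size_t. */
void count_open_angle(const char *data, size_t len, void *user)
{
    size_t *count = user;
    for (size_t i = 0; i < len; i++)
        if (data[i] == '<')
            (*count)++;
}
```

With this, `stream_chunks(stdin, count_open_angle, &count)` covers the `bzip2 -d < planet.osm.bzip2 | yourtool` form, while `stream_command("bzip2 -dc planet.osm.bzip2", count_open_angle, &count)` starts the decompressor from the running process. A real handler would probably pass each chunk on to a streaming XML parser, for example libxml2's push interface (`xmlParseChunk`), instead of inspecting the bytes itself.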


Source: https://stackoverflow.com/questions/18469714/how-to-efficiently-parse-large-bz2-xml-file-in-c
