Python — Parsing files (docx, pdf and odt) and converting the content into my data model
I'm writing an import/export tool for docx, pdf, and odt files that each contain a book. We already have a tool for the .epub format, and we'd like to extend that functionality so users of the site have more flexibility. So far I've looked at PDFMiner for the PDFs, and I've found that docx is based on the Office Open XML format: a .docx file is a ZIP archive, and word/document.xml inside it is essentially the file containing the main text, which I can parse with lxml. My question is this: I'm hoping to parse the contents of these files and, from that content, extract things like chapter names, images
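
To make the docx part concrete, here is a minimal sketch of the kind of extraction I have in mind. It uses only the standard library (`zipfile` and `xml.etree.ElementTree`; lxml offers a largely compatible API and could probably be swapped in). The function names `iter_paragraphs`, `chapter_titles`, and `list_images` are hypothetical. The code also makes two assumptions that may not hold for every document: that chapter titles are paragraphs whose style ID is `Heading1`, and that embedded images sit under `word/media/` inside the archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def iter_paragraphs(docx_file):
    """Yield (style_id, text) for every paragraph in the main document part."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for para in root.iter(f"{{{W_NS}}}p"):
        # The paragraph style, if any, lives at w:pPr/w:pStyle/@w:val.
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        # Visible text is split across w:t elements inside the runs.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        yield style_id, text


def chapter_titles(docx_file, heading_style="Heading1"):
    """Return the text of paragraphs styled as top-level headings.

    Assumption: chapters use the style ID "Heading1"; localized or custom
    templates may use a different ID.
    """
    return [
        text
        for style_id, text in iter_paragraphs(docx_file)
        if style_id == heading_style and text.strip()
    ]


def list_images(docx_file):
    """Return archive member names under word/media/ (assumed image location)."""
    with zipfile.ZipFile(docx_file) as archive:
        return [n for n in archive.namelist() if n.startswith("word/media/")]
```

Calling `chapter_titles("book.docx")` (with `book.docx` standing in for any real file) would return the heading texts in document order, which could then be mapped onto chapters in the data model. This covers docx only; pdf and odt would need their own readers that produce the same kind of (style, text) stream.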