I am developing a full-text search engine for indexing popular binary formats. I know that there are hundreds of such questions (and solutions) already, but I found it toug
If OpenOffice is available on the server, you can use unoconv, which converts between any document formats supported by OpenOffice.
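As a rough sketch, unoconv can be called from Python through subprocess. This assumes the unoconv executable is on the PATH and that its -f (output format) and --stdout options are available in the installed version. The helper names build_unoconv_command and convert_to_text, and the UTF-8 output encoding, are assumptions made for illustration.

```python
import subprocess


def build_unoconv_command(path, fmt="txt"):
    """Build the argument list that converts `path` to `fmt` on stdout."""
    return ["unoconv", "-f", fmt, "--stdout", path]


def convert_to_text(path, encoding="utf-8"):
    """Run unoconv and return the converted document as a string.

    Assumes unoconv and OpenOffice are installed on the host, and that
    the text output is UTF-8 (undecodable bytes are replaced).
    """
    result = subprocess.run(
        build_unoconv_command(path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode(encoding, errors="replace")
```

Calling convert_to_text("report.doc") would then return the plain text, or raise if unoconv is missing or the conversion fails.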
textract wraps the usual extraction tool for each kind of file behind a single Python interface.
https://github.com/deanmalmgren/textract
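A minimal sketch of using textract: its process function takes a file path and returns the extracted text as bytes. The wrapper name extract_text and its process parameter are hypothetical; the parameter exists only so the wrapper can be tried without textract installed.

```python
def extract_text(path, process=None, encoding="utf-8"):
    """Return the text content of `path` as a string.

    `process` defaults to textract.process, which returns bytes.
    The UTF-8 decoding is an assumption; undecodable bytes are replaced.
    """
    if process is None:
        import textract  # third-party package: pip install textract
        process = textract.process
    return process(path).decode(encoding, errors="replace")
```

With textract installed, extract_text("report.doc") picks the extractor based on the file extension.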
One possible solution is to use Google Docs to extract the text contents from binary .doc files. You upload the document to Google Docs and then download the text contents. It is a fairly slow process, but it is the only "pure Python" solution I know of, since it doesn't require any external tools except for network access. An external tool such as catdoc or antiword is a much better solution if you are allowed to install it on your host.
There are tools for extracting text from .doc files: antiword and catdoc (and probably others). If the packages are installed on your shared host, you could use subprocess to shell out to these tools. They are also available on Windows via Cygwin.
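Shelling out to antiword or catdoc with subprocess might look like the sketch below. Both tools print the extracted text to standard output when given a file name. The function names find_tool and extract_doc_text are hypothetical, and the output encoding depends on how the tool is configured, so UTF-8 is only an assumption here.

```python
import shutil
import subprocess


def find_tool(candidates=("antiword", "catdoc")):
    """Return the first candidate executable found on PATH, or None."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None


def extract_doc_text(path, candidates=("antiword", "catdoc"), encoding="utf-8"):
    """Extract text from a .doc file with whichever tool is installed."""
    tool = find_tool(candidates)
    if tool is None:
        raise RuntimeError("none of these tools is installed: " + ", ".join(candidates))
    # Both tools write the extracted text to stdout.
    result = subprocess.run([tool, path], capture_output=True, check=True)
    return result.stdout.decode(encoding, errors="replace")
```

Checking for the tool first gives a clear error on hosts where neither package is installed, instead of a less obvious failure from subprocess.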