Solution to convert PDFs, DOCs, and DOCXs into a textual format with Python

I am developing a full-text search engine for indexing popular binary formats. I know that there are hundreds of such questions (and solutions) already, but I found it toug

4 Answers

深忆病人

2021-01-16 17:58

If you can use OpenOffice on the server side, you can use unoconv, which converts between any document formats supported by OpenOffice.
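As a rough sketch of how that could look from Python, the snippet below shells out to unoconv and reads the converted text from standard output. It assumes unoconv is on the PATH and that its `-f` and `--stdout` options behave as documented; the function names `build_unoconv_command` and `convert_to_text`, the timeout, and the UTF-8 decoding are illustrative choices, not part of the original answer.

```python
import shutil
import subprocess


def build_unoconv_command(path, fmt="txt"):
    """Return the argv list that asks unoconv to write `path` as `fmt` to stdout."""
    return ["unoconv", "-f", fmt, "--stdout", path]


def convert_to_text(path, timeout=120):
    """Convert a document to plain text by shelling out to unoconv."""
    if shutil.which("unoconv") is None:
        raise RuntimeError("unoconv is not installed or not on PATH")
    result = subprocess.run(
        build_unoconv_command(path),
        capture_output=True,
        check=True,  # raise CalledProcessError if the conversion fails
        timeout=timeout,
    )
    # Assumption: the text export is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the command as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.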

心在旅途

2021-01-16 17:59

textract wraps the standard extraction tool for each kind of file behind a single Python interface.

https://github.com/deanmalmgren/textract
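A minimal sketch of using it for indexing might look like the following. It assumes the textract package is installed and relies on `textract.process`, which returns the extracted text as bytes; the helper names `to_index_text` and `extract_with_textract` are hypothetical, and the whitespace collapsing is just one possible way to prepare text for a search index.

```python
def to_index_text(raw, encoding="utf-8"):
    """Decode extracted bytes and collapse runs of whitespace into single spaces."""
    text = raw.decode(encoding, errors="replace")
    return " ".join(text.split())


def extract_with_textract(path):
    """Extract text from `path`, letting textract choose a parser by file extension."""
    import textract  # third-party; imported lazily so the helper above works without it

    return to_index_text(textract.process(path))
```

Note that textract still depends on the underlying system tools for some formats, so those need to be installed on the host as well.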

猫巷女王i

2021-01-16 18:10

One possible solution is to use Google Docs to extract the text from binary .doc files: you upload the document to Google Docs and then download its text contents. It is a fairly slow process, but it is the only "pure Python" solution I know of, since it requires no external tools, only network access. An external tool such as catdoc or antiword is a much better solution if you are allowed to install it on your host.

忘了有多久

2021-01-16 18:23
- For PDFs, I recommend PDFMiner.
- For .docx files, try the docx module (I have not used it myself).
- I am not aware of any pure-Python module that can read .doc files.
- There are command-line tools that extract text from .doc files: antiword and catdoc (and probably others). If these packages are installed on your shared host, you can use subprocess to shell out to them. Both are available on Windows via Cygwin.
- Apache POI is a Java library that can extract text from Office documents. If your shared host has Java installed, you could write a bit of Java (or Jython) code and execute it using subprocess.
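The first four suggestions can be combined into one dispatcher, sketched below under several assumptions: PDFMiner is installed as pdfminer.six (which provides `pdfminer.high_level.extract_text`), "the docx module" is taken to mean python-docx, and antiword or catdoc is on the PATH. All function names and the `EXTRACTORS` table are illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def pick_doc_tool(which=shutil.which):
    """Return the first installed .doc extractor (antiword, then catdoc), or None."""
    for tool in ("antiword", "catdoc"):
        if which(tool):
            return tool
    return None


def extract_doc(path):
    """Extract text from a binary .doc file with an external command-line tool."""
    tool = pick_doc_tool()
    if tool is None:
        raise RuntimeError("neither antiword nor catdoc is installed")
    result = subprocess.run([tool, str(path)], capture_output=True, check=True)
    # Assumption: the tool emits UTF-8; this depends on its configuration.
    return result.stdout.decode("utf-8", errors="replace")


def extract_pdf(path):
    """Extract text from a PDF with pdfminer.six."""
    from pdfminer.high_level import extract_text  # third-party, imported lazily

    return extract_text(str(path))


def extract_docx(path):
    """Extract paragraph text from a .docx file with python-docx (tables are skipped)."""
    from docx import Document  # third-party, imported lazily

    return "\n".join(p.text for p in Document(str(path)).paragraphs)


# Maps a lowercase file extension to the function that handles it.
EXTRACTORS = {".pdf": extract_pdf, ".docx": extract_docx, ".doc": extract_doc}


def extract_text_from(path):
    """Pick an extractor by file extension and return the document's text."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return extractor(path)
```

Keeping the third-party imports inside the functions means a missing library only breaks the format that needs it, which can be convenient on a shared host where not everything can be installed.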