Copy+pasting text from PDF results in garbage

无人及你 2021-02-20 00:37

I am writing a Master's thesis on an NLP system. One of its components is an extractor.

It extracts plain text from PDF files. There are a few PDF files that cannot be

7 answers
  •  清歌不尽
    2021-02-20 01:38

    PDF is not a text document. It's more of a vector graphics format that can sometimes contain text. So there are some documents from which you can't extract text unless you are willing to do OCR. That's just the way it is.
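If it helps, here is a rough sketch of how an extractor could decide when to give up on the embedded text and send a file to OCR instead. It only looks at the string your extractor already produced. The names `suspicious_ratio` and `needs_ocr` and the 0.2 threshold are my own assumptions for illustration, not part of any PDF library, and the check will not catch garbage that happens to consist of ordinary letters.

```python
import unicodedata

# Unicode categories that rarely appear in real prose: control characters,
# private-use code points, unassigned code points, and lone surrogates.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}
REPLACEMENT_CHAR = "\ufffd"  # commonly emitted when a glyph cannot be decoded


def suspicious_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that look like debris."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # No visible text at all, e.g. a scanned page with no text layer.
        return 1.0
    bad = sum(
        1
        for ch in visible
        if ch == REPLACEMENT_CHAR
        or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(visible)


def needs_ocr(text: str, threshold: float = 0.2) -> bool:
    """Heuristic: True when the extracted text is empty or mostly debris.

    The threshold is an arbitrary starting point and should be tuned
    against your own documents.
    """
    return suspicious_ratio(text) > threshold
```

You would call `needs_ocr(extracted_text)` per page or per document and route the flagged ones to whatever OCR step you choose.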
