I am writing a Master\'s thesis - NLP system. I have one component - extractor.
It is extracting a plain text from PDF files. There are a few PDF files that can not be
PDF is not a text document. It's more of a vector graphic format that sometimes can contain text. So there are some documents from which you can't extract text unless you are willing to do OCR. That's just the way it is.