Parsing PDF files in Hadoop MapReduce

Submitted by Deadly on 2019-12-04 07:27:54

Processing PDF files in Hadoop can be done by extending the FileInputFormat class. Call the subclass WholeFileInputFormat. In it, override isSplitable() to return false so that a file is never divided across splits, and override getRecordReader() (createRecordReader() in the newer org.apache.hadoop.mapreduce API) to return a reader that emits the whole file as a single record. Each PDF then arrives as its own input split, which the mapper can parse to extract the text. This link gives a clear example of how to extend FileInputFormat.

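It depends on your splits. I think (could be wrong) that you'll need each PDF as a whole in order to parse it, because a PDF is a binary format and a fragment cut at an arbitrary split boundary is generally not parseable on its own. There are Java libraries that do the parsing; Apache PDFBox is one example.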

Given that, you'll need an approach that gives you the whole file at the point where you parse it. Assuming you want to parse in the mapper, you need a record reader that hands whole files to the mapper. You could write your own reader, or there may already be one out there. One option is a reader that scans the directory of PDFs and passes each file's name to the mapper as the key and its contents as the value.
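
To make the key/value contract concrete, here is a minimal sketch in plain Java, with no Hadoop dependency, of what such a reader would hand to the mapper: one record per PDF, keyed by file name, with the complete file bytes as the value. The class name WholeFileScanner and the method readWholeFiles are hypothetical, chosen only for illustration. A real job would do the same thing inside a RecordReader and read from HDFS rather than the local file system.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: a local-file-system stand-in for a whole-file reader.
class WholeFileScanner {

    /**
     * Scans the files directly under dir whose names match "*.pdf" and
     * returns one record per file: file name (key) -> whole contents (value).
     * Each file is read in full, so it is never split.
     */
    static Map<String, byte[]> readWholeFiles(Path dir) throws IOException {
        // TreeMap keeps the records sorted by file name.
        Map<String, byte[]> records = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.pdf")) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    records.put(file.getFileName().toString(), Files.readAllBytes(file));
                }
            }
        }
        return records;
    }
}
```

Each entry of the returned map corresponds to one call of the mapper, which would pass the byte array to a PDF library to extract the text. Reading whole files into memory assumes that each PDF fits comfortably in the mapper's heap.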
