Word, PDF document parsing - Hadoop/in-general Java
Question

My objective is to load MS Word, PDF, and similar documents onto HDFS, extract certain 'content' from each document, and use it for further analysis. Instead of starting out by fiddling with InputFormat and the like, I thought that a library like Tika could be used and incorporated into MapReduce (MR).

The partial content of one of the Word documents is as follows:

6. Statement of Strategy
We have 4 strategic interventions that will deliver a competitive advantage.
Innovate upstream and downstream
1. Biopulp. We will execute