Word, PDF document parsing - Hadoop/in-general Java

Question

My objective is to load MS Word, PDF, and similar documents onto HDFS, extract certain content from each document, and use it for further analysis.

Instead of starting to fiddle with InputFormat and the like, I thought a library such as Tika could be incorporated into the MapReduce jobs.

The partial content of one of the Word documents is as follows:

6.  Statement of Strategy 
We have 4 strategic interventions that will deliver a competitive advantage.
 Innovate upstream and downstream
1.  Biopulp.
We will execute Biopulp initially in corrugate for Haircare in China. This will validate the operational process of enzymatically converting straw into pulp and paper. Then we will establish a Joint Development with Family care to extend the sources of value. And finally re-apply globally including across for other sectors and customers to maximize the value generation.
2.  Mandrel Case Forming
We will extend the use of MCF technology within WE for the businesses that already use MCF cases. (i.e. F&HC). In parallel we will establish this as the global standard for HDL’s and HDW. We will seek additional suppliers to execute this technology in other regions (e.g. NA and Asia) to increase capacity and reduce cost of execution.

Supplier Strategy for Competition
3.  Competition in practice 
We have used negotiation as the primary process for establishing prices and supply agreements. We will more effectively create and utilize competition by using enquiries for each of our plants. This may require that we trigger new investments and qualify additional facilities, but with the consolidation going on in the industry it should not cause a net increase in suppliers.
4.  Cost input pass-through
Our current agreements in general use paper as the primary driver of our feedstock clauses. If paper prices go up then our suppliers are happy and we are not. If paper prices go down then we are happy and our suppliers’ are not. This means that almost 100% of the time one party is not happy. If we change our pass-through clauses to be driven by our suppliers’ input costs, then we align ourselves with their interests which will generate less transaction cost and increase collaboration

Optimum Sourcing Principles for Corrugates

<A TABLE HERE>

7.  Tactical Planning and Execution

<A TABLE HERE>

Suppose I wish to do the following:

Extract the table under 'Optimum Sourcing Principles for Corrugates'
Extract the bullet points under 'Innovate upstream and downstream'

While this may seem crazy and absurd, I was wondering whether Tika (I tried it but got only the metadata and the file content as a string), Lucene/Solr, POI, etc. can help to parse and 'understand' Word and PDF documents and allow a block of data to be extracted based on some search string (or regex).
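As a rough illustration of extracting a block by search string or regex, the sketch below works on plain text lines, such as the string Tika returns. It rests on assumptions that are not part of the question: the section is taken to end at the first blank line, the names SectionExtractor, between and numberedItems are made up for illustration, and the "1.", "2." numbering must actually be present in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionExtractor {

    /** Returns the lines strictly between the first line matching start and the next line matching end. */
    static List<String> between(List<String> lines, Pattern start, Pattern end) {
        List<String> out = new ArrayList<>();
        boolean inside = false;
        for (String line : lines) {
            if (!inside) {
                inside = start.matcher(line).find();
            } else if (end.matcher(line).find()) {
                break;
            } else {
                out.add(line);
            }
        }
        return out;
    }

    /** Keeps only numbered items such as "1.  Biopulp." and strips the number. */
    static List<String> numberedItems(List<String> lines) {
        Pattern item = Pattern.compile("^\\s*\\d+\\.\\s+(.*\\S)\\s*$");
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = item.matcher(line);
            if (m.matches()) {
                out.add(m.group(1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                " Innovate upstream and downstream",
                "1.  Biopulp.",
                "We will execute Biopulp initially in corrugate for Haircare in China.",
                "2.  Mandrel Case Forming",
                "We will extend the use of MCF technology within WE.",
                "",
                "Supplier Strategy for Competition",
                "3.  Competition in practice ");
        // Assumption: a blank line marks the end of the section.
        List<String> section = between(lines,
                Pattern.compile("Innovate upstream and downstream"),
                Pattern.compile("^\\s*$"));
        System.out.println(numberedItems(section)); // prints [Biopulp., Mandrel Case Forming]
    }
}
```

This kind of line matching is fragile: it only works while headings and numbering survive the conversion to text.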

For example, I used the Tika parser and got the following output, which is too naive to help with content extraction ('A TABLE HERE', i.e. a table in the Word document, is interpreted as paragraphs!):

6.  Statement of Strategy 
We have 4 strategic interventions that will deliver a competitive advantage to P&G.
 Innovate upstream and downstream
Biopulp.
We will execute Biopulp initially in corrugate for Haircare in China. This will validate the operational process of enzymatically converting straw into pulp and paper. Then we will establish a Joint Development with Family care to extend the sources of value. And finally re-apply globally including across for other sectors and customers to maximize the value generation.
Mandrel Case Forming
We will extend the use of MCF technology within WE for the businesses that already use MCF cases. (i.e. F&HC). In parallel we will establish this as the global standard for HDL’s and HDW. We will seek additional suppliers to execute this technology in other regions (e.g. NA and Asia) to increase capacity and reduce cost of execution.

Supplier Strategy for Competition
Competition in practice 
We have used negotiation as the primary process for establishing prices and supply agreements. We will more effectively create and utilize competition by using enquiries for each of our plants. This may require that we trigger new investments and qualify additional facilities, but with the consolidation going on in the industry it should not cause a net increase in suppliers.
Cost input pass-through
Our current agreements in general use paper as the primary driver of our feedstock clauses. If paper prices go up then our suppliers are happy and we are not. If paper prices go down then we are happy and our suppliers’ are not. This means that almost 100% of the time one party is not happy. If we change our pass-through clauses to be driven by our suppliers’ input costs, then we align ourselves with their interests which will generate less transaction cost and increase collaboration.





Optimum Sourcing Principles for Corrugates
    principle
    optimum
    rationale

    Number of  suppliers
    2-3 per plant
>80% with 5 per region/country cluster
    Competition is local
Scale the spend with central accounts

    Global/local suppliers
    Regional is sufficient
    No advantage to global as scale is regional only and there is limited IP to transfer.
Larger regional suppliers can consolidate local single-plant suppliers to make it efficient for us. They also bring capital for machinery upgrading and scale for paper source.

    Approach to suppliers
    collaborative
    Competition to drive price is clear; preferential and value-add deals require collaboration

    Make v buy
    buy
    Multiple suppliers; commoditised technologies

    Distance of suppliers to plant
    Max 300km for boxes (300miles in NA); up to 1000km for paper reels.
Can be longer for specialist print grades or to countries with no high quality local supply
    Economic max as high volume product (air in the fluting)
Need recent built paper machines to produce paper strong enough to run on high-speed corrugators

    Type of suppliers
    Integrated with containerboard making

Corrugators on-site
    To assure supply and avoid being leveraged by paper making scale
Cost structure not competitive if have to buy in board (shipping air)

    Purchase of feedstocks
    Not if integrated suppliers
    Integrated suppliers have 20x our scale

    Length and nature of contracts
    Multiple year (2-3), but with fixed glidepath pricing/value every year
    Significant effort for Purchases to re-enquire annually. High number of specs and low resources mean long time to qualify relative to additional value if only 12 month allocation.

    Specifications
    Standard board weights


Tailored box sizes
    Paper scale much higher so uneconomic to make tailored weight
Maximising pallet fit delivers better savings and stronger pallet (less transport damages) than scale savings of standard box size.

    Terms
    Standard, including payment terms
    High degree of competition, no specialist investment. Paper making has good cash-flow, so no need for shorter payment terms.
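The table above comes out flattened because a plain-text content handler discards the markup. Tika parsers report content as XHTML SAX events, so one possible way out (an assumption to check against your Tika version) is to capture the markup itself, for instance with org.apache.tika.sax.ToXMLContentHandler in place of a plain-text handler, and then read the table, tr and td elements. The sketch below uses only the JDK's DOM parser on a hand-written XHTML fragment that stands in for such output; the class XhtmlTableExtractor, its methods, and the exact markup are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XhtmlTableExtractor {

    /** Returns the rows of the first table that follows a p/h1..h6 element whose text equals heading. */
    static List<List<String>> tableAfter(String xhtml, String heading) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        // getElementsByTagName("*") lists all elements in document order.
        NodeList all = doc.getElementsByTagName("*");
        boolean headingSeen = false;
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            String tag = e.getTagName();
            if (!headingSeen) {
                headingSeen = tag.matches("p|h[1-6]")
                        && e.getTextContent().trim().equals(heading);
            } else if (tag.equals("table")) {
                return rows(e);
            }
        }
        return new ArrayList<>();
    }

    /** Reads a table row by row; nested tables are not handled. */
    static List<List<String>> rows(Element table) {
        List<List<String>> rows = new ArrayList<>();
        NodeList trs = table.getElementsByTagName("tr");
        for (int r = 0; r < trs.getLength(); r++) {
            List<String> cells = new ArrayList<>();
            NodeList tds = ((Element) trs.item(r)).getElementsByTagName("td");
            for (int c = 0; c < tds.getLength(); c++) {
                cells.add(tds.item(c).getTextContent().trim().replaceAll("\\s+", " "));
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hand-written stand-in for XHTML produced from the Word document.
        String xhtml = "<html><body>"
                + "<p>Optimum Sourcing Principles for Corrugates</p>"
                + "<table><tbody>"
                + "<tr><td><p>principle</p></td><td><p>optimum</p></td><td><p>rationale</p></td></tr>"
                + "<tr><td><p>Make v buy</p></td><td><p>buy</p></td>"
                + "<td><p>Multiple suppliers; commoditised technologies</p></td></tr>"
                + "</tbody></table>"
                + "<p>7.  Tactical Planning and Execution</p>"
                + "</body></html>";
        for (List<String> row : tableAfter(xhtml, "Optimum Sourcing Principles for Corrugates")) {
            System.out.println(row);
        }
    }
}
```

Cells that contain several paragraphs would have their text run together by getTextContent(), so a real version would need to join the inner paragraphs explicitly.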

Below is the sample Tika code I wrote (I couldn't figure out how to do the above when different types of documents (PDF, MS Word, etc.) arrive):

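// Imports needed by the method below
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

private void parseFileForContent(String absolutePath) throws IOException,
        SAXException, TikaException {

    System.out.println("absolutePath : " + absolutePath);

    Tika tika = new Tika();
    File path = new File(absolutePath);

    if (path.isDirectory()) {
        // listFiles() returns null if the directory cannot be read
        File[] files = path.listFiles();
        if (files == null) {
            throw new IOException("Cannot list directory " + absolutePath);
        }
        for (File file : files) {
            System.out.println("File type is " + tika.detect(file));
        }
    } else {
        System.out.println("File type is " + tika.detect(path));

        // AutoDetectParser picks the concrete parser (PDF, Word, ...) from the detected type
        Parser parser = new AutoDetectParser();
        // BodyContentHandler keeps only the plain text of the body, so table structure is lost
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();

        // Close the stream even if parsing fails
        try (InputStream stream = TikaInputStream.get(path)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        //displayMetadata(metadata);

        System.out.println("Handler " + handler.toString());
    }
}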
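For the part about different document types arriving: tika.detect returns the MIME type as a string, so one option is to route each file to a type-specific extractor through a lookup table. A minimal sketch, where ExtractorRegistry and the placeholder extractors are hypothetical names; real extractors would call POI, a PDF library, or Tika:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class ExtractorRegistry {

    static final String PDF = "application/pdf";
    static final String DOC = "application/msword";
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    // MIME type -> function that takes a file path and returns the extracted content
    private final Map<String, Function<String, String>> byMimeType = new HashMap<>();

    void register(String mimeType, Function<String, String> extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /** Routes the file to the extractor registered for its MIME type. */
    String extract(String mimeType, String path) {
        Function<String, String> extractor = byMimeType.get(mimeType);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor for " + mimeType);
        }
        return extractor.apply(path);
    }

    /** Placeholder extractors (hypothetical); each only reports which branch was taken. */
    static ExtractorRegistry withPlaceholders() {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register(PDF, path -> "pdf extractor <- " + path);
        registry.register(DOC, path -> "hwpf extractor <- " + path);
        registry.register(DOCX, path -> "xwpf extractor <- " + path);
        return registry;
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = withPlaceholders();
        // In real code the first argument would come from tika.detect(file).
        System.out.println(registry.extract(DOCX, "strategy.docx")); // prints xwpf extractor <- strategy.docx
    }
}
```

Keeping the routing in a map means a new document type only needs one more register call, not another branch in the parsing method.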

I would like to use Tika because Apache POI is confined to MS documents, but with POI I could at least do something sensible, such as extracting paragraphs and tables:

package com.lnt.sap.sp2.scratchpad;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.List;

import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class POIScratchpad {

    public static void main(String[] args) {
        String absolutePath = args[0];

        POIScratchpad poiScratchpad = new POIScratchpad();

        poiScratchpad.parseMSDocuments(absolutePath);
    }

    private void parseMSDocuments(String absolutePath) {
        // try-with-resources closes the document and the stream even on failure
        try (InputStream in = new FileInputStream(absolutePath);
                XWPFDocument doc = new XWPFDocument(in)) {

            displayElements(doc);
            // displayParagraphs(doc);
            // displayTables(doc);

        } catch (IOException e) {
            // Also covers FileNotFoundException, which is a subclass of IOException
            e.printStackTrace();
        }
    }

    private void displayElements(XWPFDocument doc) {
        // Body elements come back in document order, paragraphs and tables interleaved
        Iterator<IBodyElement> bodyElementIterator = doc
                .getBodyElementsIterator();

        int cnt = 0;

        while (bodyElementIterator.hasNext()) {
            IBodyElement element = bodyElementIterator.next();

            System.out.println("**********" + cnt + "**********");

            System.out.println("Element type is " + element.getElementType());
            System.out.println("Part is : " + element.getPart());
            System.out.println("Part Type is : " + element.getPartType());
            System.out.println("Body is : " + element.getBody());
            System.out.println("element is " + element);

            System.out.println("**********");

            cnt++;
        }
    }

    private void displayParagraphs(XWPFDocument doc) {
        List<XWPFParagraph> paragraphs = doc.getParagraphs();

        int cnt = 0;

        for (XWPFParagraph paragraph : paragraphs) {

            System.out.println("**********" + cnt + "**********");
            System.out.println(paragraph.getParagraphText());
            System.out.println("********************");

            cnt++;
        }
    }

    private void displayTables(XWPFDocument doc) {

        Iterator<XWPFTable> tableIterator = doc.getTablesIterator();

        int cnt = 0;

        while (tableIterator.hasNext()) {

            XWPFTable table = tableIterator.next();

            System.out.println("**********" + cnt + "**********");

            List<XWPFTableRow> rows = table.getRows();

            for (XWPFTableRow row : rows) {

                List<XWPFTableCell> cells = row.getTableCells();

                for (XWPFTableCell cell : cells) {
                    System.out.println(cell.getText());
                }
            }

            System.out.println("********************");

            cnt++;
        }
    }
}
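Since the body elements come back in document order, one way to get "the table under a heading" is to scan for the heading paragraph and take the next table. The helper below is a sketch of that scan. It is written against plain List and Class so that it runs without POI, and BodyWalker and firstAfter are made-up names; the strings and the String[][] in main are stand-ins for XWPFParagraph and XWPFTable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

class BodyWalker {

    /** Returns the first element of type wanted that comes after the first element accepted by isAnchor. */
    static <T> Optional<T> firstAfter(List<?> elements, Predicate<Object> isAnchor, Class<T> wanted) {
        boolean anchorSeen = false;
        for (Object element : elements) {
            if (!anchorSeen) {
                anchorSeen = isAnchor.test(element);
            } else if (wanted.isInstance(element)) {
                return Optional.of(wanted.cast(element));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-ins: a String plays the role of a paragraph, a String[][] the role of a table.
        List<Object> body = Arrays.<Object>asList(
                "6.  Statement of Strategy",
                "Optimum Sourcing Principles for Corrugates",
                new String[][] {
                    {"principle", "optimum", "rationale"},
                    {"Make v buy", "buy", "Multiple suppliers; commoditised technologies"}},
                "7.  Tactical Planning and Execution");

        Optional<String[][]> table = firstAfter(
                body,
                e -> "Optimum Sourcing Principles for Corrugates".equals(e),
                String[][].class);

        System.out.println(table.get()[1][0]); // prints Make v buy
    }
}
```

With POI on the classpath, the equivalent call would presumably be firstAfter(doc.getBodyElements(), e -> e instanceof XWPFParagraph && ((XWPFParagraph) e).getText().trim().equals("Optimum Sourcing Principles for Corrugates"), XWPFTable.class), followed by the row and cell loop from displayTables; that variant is untested here.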

How do I proceed? Which of my assumptions are unrealistic, and what additional information from the document would be required?

Source: https://stackoverflow.com/questions/25054901/word-pdf-document-parsing-hadoop-in-general-java

Tags

java

Hadoop

apache-poi

text-parsing

apache-tika