Question
My objective is to load MS Word, PDF, and similar documents onto HDFS, extract certain 'content' from each document, and use it for further analysis.
Instead of starting to fiddle with InputFormat and the like, I thought that a library such as Tika could be incorporated into MapReduce (MR) jobs.
The partial content of one of the Word documents is as follows:
6. Statement of Strategy
We have 4 strategic interventions that will deliver a competitive advantage.
Innovate upstream and downstream
1. Biopulp.
We will execute Biopulp initially in corrugate for Haircare in China. This will validate the operational process of enzymatically converting straw into pulp and paper. Then we will establish a Joint Development with Family care to extend the sources of value. And finally re-apply globally including across for other sectors and customers to maximize the value generation.
2. Mandrel Case Forming
We will extend the use of MCF technology within WE for the businesses that already use MCF cases. (i.e. F&HC). In parallel we will establish this as the global standard for HDL’s and HDW. We will seek additional suppliers to execute this technology in other regions (e.g. NA and Asia) to increase capacity and reduce cost of execution.
Supplier Strategy for Competition
3. Competition in practice
We have used negotiation as the primary process for establishing prices and supply agreements. We will more effectively create and utilize competition by using enquiries for each of our plants. This may require that we trigger new investments and qualify additional facilities, but with the consolidation going on in the industry it should not cause a net increase in suppliers.
4. Cost input pass-through
Our current agreements in general use paper as the primary driver of our feedstock clauses. If paper prices go up then our suppliers are happy and we are not. If paper prices go down then we are happy and our suppliers’ are not. This means that almost 100% of the time one party is not happy. If we change our pass-through clauses to be driven by our suppliers’ input costs, then we align ourselves with their interests which will generate less transaction cost and increase collaboration
Optimum Sourcing Principles for Corrugates
<A TABLE HERE>
7. Tactical Planning and Execution
<A TABLE HERE>
Suppose I wish to do the following:
- Extract the table under 'Optimum Sourcing Principles for Corrugates'
- Extract the bullet points under 'Innovate upstream and downstream'
This may seem crazy and absurd, but I was wondering whether Tika (which I tried, but got no further than the metadata and the file contents as a string), Lucene/Solr, POI, etc. can help to parse and 'understand' Word and PDF documents, and allow a block of data to be extracted based on some search string (or regex).
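To make the "search string (or regex)" idea concrete, here is a minimal sketch that works on plain text only. It assumes the document has already been turned into a list of lines, that the section headings are known in advance, and that the list numbering survives extraction (it did not in my Tika output below). The class and method names (`SectionExtractor`, `extractBetween`, `numberedItems`) are made up for illustration, and the sample lines are abbreviated from the document above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SectionExtractor {

    // Matches lines such as "1. Biopulp." (digits, a dot, whitespace, then text).
    private static final Pattern NUMBERED_ITEM = Pattern.compile("^\\d+\\.\\s+\\S.*");

    /** Returns the non-empty lines strictly between startHeading and endHeading. */
    static List<String> extractBetween(List<String> lines, String startHeading, String endHeading) {
        List<String> block = new ArrayList<>();
        boolean inside = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (!inside) {
                // Skip everything until the start heading is seen.
                inside = line.equals(startHeading);
                continue;
            }
            if (line.equals(endHeading)) {
                break;
            }
            if (!line.isEmpty()) {
                block.add(line);
            }
        }
        return block;
    }

    /** Keeps only the lines that look like numbered list items. */
    static List<String> numberedItems(List<String> block) {
        List<String> items = new ArrayList<>();
        for (String line : block) {
            if (NUMBERED_ITEM.matcher(line).matches()) {
                items.add(line);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Innovate upstream and downstream",
                "1. Biopulp.",
                "We will execute Biopulp initially in corrugate for Haircare in China.",
                "2. Mandrel Case Forming",
                "We will extend the use of MCF technology within WE.",
                "Supplier Strategy for Competition",
                "3. Competition in practice");
        List<String> block = extractBetween(lines,
                "Innovate upstream and downstream", "Supplier Strategy for Competition");
        System.out.println(numberedItems(block));
        // prints [1. Biopulp., 2. Mandrel Case Forming]
    }
}
```

This only works if headings can be recognised by their exact text, which is precisely the information a plain-text dump does not guarantee.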
I used the Tika parser and got the following output, which is too naive to help with content extraction (the table marked 'A TABLE HERE' in the Word document is interpreted as a sequence of paragraphs):
6. Statement of Strategy
We have 4 strategic interventions that will deliver a competitive advantage to P&G.
Innovate upstream and downstream
Biopulp.
We will execute Biopulp initially in corrugate for Haircare in China. This will validate the operational process of enzymatically converting straw into pulp and paper. Then we will establish a Joint Development with Family care to extend the sources of value. And finally re-apply globally including across for other sectors and customers to maximize the value generation.
Mandrel Case Forming
We will extend the use of MCF technology within WE for the businesses that already use MCF cases. (i.e. F&HC). In parallel we will establish this as the global standard for HDL’s and HDW. We will seek additional suppliers to execute this technology in other regions (e.g. NA and Asia) to increase capacity and reduce cost of execution.
Supplier Strategy for Competition
Competition in practice
We have used negotiation as the primary process for establishing prices and supply agreements. We will more effectively create and utilize competition by using enquiries for each of our plants. This may require that we trigger new investments and qualify additional facilities, but with the consolidation going on in the industry it should not cause a net increase in suppliers.
Cost input pass-through
Our current agreements in general use paper as the primary driver of our feedstock clauses. If paper prices go up then our suppliers are happy and we are not. If paper prices go down then we are happy and our suppliers’ are not. This means that almost 100% of the time one party is not happy. If we change our pass-through clauses to be driven by our suppliers’ input costs, then we align ourselves with their interests which will generate less transaction cost and increase collaboration.
Optimum Sourcing Principles for Corrugates
principle
optimum
rationale
Number of suppliers
2-3 per plant
>80% with 5 per region/country cluster
Competition is local
Scale the spend with central accounts
Global/local suppliers
Regional is sufficient
No advantage to global as scale is regional only and there is limited IP to transfer.
Larger regional suppliers can consolidate local single-plant suppliers to make it efficient for us. They also bring capital for machinery upgrading and scale for paper source.
Approach to suppliers
collaborative
Competition to drive price is clear; preferential and value-add deals require collaboration
Make v buy
buy
Multiple suppliers; commoditised technologies
Distance of suppliers to plant
Max 300km for boxes (300miles in NA); up to 1000km for paper reels.
Can be longer for specialist print grades or to countries with no high quality local supply
Economic max as high volume product (air in the fluting)
Need recent built paper machines to produce paper strong enough to run on high-speed corrugators
Type of suppliers
Integrated with containerboard making
Corrugators on-site
To assure supply and avoid being leveraged by paper making scale
Cost structure not competitive if have to buy in board (shipping air)
Purchase of feedstocks
Not if integrated suppliers
Integrated suppliers have 20x our scale
Length and nature of contracts
Multiple year (2-3), but with fixed glidepath pricing/value every year
Significant effort for Purchases to re-enquire annually. High number of specs and low resources mean long time to qualify relative to additional value if only 12 month allocation.
Specifications
Standard board weights
Tailored box sizes
Paper scale much higher so uneconomic to make tailored weight
Maximising pallet fit delivers better savings and stronger pallet (less transport damages) than scale savings of standard box size.
Terms
Standard, including payment terms
High degree of competition, no specialist investment. Paper making has good cash-flow, so no need for shorter payment terms.
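As far as I understand, the table is flattened above because `BodyContentHandler` keeps only the character data, whereas Tika parsers emit XHTML SAX events that a handler such as `ToXMLContentHandler` would preserve as markup. The sketch below assumes the table then arrives as `table`/`tr`/`td` elements right after a paragraph holding the heading text. I have not verified this for every format, and PDFs in particular often carry no table structure at all. The XHTML string is hand-written for illustration, not real Tika output, and `XhtmlTableReader` and `tableAfterHeading` are hypothetical names. Only the JDK's XML parser is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XhtmlTableReader {

    /**
     * Returns the first table that appears after an element whose text equals
     * the given heading, as a list of rows, each a list of cell texts.
     * Returns an empty list if the heading or a following table is not found.
     */
    static List<List<String>> tableAfterHeading(String xhtml, String heading) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        List<List<String>> rows = new ArrayList<>();
        // getElementsByTagName("*") lists all elements in document order.
        NodeList all = doc.getElementsByTagName("*");
        boolean headingSeen = false;
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (!headingSeen) {
                headingSeen = element.getTextContent().trim().equals(heading);
                continue;
            }
            if (!element.getTagName().equals("table")) {
                continue;
            }
            // Note: rows of nested tables, if any, would be included here too.
            NodeList trs = element.getElementsByTagName("tr");
            for (int r = 0; r < trs.getLength(); r++) {
                NodeList tds = ((Element) trs.item(r)).getElementsByTagName("td");
                List<String> cells = new ArrayList<>();
                for (int c = 0; c < tds.getLength(); c++) {
                    cells.add(tds.item(c).getTextContent().trim());
                }
                rows.add(cells);
            }
            break;
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<p>Optimum Sourcing Principles for Corrugates</p>"
                + "<table><tbody>"
                + "<tr><td>principle</td><td>optimum</td><td>rationale</td></tr>"
                + "<tr><td>Make v buy</td><td>buy</td>"
                + "<td>Multiple suppliers; commoditised technologies</td></tr>"
                + "</tbody></table>"
                + "</body></html>";
        for (List<String> row : tableAfterHeading(xhtml, "Optimum Sourcing Principles for Corrugates")) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

With rows and cells kept apart like this, 'principle', 'optimum' and 'rationale' stay on one row instead of becoming three separate paragraphs.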
Below is the sample Tika code I wrote (I couldn't figure out how to do the above when different types of documents, such as PDF and MS Word, arrive):
private void parseFileForContent(String absolutePath) throws IOException,
        SAXException, TikaException {
    System.out.println("absolutePath : " + absolutePath);
    Tika tika = new Tika();
    File path = new File(absolutePath);
    if (path.isDirectory()) {
        // For a directory, only report the detected type of each file.
        File[] files = path.listFiles();
        if (files != null) {
            for (File file : files) {
                System.out.println("File type is " + tika.detect(file));
            }
        }
    } else {
        System.out.println("File type is " + tika.detect(path));
        Parser parser = new AutoDetectParser();
        // BodyContentHandler collects only the plain text of the document body.
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        // Close the stream once parsing has finished.
        try (TikaInputStream stream = TikaInputStream.get(path)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        //displayMetadata(metadata);
        System.out.println("Handler " + handler.toString());
    }
}
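For the "different types of documents arrive" part, one option I can imagine is to key format-specific extraction logic on the MIME type string that `tika.detect(...)` returns. This is only a sketch of that dispatch idea: `ExtractorRegistry` is a hypothetical class, and the lambdas are placeholders standing in for real PDF or Word extraction code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ExtractorRegistry {

    // MIME type for .docx files.
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    // Hypothetical extractor shape: takes a file path, returns the extracted content.
    private final Map<String, Function<String, String>> byMimeType = new HashMap<>();

    void register(String mimeType, Function<String, String> extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /** Runs the extractor registered for the MIME type, or fails if there is none. */
    String extract(String mimeType, String path) {
        Function<String, String> extractor = byMimeType.get(mimeType);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for " + mimeType);
        }
        return extractor.apply(path);
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        // Placeholder extractors; real ones would call PDF- or Word-specific code.
        registry.register("application/pdf", path -> "pdf:" + path);
        registry.register(DOCX, path -> "docx:" + path);
        System.out.println(registry.extract("application/pdf", "/tmp/report.pdf"));
        // prints pdf:/tmp/report.pdf
    }
}
```

The MIME type passed to `extract` would come from the detection step, so each format can get its own structure-aware handling while the calling code stays the same.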
I wish to use Tika because Apache POI is confined to MS documents, but with POI I could at least do something sensible, such as extracting paragraphs and tables:
package com.lnt.sap.sp2.scratchpad;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class POIScratchpad {

    public static void main(String[] args) {
        String absolutePath = args[0];
        POIScratchpad poiScratchpad = new POIScratchpad();
        poiScratchpad.parseMSDocuments(absolutePath);
    }

    private void parseMSDocuments(String absolutePath) {
        // try-with-resources closes the input stream when done.
        try (FileInputStream in = new FileInputStream(absolutePath)) {
            XWPFDocument doc = new XWPFDocument(in);
            displayElements(doc);
            // displayParagraphs(doc);
            // displayTables(doc);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Walks the body elements (paragraphs and tables) in document order.
    private void displayElements(XWPFDocument doc) {
        Iterator<IBodyElement> bodyElementIterator = doc
                .getBodyElementsIterator();
        int cnt = 0;
        while (bodyElementIterator.hasNext()) {
            IBodyElement element = bodyElementIterator.next();
            System.out.println("**********" + cnt + "**********");
            System.out.println("Element type is " + element.getElementType());
            System.out.println("Part is : " + element.getPart());
            System.out.println("Part Type is : " + element.getPartType());
            System.out.println("Body is : " + element.getBody());
            System.out.println("element is " + element);
            System.out.println("**********");
            cnt++;
        }
    }

    // Prints the text of every paragraph.
    private void displayParagraphs(XWPFDocument doc) {
        List<XWPFParagraph> paragraphs = doc.getParagraphs();
        int cnt = 0;
        for (XWPFParagraph paragraph : paragraphs) {
            System.out.println("**********" + cnt + "**********");
            System.out.println(paragraph.getParagraphText());
            System.out.println("********************");
            cnt++;
        }
    }

    // Prints the text of every cell of every table, one cell per line.
    private void displayTables(XWPFDocument doc) {
        Iterator<XWPFTable> tableIterator = doc.getTablesIterator();
        int cnt = 0;
        while (tableIterator.hasNext()) {
            XWPFTable table = tableIterator.next();
            System.out.println("**********" + cnt + "**********");
            List<XWPFTableRow> rows = table.getRows();
            for (XWPFTableRow row : rows) {
                List<XWPFTableCell> cells = row.getTableCells();
                for (XWPFTableCell cell : cells) {
                    System.out.println(cell.getText());
                }
            }
            System.out.println("********************");
            cnt++;
        }
    }
}
How do I proceed? Where are my assumptions unrealistic, or where is more information from the document required?
Source: https://stackoverflow.com/questions/25054901/word-pdf-document-parsing-hadoop-in-general-java