Reading equations from Word (*.docx) to HTML together with their text context using apache poi

问题

We are building a java code to read word document (.docx) into our program using apache POI. We are stuck when we encounter formulas and chemical equation inside the document. Yet, we managed to read formulas but we have no idea how to locate its index in concerned string..

INPUT (format is *.docx)

text before formulae **CHEMICAL EQUATION** text after

OUTPUT (format shall be HTML) we designed

text before formulae text after **CHEMICAL EQUATION**

We are unable to fetch the string and reconstruct to its original form.

Question

Now is there any way to locate the position of the image and formulae within the stripped line, so that it can be restored to its original form after reconstruction of the string, as against having it appended at the end of string.?

回答1:

       XWPFParagraph paragraph;

        for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
            formulas=formulas + getMathML(ctomath);
        }

With the above code it is able to extract the math formula from the given paragraph of a docx file.
Also for the purpose displaying the formula in a html page I m converting it to mathml code and rendering it with MathJax on the page. This I m able to do.
But the problem is, Is it possible to get the position of the formula in the given paragraph. So that I can display the formula in the exact location in the paragraph while rendering it as a html page.

回答2:

If the needed format is HTML, then Word text content together with Office MathML equations can be read the following way.

In Reading equations & formula from Word (Docx) to html and save database using java I have provided an example which gets all Office MathML equations out of an Word document into HTML. It uses paragraph.getCTP().getOMathList() and paragraph.getCTP().getOMathParaList() to get the OMath elements from the paragraph. This takes the OMath elements out of the text context.

If one wants get those OMath elements in context with the other elements in the paragraphs, then using a org.apache.xmlbeans.XmlCursor is needed to loop over all different XML elements in the paragraph. The following example uses the XmlCursor to get text runs together with OMath elements from the paragraph.

The transformation from Office MathML into MathML is taken using the same XSLT approach as in Reading equations & formula from Word (Docx) to html and save database using java. There also is described where the OMML2MML.XSL comes from.

The file Formula.docx looks like:

Code:

import java.io.*;
import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;

import org.apache.xmlbeans.XmlCursor;

import org.w3c.dom.Node;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

import java.awt.Desktop;

import java.util.List;
import java.util.ArrayList;

/*
needs the full ooxml-schemas-1.4.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/

public class WordReadTextWithFormulasAsHTML {

 static File stylesheet = new File("OMML2MML.XSL");
 static TransformerFactory tFactory = TransformerFactory.newInstance();
 static StreamSource stylesource = new StreamSource(stylesheet);

 //method for getting MathML from oMath
 static String getMathML(CTOMath ctomath) throws Exception {
  Transformer transformer = tFactory.newTransformer(stylesource);

  Node node = ctomath.getDomNode();

  DOMSource source = new DOMSource(node);
  StringWriter stringwriter = new StringWriter();
  StreamResult result = new StreamResult(stringwriter);
  transformer.setOutputProperty("omit-xml-declaration", "yes");
  transformer.transform(source, result);

  String mathML = stringwriter.toString();
  stringwriter.close();

  //The native OMML2MML.XSL transforms OMML into MathML as XML having special name spaces.
  //We don't need this since we want using the MathML in HTML, not in XML.
  //So ideally we should changing the OMML2MML.XSL to not do so.
  //But to take this example as simple as possible, we are using replace to get rid of the XML specialities.
  mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
  mathML = mathML.replaceAll("xmlns:mml", "xmlns");
  mathML = mathML.replaceAll("mml:", "");

  return mathML;
 }

 //method for getting HTML including MathML from XWPFParagraph
 static String getTextAndFormulas(XWPFParagraph paragraph) throws Exception {

  StringBuffer textWithFormulas = new StringBuffer();

  //using a cursor to go through the paragraph from top to down
  XmlCursor xmlcursor = paragraph.getCTP().newCursor();

  while (xmlcursor.hasNextToken()) {
   XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
   if (tokentype.isStart()) {
    if (xmlcursor.getName().getPrefix().equalsIgnoreCase("w") && xmlcursor.getName().getLocalPart().equalsIgnoreCase("r")) {
     //elements w:r are text runs within the paragraph
     //simply append the text data
     textWithFormulas.append(xmlcursor.getTextValue());
    } else if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("oMath")) {
     //we have oMath
     //append the oMath as MathML
     textWithFormulas.append(getMathML((CTOMath)xmlcursor.getObject()));
    } 
   } else if (tokentype.isEnd()) {
    //we have to check whether we are at the end of the paragraph
    xmlcursor.push();
    xmlcursor.toParent();
    if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("p")) {
     break;
    }
    xmlcursor.pop();
   }
  }

  return textWithFormulas.toString();
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

  //using a StringBuffer for appending all the content as HTML
  StringBuffer allHTML = new StringBuffer();

  //loop over all IBodyElements - should be self explained
  for (IBodyElement ibodyelement : document.getBodyElements()) {
   if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
    XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
    allHTML.append("<p>");
    allHTML.append(getTextAndFormulas(paragraph));
    allHTML.append("</p>");
   } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
    XWPFTable table = (XWPFTable)ibodyelement;
    allHTML.append("<table border=1>");
    for (XWPFTableRow row : table.getRows()) {
     allHTML.append("<tr>");
     for (XWPFTableCell cell : row.getTableCells()) {
      allHTML.append("<td>");
      for (XWPFParagraph paragraph : cell.getParagraphs()) {
       allHTML.append("<p>");
       allHTML.append(getTextAndFormulas(paragraph));
       allHTML.append("</p>");
      }
      allHTML.append("</td>");
     }
     allHTML.append("</tr>");
    }
    allHTML.append("</table>");
   }
  }

  document.close();

  //creating a sample HTML file 
  String encoding = "UTF-8";
  FileOutputStream fos = new FileOutputStream("result.html");
  OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
  writer.write("<!DOCTYPE html>\n");
  writer.write("<html lang=\"en\">");
  writer.write("<head>");
  writer.write("<meta charset=\"utf-8\"/>");

  //using MathJax for helping all browsers to interpret MathML
  writer.write("<script type=\"text/javascript\"");
  writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
  writer.write(">");
  writer.write("</script>");

  writer.write("</head>");
  writer.write("<body>");

  writer.write(allHTML.toString());

  writer.write("</body>");
  writer.write("</html>");
  writer.close();

  Desktop.getDesktop().browse(new File("result.html").toURI());

 }
}

Result:

来源：https://stackoverflow.com/questions/59414033/reading-equations-from-word-docx-to-html-together-with-their-text-context-us

标签

java

apache-poi

position

formula

equation