Encoding issue with Apache POI converter

庸人自扰 2021-01-27 07:05

I have an MS Word .doc file that I'm converting to an HTML document using Apache POI.

This is the code I'm running:

    InputStream input = new FileInput         


        
2 Answers
  • 2021-01-27 07:42

    This is not an encoding problem but a font problem. Word stores the characters of its default bullet lists as ANSI codes that only make sense in special fonts. The first bullet point, for example, is a bullet from the font "Symbol". The second bullet point is a circle (the letter "o") from the font "Courier New". The third bullet point is a square from the font "Wingdings".

    So the easiest approach is to replace the ANSI codes in the bullet texts with their Unicode equivalents. Once that is done, we can use UTF-8 for the HTML.
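To see why the raw bullet characters do not render in a browser, it can help to inspect their code points. The following is a minimal sketch, not part of the original answer: the class name `BulletCodePoints` and its replacement table are hypothetical, and it only covers the two private-use code points that appear in the answer's code (U+F0B7 and U+F0A7). It checks that such a character falls in a Unicode private-use area and maps it to a standard Unicode bullet.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BulletCodePoints {

    // Hypothetical table: private-use bullet code point -> standard Unicode replacement.
    static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("\uF0B7", "\u2022"); // Symbol font bullet -> BULLET
        REPLACEMENTS.put("\uF0A7", "\u25AA"); // Wingdings square   -> BLACK SMALL SQUARE
    }

    /** Returns true if the character lies in a Unicode private-use area. */
    public static boolean isPrivateUse(char c) {
        return Character.getType(c) == Character.PRIVATE_USE;
    }

    /** Replaces every known private-use bullet with its Unicode counterpart. */
    public static String toUnicode(String bulletText) {
        String result = bulletText;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String raw = "\uF0B7";
        // A private-use character has no standard glyph outside its special font.
        System.out.println(isPrivateUse(raw.charAt(0)));
        // After the mapping, the text holds a regular Unicode bullet.
        System.out.println(toUnicode(raw).equals("\u2022"));
    }
}
```

A private-use code point has no meaning without the font it was written for, which is why the replacement has to happen before the HTML is serialized.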

    Example:

    Word WordBulletList.doc:

    Java:

    import java.io.StringWriter;
    import java.io.FileInputStream;
    import java.io.File;
    import java.io.PrintWriter;
    
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    
    import javax.xml.parsers.DocumentBuilderFactory;
    
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.HWPFDocumentCore;
    import org.apache.poi.hwpf.usermodel.Paragraph;
    import org.apache.poi.hwpf.converter.WordToHtmlConverter;
    import org.apache.poi.hwpf.converter.FontReplacer;
    import org.apache.poi.hwpf.converter.FontReplacer.Triplet;
    
    import org.w3c.dom.Document;
    
    import java.awt.Desktop;
    
    public class TestWordToHtmlConverter {
    
     public static void main(String[] args) throws Exception {
    
      Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    
      WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument) {
    
       @Override
       protected void processParagraph(HWPFDocumentCore hwpfDocument, 
                                       org.w3c.dom.Element parentElement, 
                                       int currentTableLevel, 
                                       Paragraph paragraph, 
                                       java.lang.String bulletText) {
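        // Compare string content rather than references.
        if (bulletText != null && !bulletText.isEmpty()) {
         // Uncomment to inspect the code point of the bullet character:
         //System.out.println((int)bulletText.charAt(0));
         // U+F0B7 (Symbol bullet) -> U+2022 BULLET
         bulletText = bulletText.replace("\uF0B7", "\u2022");
         // Letter "o" (Courier New circle) -> indented U+26AA MEDIUM WHITE CIRCLE
         bulletText = bulletText.replace("\u006F", "\u00A0\u00A0\u26AA");
         // U+F0A7 (Wingdings square) -> indented U+25AA BLACK SMALL SQUARE
         bulletText = bulletText.replace("\uF0A7", "\u00A0\u00A0\u00A0\u00A0\u25AA");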
        }
    
        super.processParagraph(hwpfDocument, parentElement, currentTableLevel, paragraph, bulletText);
       }
    
      };
    
      wordToHtmlConverter.processDocument(new HWPFDocument(new FileInputStream("WordBulletList.doc")));
    
      StringWriter stringWriter = new StringWriter();
      Transformer transformer = TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty( OutputKeys.INDENT, "yes" );
      transformer.setOutputProperty( OutputKeys.ENCODING, "utf-8" );
      transformer.setOutputProperty( OutputKeys.METHOD, "html" );
      transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));
    
      String html = stringWriter.toString();
    
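      // Write the file as UTF-8 explicitly so it matches the encoding declared for the transformer.
      try (PrintWriter out = new PrintWriter("WordBulletList.html", "UTF-8")) {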
        out.println(html);
      }
    
      File htmlFile = new File("WordBulletList.html");
      Desktop.getDesktop().browse(htmlFile.toURI());
    
     }
    }
    

    HTML:

    ...
    <body class="b1 b2">
    <p class="p1">
    <span>Word bullet list:</span>
    </p>
    <p class="p2">
    <span class="s1">&bull;​&nbsp;</span><span>Bullet1</span>
    </p>
    <p class="p2">
    <span class="s1">&nbsp;&nbsp;⚪​&nbsp;</span><span>Bullet2</span>
    </p>
    <p class="p2">
    <span class="s1">&nbsp;&nbsp;&nbsp;&nbsp;▪​&nbsp;</span><span>Bullet3</span>
    </p>
    <p class="p2">
    <span class="s1">&nbsp;&nbsp;⚪​&nbsp;</span><span>Bullet2</span>
    </p>
    <p class="p2">
    <span class="s1">&bull;​&nbsp;</span><span>Bullet1</span>
    </p>
    <p class="p1">
    <span>End</span>
    </p>
    </body>
    ...
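An alternative to collecting the HTML in a `String` and writing it with a `PrintWriter` is to let the transformer produce the bytes itself, so that the encoding set via `OutputKeys.ENCODING` is the one actually used. The sketch below is only an illustration under that assumption: the class `Utf8HtmlBytes` and its small hand-built DOM are hypothetical stand-ins for the document returned by `wordToHtmlConverter.getDocument()`, and no Apache POI classes are involved.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Utf8HtmlBytes {

    /** Serializes a DOM document to HTML, letting the transformer encode the bytes as UTF-8. */
    public static byte[] toUtf8Html(Document document) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        // Writing to an OutputStream means the serializer applies the encoding itself.
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toByteArray();
    }

    /** Builds a tiny stand-in document containing one bullet paragraph. */
    public static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("\u25AA\u00A0Bullet1");
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = toUtf8Html(sampleDocument());
        // Decode with the same charset that was used for encoding.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The resulting byte array can be written to a file or an HTTP response unchanged, provided the consumer is told the content is UTF-8.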
    
  • 2021-01-27 07:49

    Problem SOLVED

    I finally found a way to resolve this particular problem. The answer was inspired by @pawelini1's own question, "Encoding issue with Apache POI".

    The solution is simple: all I did was run my HTML string through URLEncoder and URLDecoder.

    String html = URLEncoder.encode(new String(outStream.toByteArray(), "UTF-8"), "UTF-8");
    String decoded = URLDecoder.decode(html, "UTF-8");
    

    Now my webpage is displaying properly.
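For readers who want to understand what the two lines above do, the following sketch compares the URL encode/decode round trip with simply decoding the bytes as UTF-8. The class `RoundTripCheck` and the sample HTML are hypothetical. The asker's original code is truncated, so this is an observation about these two calls and not a diagnosis of the original bug.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RoundTripCheck {

    /** Mirrors the two lines from the answer: decode bytes as UTF-8, URL-encode, URL-decode. */
    public static String viaUrlCodec(byte[] utf8Bytes) throws Exception {
        String encoded = URLEncoder.encode(new String(utf8Bytes, "UTF-8"), "UTF-8");
        return URLDecoder.decode(encoded, "UTF-8");
    }

    /** Decodes the bytes as UTF-8 without any URL encoding step. */
    public static String direct(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "<span>\u2022\u00A0Bullet1</span>".getBytes(StandardCharsets.UTF_8);
        // For this input, both paths yield the same string.
        System.out.println(viaUrlCodec(bytes).equals(direct(bytes)));
    }
}
```

If both paths give the same string for your data as well, the step that matters is probably `new String(outStream.toByteArray(), "UTF-8")`, that is, decoding the converter's bytes with an explicit charset instead of the platform default.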
