Encoding issue with apache poi converter

前端未结

关注

 2  1543

I have an ms word doc file that i\'m converting to an html document using apache poi.

this is the code i\'m running

    InputStream input = new FileInput


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  礼貌的吻别        
                
              
                            
                2021-01-27 07:42
              
            
            
                                                                       
This is not an encoding problem but a font problem. Word uses ANSI code and special fonts for it's default bullet lists. The first bullet point for example is a bullet from font "Symbol". The second bullet point is a circle from font "Courier New", The third bullet point is a square from font "Wingdings". 

So the easiest possibility will be simply to replace the ANSI codes of the bullet texts with unicode. So done we can use UTF-8 for the HTML.

Example:

Word WordBulletList.doc:



Java:

import java.io.StringWriter;
import java.io.FileInputStream;
import java.io.File;
import java.io.PrintWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.FontReplacer;
import org.apache.poi.hwpf.converter.FontReplacer.Triplet;

import org.w3c.dom.Document;

import java.awt.Desktop;

public class TestWordToHtmlConverter {

 public static void main(String[] args) throws Exception {

  Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

  WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument) {

   protected void processParagraph(HWPFDocumentCore hwpfDocument, 
                                   org.w3c.dom.Element parentElement, 
                                   int currentTableLevel, 
                                   Paragraph paragraph, 
                                   java.lang.String bulletText) {
    if (bulletText!="") {
     //System.out.println((int)bulletText.charAt(0));
     bulletText = bulletText.replace("\uF0B7", "\u2022");
     bulletText = bulletText.replace("\u006F", "\u00A0\u00A0\u26AA");
     bulletText = bulletText.replace("\uF0A7", "\u00A0\u00A0\u00A0\u00A0\u25AA");
    }

    super.processParagraph(hwpfDocument, parentElement, currentTableLevel, paragraph, bulletText);
   }

  };

  wordToHtmlConverter.processDocument(new HWPFDocument(new FileInputStream("WordBulletList.doc")));

  StringWriter stringWriter = new StringWriter();
  Transformer transformer = TransformerFactory.newInstance().newTransformer();
  transformer.setOutputProperty( OutputKeys.INDENT, "yes" );
  transformer.setOutputProperty( OutputKeys.ENCODING, "utf-8" );
  transformer.setOutputProperty( OutputKeys.METHOD, "html" );
  transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));

  String html = stringWriter.toString();

  try(PrintWriter out = new PrintWriter("WordBulletList.html")) {
    out.println(html);
  }

  File htmlFile = new File("WordBulletList.html");
  Desktop.getDesktop().browse(htmlFile.toURI());

 }
}


HTML:

...
<body class="b1 b2">
<p class="p1">
<span>Word bullet list:</span>
</p>
<p class="p2">
<span class="s1">&bull;&nbsp;</span><span>Bullet1</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;⚪&nbsp;</span><span>Bullet2</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;&nbsp;&nbsp;▪&nbsp;</span><span>Bullet3</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;⚪&nbsp;</span><span>Bullet2</span>
</p>
<p class="p2">
<span class="s1">&bull;&nbsp;</span><span>Bullet1</span>
</p>
<p class="p1">
<span>End</span>
</p>
</body>
...

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  南笙        
                
              
                            
                2021-01-27 07:49
              
            
            
                                                                       
Problem SOLVED

I finally found a way to resolve this particular problem. The answer was inspired by @pawelini1 with his own question Encoding issue with Apache POI

The solution is simple, all I did was use a URLEncoder/Decoder on my html string

String html = URLEncoder.encode(new String(outStream.toByteArray(), "UTF-8"), "UTF-8");
String decoded = URLDecoder.decode(html, "UTF-8");


Now my webpage is displaying properly.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复