I have an MS Word .doc file that I'm converting to an HTML document using Apache POI.
This is the code I'm running:
InputStream input = new FileInput
This is not an encoding problem but a font problem. Word uses ANSI codes and special fonts for its default bullet lists. The first bullet point, for example, is a bullet from the font "Symbol". The second bullet point is a circle (the letter "o") from the font "Courier New". The third bullet point is a square from the font "Wingdings".
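To find out which code a given bullet actually carries, it can help to dump the characters of the bullet text. This is only a minimal sketch: it assumes that symbol-font characters arrive in the Private Use Area range U+F000 to U+F0FF, which is what the `\uF0B7` and `\uF0A7` values in the example further down suggest. `BulletInspector` and `describe` are hypothetical names and not part of POI.

```java
final class BulletInspector {

    // Describes every char of the bullet text as "U+XXXX"; chars in the
    // assumed symbol-font range U+F000..U+F0FF also get their low byte,
    // which is the code of the glyph inside the symbol font.
    static String describe(String bulletText) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bulletText.length(); i++) {
            char c = bulletText.charAt(i);
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(String.format("U+%04X", (int) c));
            if (c >= '\uF000' && c <= '\uF0FF') {
                sb.append(String.format(" (symbol font code 0x%02X)", c & 0xFF));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("\uF0B7")); // Symbol bullet as used below
        System.out.println(describe("o"));      // Courier New circle
    }
}
```

Calling `describe(bulletText)` from inside an overridden `processParagraph` would show the codes for any bullet style the three replacements below do not cover.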
So the easiest approach is simply to replace the ANSI codes of the bullet texts with Unicode characters. Once that is done, we can use UTF-8 for the HTML.
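The replacement itself can be sketched in isolation, without POI on the classpath. This is only an illustration of the idea: `BulletMapper` and `toUnicode` are hypothetical names, the table holds just the three default bullets mentioned above, and it leaves out the non-breaking-space indentation that the full example adds.

```java
import java.util.HashMap;
import java.util.Map;

final class BulletMapper {

    // Word bullet code -> Unicode replacement (assumed to cover only the
    // three default list levels described above).
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\uF0B7', "\u2022"); // Symbol bullet -> U+2022
        REPLACEMENTS.put('o', "\u26AA");      // Courier New "o" -> U+26AA
        REPLACEMENTS.put('\uF0A7', "\u25AA"); // Wingdings square -> U+25AA
    }

    // Swaps every known bullet code in the text; all other chars are kept.
    static String toUnicode(String bulletText) {
        if (bulletText == null || bulletText.isEmpty()) {
            return bulletText;
        }
        StringBuilder sb = new StringBuilder(bulletText.length());
        for (int i = 0; i < bulletText.length(); i++) {
            char c = bulletText.charAt(i);
            String mapped = REPLACEMENTS.get(c);
            sb.append(mapped != null ? mapped : String.valueOf(c));
        }
        return sb.toString();
    }
}
```

Note that, like the `replace` calls in the full example, this maps every "o" in the bullet text, so it is only safe for bullet texts that consist of the bullet character plus whitespace.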
Example:
Word (WordBulletList.doc):
Java:
import java.io.StringWriter;
import java.io.FileInputStream;
import java.io.File;
import java.io.PrintWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.FontReplacer;
import org.apache.poi.hwpf.converter.FontReplacer.Triplet;
import org.w3c.dom.Document;
import java.awt.Desktop;
public class TestWordToHtmlConverter {

    public static void main(String[] args) throws Exception {
        Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        // Override processParagraph to swap Word's font-specific bullet codes for Unicode characters.
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument) {
            @Override
            protected void processParagraph(HWPFDocumentCore hwpfDocument,
                                            org.w3c.dom.Element parentElement,
                                            int currentTableLevel,
                                            Paragraph paragraph,
                                            String bulletText) {
                // Compare by content; bulletText != "" would only compare references.
                if (bulletText != null && !bulletText.isEmpty()) {
                    // Uncomment to print the code of an unknown bullet character:
                    //System.out.println((int)bulletText.charAt(0));
                    bulletText = bulletText.replace("\uF0B7", "\u2022"); // Symbol bullet
                    bulletText = bulletText.replace("\u006F", "\u00A0\u00A0\u26AA"); // Courier New "o", indented
                    bulletText = bulletText.replace("\uF0A7", "\u00A0\u00A0\u00A0\u00A0\u25AA"); // Wingdings square, indented
                }
                super.processParagraph(hwpfDocument, parentElement, currentTableLevel, paragraph, bulletText);
            }
        };

        // Close the input stream once the document has been processed.
        try (FileInputStream in = new FileInputStream("WordBulletList.doc")) {
            wordToHtmlConverter.processDocument(new HWPFDocument(in));
        }

        StringWriter stringWriter = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));
        String html = stringWriter.toString();

        // Write the file as UTF-8 so it matches the encoding declared in the HTML.
        try (PrintWriter out = new PrintWriter("WordBulletList.html", "UTF-8")) {
            out.println(html);
        }

        // Open the result in the default browser.
        File htmlFile = new File("WordBulletList.html");
        Desktop.getDesktop().browse(htmlFile.toURI());
    }
}
HTML:
...
Word bullet list:
• Bullet1
⚪ Bullet2
▪ Bullet3
⚪ Bullet2
• Bullet1
End
...