I am using Apache PDFBox with Java to parse PDFs and extract all the information from them. Text extraction works fine, but only for English. For other languages I get only som
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/pdfbox/1.6.0/org/apache/pdfbox/util/PDFText2HTML.java
The private method String escape(String chars) in that class is where the extracted characters are converted to their Unicode code values for the HTML output.
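To make that step easier to inspect, here is a minimal sketch of this kind of escaping. It is not the PDFBox source: the class name HtmlEscapeSketch and the exact rules (HTML special characters become named entities, anything outside printable ASCII becomes a numeric character reference) are assumptions for illustration only.

```java
public class HtmlEscapeSketch {

    // Hypothetical helper, written for illustration; not PDFBox's own code.
    static String escape(String chars) {
        StringBuilder builder = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (c < 32 || c > 126) {
                // Outside printable ASCII: emit a numeric character reference
                // built from the UTF-16 code unit value.
                builder.append("&#").append((int) c).append(';');
            } else {
                switch (c) {
                    case '"': builder.append("&quot;"); break;
                    case '&': builder.append("&amp;"); break;
                    case '<': builder.append("&lt;"); break;
                    case '>': builder.append("&gt;"); break;
                    default:  builder.append(c);
                }
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // \u00e9 is "e" with an acute accent, code value 233.
        System.out.println(escape("a<b \u00e9")); // prints a&lt;b &#233;
    }
}
```

If the numeric references in your HTML output already hold the wrong code values, the text was probably decoded incorrectly before it reached this step, so the escaping itself would not be the cause.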
Try changing the Java default locale. Doing this from within your Java program should be equivalent to changing the OS setting.
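A minimal sketch of changing the default locale from inside the program, using the standard java.util.Locale API. The class name LocaleSketch, the helper switchDefaultLocale, and the "ru-RU" tag are placeholders chosen for illustration. Whether this actually changes what PDFBox extracts is not confirmed here, so treat it as something to try.

```java
import java.util.Locale;

public class LocaleSketch {

    // Hypothetical helper: sets the JVM-wide default locale and returns
    // the previous one so the caller can restore it afterwards.
    static Locale switchDefaultLocale(String languageTag) {
        Locale previous = Locale.getDefault();
        Locale.setDefault(Locale.forLanguageTag(languageTag));
        return previous;
    }

    public static void main(String[] args) {
        // "ru-RU" is only an example; use the tag for the PDF's language.
        Locale previous = switchDefaultLocale("ru-RU");
        try {
            System.out.println("Default locale is now " + Locale.getDefault());
            // Run the PDFBox text extraction here.
        } finally {
            // Restore the original default for the rest of the program.
            Locale.setDefault(previous);
        }
    }
}
```

The same default can also be set at JVM startup with the standard system properties, for example -Duser.language=ru -Duser.country=RU.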