Question
Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks.
Any ideas (not necessarily using Tika) on how to extract text from a PDF while converting ligatures into separate characters?
File file = new File("path/to/file.pdf");
String text = new Tika().parseToString(file);
Edit
My PDF file is UTF-8 encoded (that's what InputStream.getEncoding() says), and my platform encoding is also UTF-8. Even with -Dfile.encoding=UTF8, it does not work.
For instance, I'm supposed to get "différentes implémentations", but what I actually get is "di��erentes impl�ementations".
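To make "converting ligatures to separate characters" concrete, here is a minimal sketch of one possible approach, not a confirmed fix for the Tika behaviour above. It assumes the extracted text still contains the Unicode ligature code points (such as U+FB00 "ff" and U+FB01 "fi"); in that case compatibility normalization (NFKC) with the JDK's java.text.Normalizer expands them into plain letters. If the extractor has already emitted replacement characters, the original glyph is lost and this will not recover it. The class and method names are hypothetical.

```java
import java.text.Normalizer;

public class LigatureExpander {

    // Expands compatibility characters, including the Latin ligatures
    // U+FB00..U+FB06, into their plain-letter equivalents using NFKC.
    public static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumed extractor output: "ff" arrives as the single code point U+FB00.
        String extracted = "di\uFB00\u00E9rentes impl\u00E9mentations";
        System.out.println(expandLigatures(extracted));
    }
}
```

Note that NFKC also rewrites other compatibility characters (for example non-breaking spaces and full-width forms), so it may change more than just the ligatures.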
Source: https://stackoverflow.com/questions/22348632/handle-ligatures-in-apache-tika