Text mining PDF files: issues with word frequencies
Question

I am trying to text-mine a PDF of an article that has rich PDF encodings and graphs. When I mine some PDF documents, the highest-frequency words turn out to be things like phi, taeoe, toe, sigma, gamma, etc. The approach works well for some PDF documents, but for others I get these seemingly random Greek letter names. Is this a character-encoding problem? (By the way, all the documents are in English.) Any suggestions?

# Here is the link to the PDF file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm