Use R to convert PDF files to text files for text mining

刺人心 2020-12-13 02:51

I have nearly one thousand PDF journal articles in a folder. I need to text mine the abstracts of all the articles in that folder. Now I am doing the following:

2 answers
  •  囚心锁ツ
    2020-12-13 03:19

    Yes, not really an R question as IShouldBuyABoat notes, but something that R can do with only minor contortions...

    Use R to convert PDF files to txt files...

    # folder with 1000s of PDFs
    dest <- "C:\\Users\\Desktop"
    
    # make a vector of PDF file names
    # (anchored regex, so only names ending in .pdf match, not any name containing "pdf")
    myfiles <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE, ignore.case = TRUE)
    
    # convert each PDF file that is named in the vector into a text file 
    # text file is created in the same directory as the PDFs
    # note that my pdftotext.exe is in a different location to yours
    # wait = TRUE so that every text file exists before the next step reads it
    invisible(lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', 
                 paste0('"', i, '"')), wait = TRUE)))
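
If a conversion fails, the matching text file will be missing, so it can help to check which PDFs have no text file before extracting abstracts. The following is a minimal sketch, not part of the original answer: `pending_pdfs` is a hypothetical helper name, and it assumes pdftotext wrote `name.txt` next to `name.pdf` (its default when no output file is given). The demonstration uses empty placeholder files in a temporary folder rather than real PDFs.

```r
# Hypothetical helper: return the PDF paths that have no matching .txt file yet.
# Assumes each text file sits next to its PDF and differs only in extension.
pending_pdfs <- function(pdf_paths) {
  txt_paths <- sub("\\.pdf$", ".txt", pdf_paths, ignore.case = TRUE)
  pdf_paths[!file.exists(txt_paths)]
}

# Self-contained demonstration with empty placeholder files
demo_dir <- file.path(tempdir(), "pdf_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("a.pdf", "b.pdf", "a.txt"))))

demo_pdfs <- list.files(demo_dir, pattern = "\\.pdf$", full.names = TRUE)
still_missing <- pending_pdfs(demo_pdfs)
basename(still_missing)  # only b.pdf lacks a text file
```

In the real folder, `pending_pdfs(myfiles)` would list the PDFs that still need converting or that pdftotext could not process.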
    

    Extract only abstracts from txt files...

    # if you just want the abstracts, we can use regex to extract that part of
    # each txt file. Assumes that the abstract is always between the words 'Abstract'
    # and 'Introduction'
    mytxtfiles <- list.files(path = dest, pattern = "\\.txt$",  full.names = TRUE)
    abstracts <- lapply(mytxtfiles, function(i) {
      # quote = "" stops scan() from treating apostrophes in the text as string delimiters
      j <- paste0(scan(i, what = character(), quote = "", quiet = TRUE), collapse = " ")
      regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl=TRUE))
    })
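
To see what the lookbehind/lookahead pattern returns, here is a small worked example on made-up strings (the sample text and the helper name `first_abstract` are illustrative assumptions, not from the original answer). It uses `regexpr` instead of `gregexpr` to keep only the first match, and returns `NA` when either marker word is missing. Note that the match is case-sensitive, so a heading written as "ABSTRACT" would not be found.

```r
# Hypothetical helper: first stretch of text between 'Abstract' and 'Introduction'
first_abstract <- function(txt) {
  m <- regmatches(txt, regexpr("(?<=Abstract).*?(?=Introduction)", txt, perl = TRUE))
  if (length(m) == 0) NA_character_ else trimws(m)
}

# Made-up sample strings standing in for the contents of converted text files
with_markers    <- "Some Title Abstract We study pdf conversion. Introduction Text mining is ..."
without_markers <- "Some Title Summary We study pdf conversion. Background ..."

found     <- first_abstract(with_markers)
not_found <- first_abstract(without_markers)

found      # "We study pdf conversion."
not_found  # NA
```

Articles where this returns `NA` probably use different section headings and would need their own pattern.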
    

    Write abstracts into separate txt files...

    # write abstracts as txt files 
    # (or use them in the list for whatever you want to do next)
    invisible(lapply(seq_along(abstracts), function(i)
      write.table(abstracts[i], file = paste(mytxtfiles[i], "abstract", "txt", sep = "."),
                  quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " ")))
    

    And now you're ready to do some text mining on the abstracts.
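
As one possible first step, a simple word-frequency count needs only base R. This is a rough sketch on two made-up abstract strings, not the method from the original answer. The variable names are illustrative, and the tokenization (lowercase, split on anything that is not a to z) is a deliberately crude assumption.

```r
# Made-up abstracts standing in for the extracted text
abstract_text <- c("Text mining of PDF articles",
                   "Mining text from PDF abstracts")

# Lowercase, split on runs of non-letters, drop empty tokens
words <- unlist(strsplit(tolower(abstract_text), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Count each word, most frequent first
freq <- sort(table(words), decreasing = TRUE)
head(freq)
```

For real work, a dedicated text mining package would handle stop words, stemming and document-term matrices more carefully than this.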
