I currently have code that works for a single-word search. Can we search for multiple words and write the matched words to a data frame? (For clarification, please refer to this post.)
You go through every PDF in your directory with the outer loop. Then you go through all pages of the PDF and extract the text in the inner loop. You want to check for every document whether at least one page contains either "school", "gym" or "swimming pool". The returned values you want to use are "Present" or "Not present". Right?
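To make that requirement concrete, here is a minimal sketch on made-up page text, with no PDFs or OCR involved. The `pages` vector and its contents are invented purely for illustration; each element stands for the text of one page of a single document.

```r
# Hypothetical page texts for one document; each element is one page.
pages <- c("the hotel has a gym on the first floor",
           "breakfast is served from 7 to 10",
           "a swimming pool and a gym are next to the lobby")
strings <- c("school", "gym", "swimming pool")

# For each search string: does it occur on at least one page?
# fixed = TRUE treats the string literally instead of as a regex.
found <- vapply(strings,
                function(s) any(grepl(s, pages, fixed = TRUE)),
                logical(1))

# The document counts as "Present" if any of the strings was found.
status <- if (any(found)) "Present" else "Not present"

found
status
```

Using `grep` instead of `grepl` would return the indices of the matching pages, which is what the page column in the output needs.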
You can skip a couple of steps in your loop, especially when converting the PDFs to TIFFs and reading the text from them with ocr:
library(pdftools)   # pdf_convert()
library(tesseract)  # ocr()

all_files <- Sys.glob("*.pdf")
strings <- c("school", "gym", "swimming pool")

# Read text from the PDFs: one lower-cased string per page
texts <- lapply(all_files, function(x){
  img_file <- pdf_convert(x, format="tiff", dpi=400)
  return( tolower(ocr(img_file)) )
})

# Check for the presence of each word in strings
pages <- words <- vector("list", length(texts))
for(d in seq_along(texts)){
  for(w in seq_along(strings)){
    # Indices of the pages of document d that contain strings[w]
    intermed <- grep(strings[w], texts[[d]])
    # Keep the word only if it was found on at least one page
    words[[d]] <- c(words[[d]],
                    strings[w][ (length(intermed) > 0) ])
    pages[[d]] <- unique(c(pages[[d]], intermed))
  }
}

# Organize data so that it suits your wanted output
fileName <- tools::file_path_sans_ext(basename(all_files))
Page <- Map(paste0, fileName, "_", pages, collapse=", ")
# Use "-" only for documents without any matching page
# (testing for a comma would also blank out single-page matches)
Page[lengths(pages) == 0] <- "-"
Page <- t(data.frame(Page))
Words <- sapply(words, paste0, collapse=", ")
Status <- ifelse(nchar(Words) > 0, "Present", "Not present")
data.frame(row.names=fileName, Status=Status, Page=Page, Words=Words)
#      Status   Page                                    Words
# pdf1 Present  pdf1_1, pdf1_2                          gym, swimming pool
# pdf2 Present  pdf2_2, pdf2_5, pdf2_8, pdf2_3, pdf2_6  school, gym, swimming pool
It's not as readable as I'd like it to be, probably because the small requirements regarding the output need minor intermediate steps that make the code look a bit chaotic. It works well, though.
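
If readability matters more than brevity, the matching and formatting steps could be wrapped in a small function. The following is only a sketch under assumptions: `summarise_matches` is a hypothetical helper name, and `texts` is assumed to be a list with one character vector per document (one element per page), as produced by the OCR step. It uses base R only and runs on toy input here.

```r
# Hypothetical helper: builds one output row per document.
# texts:     list of character vectors, one element per page
# doc_names: document names, same length as texts
# strings:   search strings, matched literally (fixed = TRUE)
summarise_matches <- function(texts, doc_names, strings) {
  rows <- Map(function(pages, name) {
    # One integer vector per string: the pages on which it occurs
    hits <- lapply(strings, function(s) grep(s, pages, fixed = TRUE))
    found_words <- strings[lengths(hits) > 0]
    found_pages <- unique(unlist(hits))
    data.frame(
      Status = if (length(found_words) > 0) "Present" else "Not present",
      Page   = if (length(found_pages) > 0)
                 paste0(name, "_", found_pages, collapse = ", ") else "-",
      Words  = paste(found_words, collapse = ", "),
      stringsAsFactors = FALSE
    )
  }, texts, doc_names)
  out <- do.call(rbind, rows)
  rownames(out) <- doc_names
  out
}

# Toy input standing in for the lower-cased OCR output of two PDFs.
texts <- list(
  c("a gym", "a swimming pool"),
  c("nothing relevant here")
)
res <- summarise_matches(texts, c("pdf1", "pdf2"),
                         c("school", "gym", "swimming pool"))
res
```

With real data, the list returned by the OCR step and the file names without extension would take the place of the toy `texts` and document names.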