Check for multiple words in string match for text search in R

爱一瞬间的悲伤 2021-01-23 23:50

I currently have code that works for a single-word search. Can we search for multiple words and write the matched words to a data frame? (For clarification, please refer to this post.)

1 answer
  •  遥遥无期
    2021-01-24 00:17

    You go through every PDF in your directory with the outer loop. Then, in the inner loop, you go through all pages of the PDF and extract the text. For every document, you want to check whether at least one page contains "school", "gym" or "swimming pool". The values you want returned are:

    1. A vector with one element per PDF document, each either "Present" or "Not present".
    2. Three vectors of strings, containing information on which word occurs where and when.

    Right?
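
A minimal sketch of the per-page check described above, using made-up in-memory page texts in place of real OCR output (the `pages_text` vector and its contents are hypothetical). `grep()` returns the indices of the matching pages, so an empty result means the word does not occur in the document.

```r
# Hypothetical stand-in for the OCR result of one PDF: one string per page.
pages_text <- c("the school is next to the park",
                "opening hours of the gym",
                "nothing relevant on this page")
strings <- c("school", "gym", "swimming pool")

# For each search term, collect the page numbers on which it occurs.
# fixed = TRUE treats the term as a literal string, not a regular expression.
hits <- lapply(strings, function(s) grep(s, pages_text, fixed = TRUE))
names(hits) <- strings

# A term counts as found if it matches at least one page;
# the document is "Present" if any term was found.
found  <- lengths(hits) > 0
status <- if (any(found)) "Present" else "Not present"
```

Here `hits$school` holds the matching page numbers for "school", and `hits[["swimming pool"]]` is empty because no page mentions it.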

    You can skip a couple of steps in your loop, especially while transforming PDFs to TIFFs and reading texts from them with ocr:

    library(pdftools)   # provides pdf_convert()
    library(tesseract)  # provides ocr()
    
    all_files <- Sys.glob("*.pdf")
    strings   <- c("school", "gym", "swimming pool")
    
    # Read text from pdfs
    texts <- lapply(all_files, function(x){
                    img_file <- pdf_convert(x, format="tiff", dpi=400)
                    return( tolower(ocr(img_file)) )
                    })
    
    # Check for presence of each word in strings
    pages <- words <- vector("list", length(texts))
    for(d in seq_along(texts)){
      for(w in seq_along(strings)){
        intermed   <- grep(strings[w], texts[[d]], fixed = TRUE)
        words[[d]] <- c(words[[d]], 
                        strings[w][ (length(intermed) > 0) ])
        pages[[d]] <- unique(c(pages[[d]], intermed))
      }
    }
    
    # Organize data so that it suits your wanted output
    fileName <- tools::file_path_sans_ext(basename(all_files))
    
    Page <- Map(paste0, fileName, "_", pages, collapse=", ")
    # Use "-" only for documents with no matching page (a single hit has no comma)
    Page[lengths(pages) == 0] <- "-"
    Page <- t(data.frame(Page))
    
    Words    <- sapply(words, paste0, collapse=", ")
    Status   <- ifelse(sapply(Words, nchar) > 0, "Present", "Not present")
    
    data.frame(row.names=fileName, Status=Status, Page=Page, Words=Words)        
    #       Status                                   Page                      Words
    # pdf1 Present                         pdf1_1, pdf1_2         gym, swimming pool
    # pdf2 Present pdf2_2, pdf2_5, pdf2_8, pdf2_3, pdf2_6 school, gym, swimming pool
    

    It's not as readable as I'd like it to be, probably because the small requirements on the output format call for minor intermediate steps that make the code seem a bit chaotic. It works well, though.
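
One possible way to tidy this up is to summarise each document in a small helper function and bind the rows together. The sketch below is not the code from the answer: `summarise_doc` is a hypothetical helper, and `texts` is filled with made-up page texts so that it runs without `pdftools` or `tesseract`. In practice `texts` would hold the OCR output, named by file.

```r
# Hypothetical OCR output: a named list with one character vector per PDF,
# each element holding the lower-cased text of one page.
texts <- list(
  pdf1 = c("the gym is closed", "swimming pool rules"),
  pdf2 = c("annual report", "no keywords here")
)
strings <- c("school", "gym", "swimming pool")

# Summarise one document as a one-row data frame (hypothetical helper).
summarise_doc <- function(name, pages_text) {
  # Page numbers matched by each term (literal matching, not regex).
  hits  <- lapply(strings, function(s) grep(s, pages_text, fixed = TRUE))
  words <- strings[lengths(hits) > 0]   # terms found at least once
  pages <- sort(unique(unlist(hits)))   # all pages with any hit
  data.frame(
    Status = if (length(words) > 0) "Present" else "Not present",
    Page   = if (length(pages) > 0) {
      paste0(name, "_", pages, collapse = ", ")
    } else {
      "-"
    },
    Words  = paste(words, collapse = ", "),
    row.names = name,
    stringsAsFactors = FALSE
  )
}

# One row per document, stacked into a single data frame.
rows   <- lapply(names(texts), function(n) summarise_doc(n, texts[[n]]))
result <- do.call(rbind, rows)
```

For the made-up data above, `pdf1` is reported as "Present" with pages `pdf1_1, pdf1_2`, while `pdf2` is reported as "Not present" with `-` in the `Page` column.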
