How to do fuzzy pattern matching with quanteda and kwic?

Submitted by 放肆的年华 on 2020-06-27 15:08:09

Question


I have texts written by doctors and I want to be able to highlight specific words in their context (5 words before and 5 words after the word I search for in their text). Say I want to search for the word 'suicidal'. I would then use the kwic function in the quanteda package:

kwic(dataset$text, pattern = "suicidal", window = 5)

So far, so good, but say I want to allow for the possibility of typos. In this case I want to allow for three deviating characters, with no restriction on where in the word these are made.

Is it possible to do this using quanteda's kwic function?

Example:

dataset <- data.frame("patient" = 1:9, "text" = c("On his first appointment, the patient was suicidal when he showed up in my office", 
                                  "On his first appointment, the patient was suicidaa when he showed up in my office",
                                  "On his first appointment, the patient was suiciaaa when he showed up in my office",
                                  "On his first appointment, the patient was suicaaal when he showed up in my office",
                                  "On his first appointment, the patient was suiaaaal when he showed up in my office",
                                  "On his first appointment, the patient was saacidal when he showed up in my office",
                                  "On his first appointment, the patient was suaaadal when he showed up in my office",
                                  "On his first appointment, the patient was icidal when he showed up in my office",
                                  "On his first appointment, the patient was uicida when he showed up in my office"))

dataset$text <- as.character(dataset$text)
kwic(dataset$text, pattern = "suicidal", window = 5)

This returns only the first, correctly spelled sentence.


Answer 1:


Great question. We don't have approximate matching as a "valuetype" but that's an interesting idea for future development. In the meantime, I'd suggest generating a list of fixed fuzzy matches using base::agrep() and then matching on those. So this would look like:

library("quanteda")
## Package version: 1.5.2

dataset <- data.frame(
  "patient" = 1:9, "text" = c(
    "On his first appointment, the patient was suicidal when he showed up in my office",
    "On his first appointment, the patient was suicidaa when he showed up in my office",
    "On his first appointment, the patient was suiciaaa when he showed up in my office",
    "On his first appointment, the patient was suicaaal when he showed up in my office",
    "On his first appointment, the patient was suiaaaal when he showed up in my office",
    "On his first appointment, the patient was saacidal when he showed up in my office",
    "On his first appointment, the patient was suaaadal when he showed up in my office",
    "On his first appointment, the patient was icidal when he showed up in my office",
    "On his first appointment, the patient was uicida when he showed up in my office"
  ),
  stringsAsFactors = FALSE
)
corp <- corpus(dataset)

# get unique words
vocab <- tokens(corp, remove_numbers = TRUE, remove_punct = TRUE) %>%
  types()

Then use agrep() to generate the closest fuzzy matches. Here I ran this a few times, increasing max.distance slightly each time from the default of 0.1.

# get closest matches to "suicidal"
near_matches <- agrep("suicidal", vocab,
  max.distance = 0.3,
  ignore.case = TRUE, fixed = TRUE, value = TRUE
)
near_matches
## [1] "suicidal" "suicidaa" "suiciaaa" "suicaaal" "suiaaaal" "saacidal" "suaaadal"
## [8] "icidal"   "uicida"
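
As a side note on why 0.3 lines up with the "three deviating characters" in the question: according to the agrep() documentation, a fractional max.distance is multiplied by the pattern length and rounded up to the next integer. The sketch below only illustrates that arithmetic; the names max_edits, demo_vocab and demo_hits are made up for this example, and demo_vocab is a hand-written stand-in for the real vocabulary.

```r
# Fractional max.distance is scaled by the pattern length and rounded up
# (per the agrep() help page), so 0.3 on an 8-character pattern allows 3 edits.
pattern <- "suicidal"
max_edits <- ceiling(0.3 * nchar(pattern))
print(max_edits)

# Assumption for illustration: a tiny hand-written vocabulary
demo_vocab <- c("patient", "suicidal", "suiaaaal", "office")

# Passing the edit budget as an integer instead of a fraction
demo_hits <- agrep(pattern, demo_vocab,
  max.distance = max_edits,
  ignore.case = TRUE, fixed = TRUE, value = TRUE
)
print(demo_hits)
```

Stating the budget as an integer can be easier to reason about when the search terms differ in length. The near_matches vector from the earlier call is what the next step uses.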

Then, use this as the pattern argument to kwic():

# use these for fuzzy matching
kwic(corp, near_matches, window = 3)
##                                                        
##  [text1, 9] the patient was | suicidal | when he showed
##  [text2, 9] the patient was | suicidaa | when he showed
##  [text3, 9] the patient was | suiciaaa | when he showed
##  [text4, 9] the patient was | suicaaal | when he showed
##  [text5, 9] the patient was | suiaaaal | when he showed
##  [text6, 9] the patient was | saacidal | when he showed
##  [text7, 9] the patient was | suaaadal | when he showed
##  [text8, 9] the patient was |  icidal  | when he showed
##  [text9, 9] the patient was |  uicida  | when he showed

There are other possibilities based on similar solutions, for instance the fuzzyjoin or stringdist packages, but this is a simple solution from the base package that should work pretty well.
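
One such variation, sketched here as an assumption rather than as part of the original answer, is adist() from base R's utils package. agrep() looks for an approximate match of the pattern anywhere inside each string, whereas adist() returns the edit distance between whole strings, which makes a "within three edits of the whole word" rule explicit. The vocab vector below is a shortened, hand-written stand-in for the types() output above, and near_matches_adist is a name invented for this example.

```r
# Assumption: a shortened stand-in for the vocabulary from types(tokens(corp))
vocab <- c(
  "On", "his", "first", "appointment", "the", "patient", "was",
  "suicidal", "suicidaa", "suiaaaal", "icidal", "uicida",
  "when", "he", "showed", "up", "in", "my", "office"
)

# Whole-word generalized Levenshtein distance from "suicidal" to each entry;
# adist() returns a 1 x length(vocab) matrix, so drop() flattens it
dists <- drop(adist("suicidal", vocab, ignore.case = TRUE))

# Keep entries within three insertions, deletions or substitutions
near_matches_adist <- vocab[dists <= 3]
print(near_matches_adist)
```

The resulting character vector can be passed to kwic() as the pattern in the same way as near_matches.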



Source: https://stackoverflow.com/questions/59722865/how-to-do-fuzzy-pattern-matching-with-quanteda-and-kwic
