replace text using values from lookup table without for loop

半城伤御伤魂 submitted on 2021-02-17 06:31:08

Question


I'm writing a function for spelling correction. I scraped the spelling-variants page from Wikipedia and converted it into a table. I now want to use this table (spellings) as a lookup table and replace values in my documents (skills.db). NOTE: the skills data frame below is just an example; ignore the second column. I will be performing the spelling correction much earlier in the process, on resumes. Resumes are large, so I thought I'd share this instead.

I can do this with a for loop, as below, but I'm wondering if there's a better solution:

spellings = structure(list(preferred_spellings = c("organisation", "acknowledgement", 
"cypher", "anaesthesia", "analyse"), other_spellings = c(" organization", 
" acknowledgment", " cipher", " anesthesia", " analyze")), row.names = c(NA, 
5L), class = "data.frame")

skills.db = structure(list(skills = c("variance analysis static", "analyze kpi", 
"financial analysis", "variance analysis", "organizational", 
"analysis", "organize", "result analysis", "analytic", "datum analysis", 
"analytics", "business analysis", "organized", "quantitative analysis", 
"train need analysis", "analytic think", "analysis trial preparation", 
"analyze statue", "google analytics", "service analysis", "organize individual", 
"account analysis", "analyze department work", "pareto analysis train", 
"organization", "ratio analysis", "statistical analysis", "project organization", 
"organize client's file", "with good analytic", "nielsen analytics", 
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics", 
"market analysis", "analyse", "analytic skill", "superb analytic", 
"financial statement analysis", "credit analysis", "quick analysis", 
"organizational development", "outstanding financial analytic", 
"organization design", "organize conference", "business analytics", 
"industry analysis", "fs analysis", "analyze", "cash flow analysis", 
"investment analysis", "technical analysis bloomberg", "community organize", 
"monthly financial analysis", "expense variance analysis", "stock analysis"
), level1 = c("variance analysis static", "analyze kpi", "financial analysis", 
"variance analysis", "organizational", "analysis", "organize", 
"result analysis", "analytic", "datum analysis", "analytics", 
"business analysis", "organized", "quantitative analysis", "train need analysis", 
"analytic think", "analysis trial preparation", "analyze statue", 
"google analytics", "service analysis", "organize individual", 
"account analysis", "analyze department work", "pareto analysis train", 
"organization", "ratio analysis", "statistical analysis", "project organization", 
"organize client's file", "with good analytic", "nielsen analytics", 
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics", 
"market analysis", "analyse", "analytic skill", "superb analytic", 
"financial statement analysis", "credit analysis", "quick analysis", 
"organizational development", "outstanding financial analytic", 
"organization design", "organize conference", "business analytics", 
"industry analysis", "fs analysis", "analyze", "cash flow analysis", 
"investment analysis", "technical analysis bloomberg", "community organize", 
"monthly financial analysis", "expense variance analysis", "stock analysis"
)), row.names = c(49L, 65L, 77L, 82L, 155L, 190L, 215L, 244L, 
246L, 260L, 287L, 300L, 311L, 323L, 349L, 356L, 378L, 386L, 447L, 
607L, 622L, 664L, 686L, 766L, 824L, 832L, 895L, 922L, 928L, 949L, 
1020L, 1054L, 1079L, 1080L, 1081L, 1088L, 1146L, 1158L, 1228L, 
1248L, 1319L, 1366L, 1385L, 1440L, 1468L, 1475L, 1509L, 1554L, 
1584L, 1606L, 1635L, 1658L, 1660L, 1696L, 1760L, 1762L, 1798L
), class = "data.frame")

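library(dplyr)

skills.db$TEST = skills.db$skills  # start from the original text so each pass builds on the previous one
for(i in seq_len(nrow(spellings))){
    skills.db = skills.db %>% mutate(TEST = gsub(spellings$other_spellings[i], spellings$preferred_spellings[i], TEST))
}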

Answer 1:


Here's one method, using Reduce (which could easily be purrr::reduce) to iterate over each of the spellings and correct them.

spellings_list <- asplit(spellings, 1)
skills.db %>%
  mutate(
    TEST = Reduce(function(txt, spl) gsub(spl[2], spl[1], txt), spellings_list, init = skills),
    changed = (skills != TEST)
  )
#                             skills                          level1                            TEST changed
# 1         variance analysis static        variance analysis static        variance analysis static   FALSE
# 2                      analyze kpi                     analyze kpi                     analyse kpi    TRUE
# 3               financial analysis              financial analysis              financial analysis   FALSE
# 4                variance analysis               variance analysis               variance analysis   FALSE
# 5                   organizational                  organizational                  organisational    TRUE
# 6                         analysis                        analysis                        analysis   FALSE
# 7                         organize                        organize                        organize   FALSE
# 8                  result analysis                 result analysis                 result analysis   FALSE
# 9                         analytic                        analytic                        analytic   FALSE
# 10                  datum analysis                  datum analysis                  datum analysis   FALSE
# 11                       analytics                       analytics                       analytics   FALSE
# 12               business analysis               business analysis               business analysis   FALSE
# 13                       organized                       organized                       organized   FALSE
# 14           quantitative analysis           quantitative analysis           quantitative analysis   FALSE
# 15             train need analysis             train need analysis             train need analysis   FALSE
# 16                  analytic think                  analytic think                  analytic think   FALSE
# 17      analysis trial preparation      analysis trial preparation      analysis trial preparation   FALSE
# 18                  analyze statue                  analyze statue                  analyse statue    TRUE
# 19                google analytics                google analytics                google analytics   FALSE
# 20                service analysis                service analysis                service analysis   FALSE
# 21             organize individual             organize individual             organize individual   FALSE
# 22                account analysis                account analysis                account analysis   FALSE
# 23         analyze department work         analyze department work         analyse department work    TRUE
# 24           pareto analysis train           pareto analysis train           pareto analysis train   FALSE
# 25                    organization                    organization                    organisation    TRUE
# 26                  ratio analysis                  ratio analysis                  ratio analysis   FALSE
# 27            statistical analysis            statistical analysis            statistical analysis   FALSE
# 28            project organization            project organization            project organisation    TRUE
# 29          organize client's file          organize client's file          organize client's file   FALSE
# 30              with good analytic              with good analytic              with good analytic   FALSE
# 31               nielsen analytics               nielsen analytics               nielsen analytics   FALSE
# 32                 datum analytics                 datum analytics                 datum analytics   FALSE
# 33               textual analytics               textual analytics               textual analytics   FALSE
# 34                social analytics                social analytics                social analytics   FALSE
# 35 business intelligence analytics business intelligence analytics business intelligence analytics   FALSE
# 36                 market analysis                 market analysis                 market analysis   FALSE
# 37                         analyse                         analyse                         analyse   FALSE
# 38                  analytic skill                  analytic skill                  analytic skill   FALSE
# 39                 superb analytic                 superb analytic                 superb analytic   FALSE
# 40    financial statement analysis    financial statement analysis    financial statement analysis   FALSE
# 41                 credit analysis                 credit analysis                 credit analysis   FALSE
# 42                  quick analysis                  quick analysis                  quick analysis   FALSE
# 43      organizational development      organizational development      organisational development    TRUE
# 44  outstanding financial analytic  outstanding financial analytic  outstanding financial analytic   FALSE
# 45             organization design             organization design             organisation design    TRUE
# 46             organize conference             organize conference             organize conference   FALSE
# 47              business analytics              business analytics              business analytics   FALSE
# 48               industry analysis               industry analysis               industry analysis   FALSE
# 49                     fs analysis                     fs analysis                     fs analysis   FALSE
# 50                         analyze                         analyze                         analyse    TRUE
# 51              cash flow analysis              cash flow analysis              cash flow analysis   FALSE
# 52             investment analysis             investment analysis             investment analysis   FALSE
# 53    technical analysis bloomberg    technical analysis bloomberg    technical analysis bloomberg   FALSE
# 54              community organize              community organize              community organize   FALSE
# 55      monthly financial analysis      monthly financial analysis      monthly financial analysis   FALSE
# 56       expense variance analysis       expense variance analysis       expense variance analysis   FALSE
# 57                  stock analysis                  stock analysis                  stock analysis   FALSE

I added changed merely as a litmus test, assuming you know which of your inputs should differ.
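One caveat about reproducing the output above: the spellings table as posted in the question carries a leading space in every other_spellings entry. With the space left in, " analyze" never matches a string that starts with "analyze", and "project organization" would lose its separating space. The sketch below assumes those spaces are scraping residue and strips them with trimws() first. The toy data is made up for illustration and is not the full table.

```r
# Toy lookup table; the leading spaces mimic the scraped table in the question.
spellings <- data.frame(
  preferred_spellings = c("organisation", "analyse"),
  other_spellings     = c(" organization", " analyze"),
  stringsAsFactors    = FALSE
)
skills <- c("analyze kpi", "project organization", "financial analysis")

# Assumption: the leading spaces are unintended, so remove them before matching.
spellings$other_spellings <- trimws(spellings$other_spellings)

# Same reduction as above: each pass receives the output of the previous one.
spellings_list <- asplit(spellings, 1)
corrected <- Reduce(function(txt, spl) gsub(spl[2], spl[1], txt),
                    spellings_list, skills)
changed <- skills != corrected
```

If the spaces are actually meaningful in your table, skip the trimws() step and expect matches only where the variant follows a space.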

Walkthrough:

  1. Reduce is going to go over the whole column of skills for each of the spellings corrections. The input to one iteration of its function will be the output of the previous iteration, a necessary property so that we preserve the changes.

  2. Unfortunately, we can't easily use Vectorize here, and Reduce typically likes simple 2-argument functions (it isn't easily Map-able), so I break the spellings frame into a list of length-2 vectors:

    spellings_list <- asplit(spellings, 1)
    spellings_list
    # $`1`
    # preferred_spellings     other_spellings 
    #      "organisation"     " organization" 
    # $`2`
    # preferred_spellings     other_spellings 
    #   "acknowledgement"   " acknowledgment" 
    # $`3`
    # preferred_spellings     other_spellings 
    #            "cypher"           " cipher" 
    # $`4`
    # preferred_spellings     other_spellings 
    #       "anaesthesia"       " anesthesia" 
    # $`5`
    # preferred_spellings     other_spellings 
    #           "analyse"          " analyze" 
    

    This allows us to more easily use gsub(spl[2], spl[1], ...), i.e. the variant spelling is the pattern and the preferred spelling is the replacement.

  3. The art of Reduce is knowing which argument to use where, and when to use init=. It's an art. When I put myself in a position where I doubt what is being fed where, I insert a browser() in the beginning of the anon-func and run through a couple of iterations of the reduction.

  4. Suggestion: you might want to sandwich each of your other_spellings with \\b on either side of the string, to protect against partial-match replacements. For example, your spellings will also change organizational, even though that word is not literally in your lookup table, because organization matches inside it. While that one might be desired, depending on your larger list there could easily be false positives. (E.g., color/colour and Colorado.)
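The word-boundary suggestion in step 4 can be sketched as follows. This is a minimal illustration with made-up vectors (preferred, other, skills are placeholders, not the question's data), and it iterates over indices instead of using asplit so that it stays self-contained.

```r
# Hypothetical lookup vectors, already trimmed of stray whitespace.
preferred <- c("organisation", "colour")
other     <- c("organization", "color")

# Wrap each variant in \\b so that only whole words match.
patterns <- paste0("\\b", other, "\\b")

skills <- c("project organization", "organizational",
            "color chart", "colorado office")

# Apply every pattern in turn, feeding each result into the next pass.
corrected <- Reduce(function(txt, i) gsub(patterns[i], preferred[i], txt),
                    seq_along(patterns), skills)
corrected
# "project organisation" "organizational" "colour chart" "colorado office"
```

Note the trade-off: with the boundaries in place, organizational is left alone, so derived forms need their own rows in the lookup table if you want them corrected.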

(Edited: I originally swapped spl[1] and spl[2] in the gsub. Apparently there's also "logic" in the art of this :-)
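When the argument order is in doubt, a non-interactive alternative to browser() is to ask Reduce for its intermediate results with accumulate = TRUE. The sketch below uses small made-up vectors (not the question's data) and shows the text after each correction is applied.

```r
# Hypothetical lookup vectors and input text.
preferred <- c("organisation", "analyse")
other     <- c("organization", "analyze")
skills    <- c("analyze organization", "financial analysis")

# accumulate = TRUE keeps the starting value and the result of every pass.
steps <- Reduce(function(txt, i) gsub(other[i], preferred[i], txt),
                seq_along(other), skills, accumulate = TRUE)

steps[[1]]  # the untouched input
steps[[2]]  # after the first correction: "analyze organisation" "financial analysis"
steps[[3]]  # after the second correction: "analyse organisation" "financial analysis"
```

The first argument of the function (txt) is always the running result and the second (i) is the current element of the sequence, which is the order the gsub call above relies on.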



Source: https://stackoverflow.com/questions/65568874/replace-text-using-values-from-lookup-table-without-for-loop
