Question
I'm writing a function for spelling correction. I scraped the spelling-variants page from Wikipedia and converted it into a table. I now want to use this as a lookup table (spellings) and replace values in my documents (skills.db). NOTE: the skills data frame below is just an example; ignore the second column. I will be performing the spelling correction much earlier in the process, on resumes. Resumes are large, so I thought I'd share this instead.
I can do this using a for loop as below, but I'm wondering if there's a better solution.
spellings = structure(list(preferred_spellings = c("organisation", "acknowledgement",
"cypher", "anaesthesia", "analyse"), other_spellings = c(" organization",
" acknowledgment", " cipher", " anesthesia", " analyze")), row.names = c(NA,
5L), class = "data.frame")
skills.db = structure(list(skills = c("variance analysis static", "analyze kpi",
"financial analysis", "variance analysis", "organizational",
"analysis", "organize", "result analysis", "analytic", "datum analysis",
"analytics", "business analysis", "organized", "quantitative analysis",
"train need analysis", "analytic think", "analysis trial preparation",
"analyze statue", "google analytics", "service analysis", "organize individual",
"account analysis", "analyze department work", "pareto analysis train",
"organization", "ratio analysis", "statistical analysis", "project organization",
"organize client's file", "with good analytic", "nielsen analytics",
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics",
"market analysis", "analyse", "analytic skill", "superb analytic",
"financial statement analysis", "credit analysis", "quick analysis",
"organizational development", "outstanding financial analytic",
"organization design", "organize conference", "business analytics",
"industry analysis", "fs analysis", "analyze", "cash flow analysis",
"investment analysis", "technical analysis bloomberg", "community organize",
"monthly financial analysis", "expense variance analysis", "stock analysis"
), level1 = c("variance analysis static", "analyze kpi", "financial analysis",
"variance analysis", "organizational", "analysis", "organize",
"result analysis", "analytic", "datum analysis", "analytics",
"business analysis", "organized", "quantitative analysis", "train need analysis",
"analytic think", "analysis trial preparation", "analyze statue",
"google analytics", "service analysis", "organize individual",
"account analysis", "analyze department work", "pareto analysis train",
"organization", "ratio analysis", "statistical analysis", "project organization",
"organize client's file", "with good analytic", "nielsen analytics",
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics",
"market analysis", "analyse", "analytic skill", "superb analytic",
"financial statement analysis", "credit analysis", "quick analysis",
"organizational development", "outstanding financial analytic",
"organization design", "organize conference", "business analytics",
"industry analysis", "fs analysis", "analyze", "cash flow analysis",
"investment analysis", "technical analysis bloomberg", "community organize",
"monthly financial analysis", "expense variance analysis", "stock analysis"
)), row.names = c(49L, 65L, 77L, 82L, 155L, 190L, 215L, 244L,
246L, 260L, 287L, 300L, 311L, 323L, 349L, 356L, 378L, 386L, 447L,
607L, 622L, 664L, 686L, 766L, 824L, 832L, 895L, 922L, 928L, 949L,
1020L, 1054L, 1079L, 1080L, 1081L, 1088L, 1146L, 1158L, 1228L,
1248L, 1319L, 1366L, 1385L, 1440L, 1468L, 1475L, 1509L, 1554L,
1584L, 1606L, 1635L, 1658L, 1660L, 1696L, 1760L, 1762L, 1798L
), class = "data.frame")
for (i in 1:nrow(spellings)) {
  skills.db = skills.db %>% mutate(TEST = gsub(spellings$other_spellings[i], spellings$preferred_spellings[i], skills))
}
Answer 1:
Here's one method, using Reduce (which could easily be purrr::reduce) to iterate over each of the spellings and correct them.
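library(dplyr)  # for %>% and mutate()
spellings_list <- asplit(spellings, 1)  # one c(preferred, other) vector per lookup row
skills.db %>%
  # trimws() strips the leading space in the sample other_spellings so the patterns can match
  mutate(TEST = Reduce(function(txt, spl) gsub(trimws(spl[2]), spl[1], txt), spellings_list, init = skills),
         changed = (skills != TEST))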
#                             skills                          level1                            TEST changed
#  1        variance analysis static        variance analysis static        variance analysis static   FALSE
#  2                     analyze kpi                     analyze kpi                     analyse kpi    TRUE
#  3              financial analysis              financial analysis              financial analysis   FALSE
#  4               variance analysis               variance analysis               variance analysis   FALSE
#  5                  organizational                  organizational                  organisational    TRUE
#  6                        analysis                        analysis                        analysis   FALSE
#  7                        organize                        organize                        organize   FALSE
#  8                 result analysis                 result analysis                 result analysis   FALSE
#  9                        analytic                        analytic                        analytic   FALSE
# 10                  datum analysis                  datum analysis                  datum analysis   FALSE
# 11                       analytics                       analytics                       analytics   FALSE
# 12               business analysis               business analysis               business analysis   FALSE
# 13                       organized                       organized                       organized   FALSE
# 14           quantitative analysis           quantitative analysis           quantitative analysis   FALSE
# 15             train need analysis             train need analysis             train need analysis   FALSE
# 16                  analytic think                  analytic think                  analytic think   FALSE
# 17      analysis trial preparation      analysis trial preparation      analysis trial preparation   FALSE
# 18                  analyze statue                  analyze statue                  analyse statue    TRUE
# 19                google analytics                google analytics                google analytics   FALSE
# 20                service analysis                service analysis                service analysis   FALSE
# 21             organize individual             organize individual             organize individual   FALSE
# 22                account analysis                account analysis                account analysis   FALSE
# 23         analyze department work         analyze department work         analyse department work    TRUE
# 24           pareto analysis train           pareto analysis train           pareto analysis train   FALSE
# 25                    organization                    organization                    organisation    TRUE
# 26                  ratio analysis                  ratio analysis                  ratio analysis   FALSE
# 27            statistical analysis            statistical analysis            statistical analysis   FALSE
# 28            project organization            project organization            project organisation    TRUE
# 29          organize client's file          organize client's file          organize client's file   FALSE
# 30              with good analytic              with good analytic              with good analytic   FALSE
# 31               nielsen analytics               nielsen analytics               nielsen analytics   FALSE
# 32                 datum analytics                 datum analytics                 datum analytics   FALSE
# 33               textual analytics               textual analytics               textual analytics   FALSE
# 34                social analytics                social analytics                social analytics   FALSE
# 35 business intelligence analytics business intelligence analytics business intelligence analytics   FALSE
# 36                 market analysis                 market analysis                 market analysis   FALSE
# 37                         analyse                         analyse                         analyse   FALSE
# 38                  analytic skill                  analytic skill                  analytic skill   FALSE
# 39                 superb analytic                 superb analytic                 superb analytic   FALSE
# 40    financial statement analysis    financial statement analysis    financial statement analysis   FALSE
# 41                 credit analysis                 credit analysis                 credit analysis   FALSE
# 42                  quick analysis                  quick analysis                  quick analysis   FALSE
# 43      organizational development      organizational development      organisational development    TRUE
# 44  outstanding financial analytic  outstanding financial analytic  outstanding financial analytic   FALSE
# 45             organization design             organization design             organisation design    TRUE
# 46             organize conference             organize conference             organize conference   FALSE
# 47              business analytics              business analytics              business analytics   FALSE
# 48               industry analysis               industry analysis               industry analysis   FALSE
# 49                     fs analysis                     fs analysis                     fs analysis   FALSE
# 50                         analyze                         analyze                         analyse    TRUE
# 51              cash flow analysis              cash flow analysis              cash flow analysis   FALSE
# 52             investment analysis             investment analysis             investment analysis   FALSE
# 53    technical analysis bloomberg    technical analysis bloomberg    technical analysis bloomberg   FALSE
# 54              community organize              community organize              community organize   FALSE
# 55      monthly financial analysis      monthly financial analysis      monthly financial analysis   FALSE
# 56       expense variance analysis       expense variance analysis       expense variance analysis   FALSE
# 57                  stock analysis                  stock analysis                  stock analysis   FALSE
I added changed merely as a litmus test, assuming you know which of your inputs should be different.
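For the purrr::reduce variant mentioned above, a rough sketch might look like the following. It assumes purrr is installed; the two-row lookup table and the skills vector are made-up stand-ins for the real data, and other_spellings is assumed to be already trimmed of leading spaces.

```r
library(purrr)

# Hypothetical, already-trimmed stand-in for the lookup table
spellings <- data.frame(
  preferred_spellings = c("organisation", "analyse"),
  other_spellings     = c("organization", "analyze"),
  stringsAsFactors    = FALSE
)
skills <- c("analyze kpi", "project organization", "financial analysis")

# One length-2 character vector per row: c(preferred, other)
spellings_list <- asplit(spellings, 1)

# .init is the starting text; each step applies one correction to the
# result of the previous step
corrected <- reduce(
  spellings_list,
  function(txt, spl) gsub(spl[2], spl[1], txt),
  .init = skills
)
corrected
# "analyse kpi" "project organisation" "financial analysis"
```

Compared with base Reduce, the main differences are that the starting value is passed as .init rather than init, and the function comes after the list.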
Walkthrough:
Reduce is going to go over the whole column of skills for each of the spelling corrections. The input to one iteration of its function will be the output of the previous iteration, a necessary property so that we preserve the changes.
Unfortunately, we can't easily use Vectorize here, and Reduce typically likes simple 2-argument functions (it isn't easily Map-able), so I break the spellings frame into a list of length-2 vectors:
spellings_list <- asplit(spellings, 1)
spellings_list
# $`1`
# preferred_spellings     other_spellings
#      "organisation"     " organization"
# $`2`
# preferred_spellings     other_spellings
#   "acknowledgement"   " acknowledgment"
# $`3`
# preferred_spellings     other_spellings
#            "cypher"           " cipher"
# $`4`
# preferred_spellings     other_spellings
#       "anaesthesia"       " anesthesia"
# $`5`
# preferred_spellings     other_spellings
#           "analyse"          " analyze"
This allows us to more easily use gsub(spl[2], spl[1], ...).
The art of Reduce is knowing which argument to use where, and when to use init=. It's an art. When I put myself in a position where I doubt what is being fed where, I insert a browser() at the beginning of the anon-func and run through a couple of iterations of the reduction.
Suggestion: you might want to sandwich your other_spellings with \\b on either side of its string, to protect against partial-match replacements. For example, your spellings will also replace organizational even though it is not literally present in the lookup table. While that one might be desired, depending on your larger list there could easily be false positives. (E.g., color/colour and Colorado.)
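A minimal sketch of the \\b suggestion, using base R only. The color/colour row and the sample texts are hypothetical, chosen just to show a partial match being skipped; trimws() is there on the assumption that the leading spaces in other_spellings are unintended.

```r
# Hypothetical lookup rows, with leading spaces as in the sample data
spellings <- data.frame(
  preferred_spellings = c("organisation", "colour"),
  other_spellings     = c(" organization", " color"),
  stringsAsFactors    = FALSE
)
texts <- c("organization design", "organizational", "color chart", "colorado office")

# trimws() drops the stray spaces; \\b anchors each pattern at word boundaries
patterns <- paste0("\\b", trimws(spellings$other_spellings), "\\b")

# Apply each correction in turn, feeding the result into the next step
corrected <- Reduce(
  function(txt, i) gsub(patterns[i], spellings$preferred_spellings[i], txt),
  seq_along(patterns),
  init = texts
)
corrected
# "organisation design" "organizational" "colour chart" "colorado office"
```

Note the trade-off: with the boundaries in place, organizational is no longer rewritten, so derived forms would need their own rows in the lookup table if they should be corrected too.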
(Edited: I originally swapped spl[1] and spl[2] in the gsub. Apparently there's also "logic" in the art of this :-)
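As a lighter-weight alternative to stepping through with browser(), one way to check which argument is fed where is to ask Reduce for its intermediate results with accumulate = TRUE. This is only an illustrative sketch; the two pairs and the input string are made up.

```r
# Two hypothetical corrections as c(preferred, other) pairs
spellings_list <- list(
  c("organisation", "organization"),
  c("analyse", "analyze")
)

steps <- Reduce(
  # txt is the running result (first argument), spl is the current pair (second)
  function(txt, spl) gsub(spl[2], spl[1], txt),
  spellings_list,
  init = "analyze organization",
  accumulate = TRUE
)

# Shows the starting value, then the text after each correction:
# "analyze organization", "analyze organisation", "analyse organisation"
unlist(steps)
```

If spl[1] and spl[2] were swapped inside gsub, every step would come back identical to the starting value, which makes the mistake easy to spot.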
Source: https://stackoverflow.com/questions/65568874/replace-text-using-values-from-lookup-table-without-for-loop