Counting whole word/number occurrences with str_count in R

Submitted by 我的梦境 on 2019-12-20 02:56:38

Question


Similar to this case, I would like to count the number of occurrences of multiple words and numbers in a vector of sentences, using str_count from the stringr package.

But I noticed that not only whole numbers are counted but also partial numbers. For example:

df <- c("honda civic 1988 with new lights","toyota auris 4x4 140000 km","nissan skyline 2.0 159000 km")
keywords <- c("honda","civic","toyota","auris","nissan","skyline","1988","1400","159")
library(stringr)
number_of_keywords_df <- str_count(df, paste(keywords, collapse='|'))

Here I receive 3, 3, 3 for number_of_keywords_df, while it should clearly be 3, 2, 2. The str_count function seems to count the partial strings "1400" and "159" inside the numbers "140000" and "159000". Is there any way of preventing that?


Answer 1:


Using sprintf you can add word boundaries:

number_of_keywords_df <- str_count(df, paste(sprintf("\\b%s\\b", keywords), collapse = '|'))
number_of_keywords_df

Which yields

[1] 3 2 2
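
A variant of the same idea, offered as a sketch rather than as part of the original answer: wrap the whole alternation in a single non-capturing group so the word boundaries are written only once. The variable name pattern is introduced here purely for illustration.

```r
library(stringr)

df <- c("honda civic 1988 with new lights",
        "toyota auris 4x4 140000 km",
        "nissan skyline 2.0 159000 km")
keywords <- c("honda", "civic", "toyota", "auris", "nissan", "skyline",
              "1988", "1400", "159")

# One \\b on each side of a non-capturing group (?:...) applies to every alternative
pattern <- paste0("\\b(?:", paste(keywords, collapse = "|"), ")\\b")

number_of_keywords_df <- str_count(df, pattern)
number_of_keywords_df
```

This should give the same counts as the sprintf version, because each alternative is still required to start and end at a word boundary.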



Answer 2:


Try putting word boundaries around your keywords:

keywords <- c("honda","civic","toyota","auris","nissan","skyline","1988","1400","159")
keywords <- paste0("\\b", keywords, "\\b")

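In regex terms, \b is a word boundary: the position between a word character (a letter, digit, or underscore) and anything else, including the start or end of the string. In an R string literal the backslash is doubled, hence "\\b". So \bhonda\b matches only the isolated word honda; hondas would not match, because the extra letter at the end leaves no boundary after honda. For the same reason, \b1400\b does not match inside 140000.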
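
To see the effect end to end, here is a minimal sketch that reuses df from the question and compares the counts with and without boundaries. The names bounded, plain_counts, whole_counts and hondas_count are illustrative and do not come from the original posts.

```r
library(stringr)

df <- c("honda civic 1988 with new lights",
        "toyota auris 4x4 140000 km",
        "nissan skyline 2.0 159000 km")
keywords <- c("honda", "civic", "toyota", "auris", "nissan", "skyline",
              "1988", "1400", "159")

# Wrap each keyword in word boundaries
bounded <- paste0("\\b", keywords, "\\b")

# Without boundaries, "1400" and "159" also match inside "140000" and "159000"
plain_counts <- str_count(df, paste(keywords, collapse = "|"))

# With boundaries, only whole words and whole numbers are counted
whole_counts <- str_count(df, paste(bounded, collapse = "|"))

plain_counts
whole_counts

# "hondas" contains no isolated "honda", so nothing is counted here
hondas_count <- str_count("two hondas", "\\bhonda\\b")
hondas_count
```

With the data from the question, plain_counts should come out as 3 3 3 and whole_counts as 3 2 2.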



Source: https://stackoverflow.com/questions/49257263/counting-whole-word-number-occurrences-with-str-count-in-r
