How to count CAPSLOCK in string using R

Posted by 十年热恋 on 2019-12-05 09:03:50

Question


Each row of src$Review contains text in Russian, and I want to count the CAPSLOCK in each row. For example, in "My apple is GREEN" I don't want the number of capital letters; I want the number of characters written in CAPSLOCK (only "GREEN", not "My"). In other words, characters should count only when at least two letters in a row are uppercase.

I currently have the following code in my script:

capscount <- str_count(src$Review, "[А-Я]")

This counts every capital letter. I only need the characters that are in CAPSLOCK, meaning they are counted only when at least two consecutive letters in a word (e.g., "GR" in "GREEN") are uppercase.

Thank you in advance.


Answer 1:


The pattern you are looking for is "\\b[A-Z]{2,}\\b". It matches a run of two or more consecutive capital letters with a word boundary, \\b, on each side, so only words written entirely in capitals are counted. That is the overall structure; substitute the Russian alphabet where necessary.

#test string. A correct count should be 1 0 2
x <- c("My GREEN", "My Green", "MY GREEN")

library(stringr)
str_count(x, "\\b[A-Z]{2,}\\b")
#[1] 1 0 2

library(stringi)
stri_count(x, regex="\\b[A-Z]{2,}\\b")
#[1] 1 0 2

#base R
sapply(gregexpr("\\b[A-Z]{2,}\\b", x), function(m) sum(m > 0))
#[1] 1 0 2
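
For the Russian text in the question, the same structure can be kept while swapping the character class. The following is a minimal sketch, not a tested solution for src$Review: the sample vector x_ru is made up for illustration, and it assumes stringr's ICU regex engine, where \\p{Lu} matches any uppercase letter and \\b treats Cyrillic letters as word characters. Note that the range [А-Я] does not include Ё, so that letter has to be added explicitly if a range is used.

```r
library(stringr)

# Hypothetical test strings (not from the original data). Expected counts: 1 0 2
x_ru <- c("Моё яблоко ЗЕЛЁНОЕ", "Моё Яблоко", "МОЁ ЯБЛОКО")

# Any uppercase letter, in any script
str_count(x_ru, "\\b\\p{Lu}{2,}\\b")
#[1] 1 0 2

# Explicit Cyrillic range; Ё lies outside А-Я and must be listed separately
str_count(x_ru, "\\b[А-ЯЁ]{2,}\\b")
#[1] 1 0 2
```

With the plain [А-Я] range, words containing Ё (such as "ЗЕЛЁНОЕ") would not be matched as a whole, which is why the sketch lists that letter separately.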

Update

If you would like character counts instead, sum the lengths of the matches in each string:

sapply(str_extract_all(x, "\\b[A-Z]{2,}\\b"), function(m) sum(nchar(m)))
#[1] 5 0 7



Answer 2:


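Use Pierre's regex with nchar and str_extract_all. Setting simplify = TRUE returns the matches as a character matrix, and paste0 with collapse = "" concatenates all of them, so nchar returns a single total for the whole vector.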

library(stringr)

string <- c("My apple is GREEN and Her Majesty's apricot is ORANGE", "I have a LARGE sword", "My baby is sick")

# Total number of CAPSLOCK characters across all strings
nchar(
  paste0(
    str_extract_all(string = string, pattern = "\\b[A-Z]{2,}\\b", simplify = TRUE),
    collapse = "")
  )
#[1] 16
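
Since the question asks for a count in each row, a per-string variant may be closer to what is needed. The sketch below is one possible way to do it in base R, without extra packages; the names caps and caps_chars are introduced here for illustration only, and the pattern covers Latin letters, so it would need adapting for Russian text.

```r
string <- c("My apple is GREEN and Her Majesty's apricot is ORANGE",
            "I have a LARGE sword",
            "My baby is sick")

# Extract the all-caps words from each string (a list with one character vector per string)
caps <- regmatches(string, gregexpr("\\b[A-Z]{2,}\\b", string, perl = TRUE))

# Sum the lengths of the matches per string; a string with no match gives 0
caps_chars <- vapply(caps, function(m) sum(nchar(m)), integer(1))
caps_chars
#[1] 11  5  0
```

The three values add up to the same overall total as the collapsed version, but each row keeps its own count.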



Answer 3:


The qdapRegex package, which I maintain, includes a regular expression for this. It is the same as the regex @Hugh uses, but IMO it's nice to have lots of common regexes stored in a library that I can just grab. qdapRegex uses stringi as its backend, so stringi should already be available if you've installed qdapRegex.

On @Pierre Lafortune's string:

x <- c("My GREEN", "My Green", "MY GREEN")

library(qdapRegex)
stringi::stri_count_regex(x, grab("@rm_caps"))

## [1] 1 0 2

Let's look at the regex:

grab("@rm_caps")

## "(\\b[A-Z]{2,}\\b)"

On @Hugh's string:

x2 <- c("My apple is GREEN and Her Majesty's apricot is ORANGE", "I have a LARGE sword", "My baby is sick")
stringi::stri_count_regex(x2, grab("@rm_caps"))

## [1] 2 1 0


Source: https://stackoverflow.com/questions/33197733/how-to-count-capslock-in-string-using-r
