Create new variables based upon specific values

安稳与你 submitted on 2019-12-05 10:37:20

One option is the stringi package. Because the target sits at the start of each string, stri_extract_first() works well. The pattern [:alpha:]{1,} matches a run of one or more letters, so stri_extract_first() returns the first such run in each string. Likewise, stri_extract_first(x, regex = "\\d{1,}") returns the first run of digits.

x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
       "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
       "TL515 .M63 Circulating Collection, 3rd Floor",
       "D753 .F4 Circulating Collection, 3rd Floor",
       "DB89.F7 D4 Circulating Collection, 3rd Floor")

library(stringi)

data.frame(alpha = stri_extract_first(x, regex = "[:alpha:]{1,}"), 
           number = stri_extract_first(x, regex = "\\d{1,}"))

#  alpha number
#1    HV   5822
#2    QE    511
#3    TL    515
#4     D    753
#5    DB     89
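The two calls above search for letters and digits independently, so the digits found are not required to follow the letters directly. As a minimal sketch, assuming every call number begins with letters immediately followed by digits, both parts can be captured with one anchored pattern. stri_match_first_regex() is a stringi function that the answer above does not use; the names m and parts are only illustrative.

```r
library(stringi)

x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
       "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
       "TL515 .M63 Circulating Collection, 3rd Floor",
       "D753 .F4 Circulating Collection, 3rd Floor",
       "DB89.F7 D4 Circulating Collection, 3rd Floor")

# Column 1 of the result is the whole match;
# columns 2 and 3 hold the two capture groups.
m <- stri_match_first_regex(x, "^([A-Za-z]+)(\\d+)")

# Assumption: the numeric part should be stored as an integer.
parts <- data.frame(alpha = m[, 2],
                    number = as.integer(m[, 3]),
                    stringsAsFactors = FALSE)
parts
```

Strings that do not start with letters followed by digits would come back as NA in both columns, which makes malformed call numbers easy to spot.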

What about read.table() combined with gsub():

rl <- read.table(header = TRUE, text = "Call_Num
'HV5822.H4 C47 Circulating Collection, 3rd Floor'
                 'QE511.4 .G53 1982 Circulating Collection, 3rd Floor'
                 'TL515 .M63 Circulating Collection, 3rd Floor'
                 'D753 .F4 Circulating Collection, 3rd Floor'
                 'DB89.F7 D4 Circulating Collection, 3rd Floor'",
                 stringsAsFactors = FALSE)
cbind(rl, read.table(text = gsub('([A-Z]+)([0-9]+).*', '\\1 \\2', rl$Call_Num)))

#                                              Call_Num V1   V2
# 1     HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
# 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE  511
# 3        TL515 .M63 Circulating Collection, 3rd Floor TL  515
# 4          D753 .F4 Circulating Collection, 3rd Floor  D  753
# 5        DB89.F7 D4 Circulating Collection, 3rd Floor DB   89
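
The result above keeps the default column names V1 and V2, and read.table() guesses the column types. A possible variation, sketched here with base R only, names the columns and fixes their types explicitly. The names letter and number, and the choice of integer for the second column, are assumptions rather than part of the original answer.

```r
call_num <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
              "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
              "TL515 .M63 Circulating Collection, 3rd Floor",
              "D753 .F4 Circulating Collection, 3rd Floor",
              "DB89.F7 D4 Circulating Collection, 3rd Floor")

# Reduce each string to "<letters> <digits>", dropping everything after.
parsed <- gsub("^([A-Z]+)([0-9]+).*", "\\1 \\2", call_num)

# Read the two space-separated fields with explicit names and types.
parts <- read.table(text = parsed,
                    col.names = c("letter", "number"),
                    colClasses = c("character", "integer"))

res <- data.frame(Call_Num = call_num, parts, stringsAsFactors = FALSE)
res
```

Setting colClasses also guards against a column of single letters such as "T" or "F" being read as logical values.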

Claus Wilke

If you want to use stringr, the solution would probably look something like this:

df <- data.frame(Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor", "QE511.4 .G53 1982 Circulating Collection, 3rd Floor", "TL515 .M63 Circulating Collection, 3rd Floor", "D753 .F4 Circulating Collection, 3rd Floor", "DB89.F7 D4 Circulating Collection, 3rd Floor"))

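library(stringr)

# str_match() returns a character matrix: column 1 is the full match,
# columns 2 and 3 are the two capture groups.
matches <- str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
df2 <- data.frame(df, letter = matches[, 2], number = matches[, 3])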
df2
##                                                  Call_Num letter number
## 1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
## 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
## 3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
## 4          D753 .F4 Circulating Collection, 3rd Floor      D    753
## 5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89

I don't think that sticking the str_match() call into mutate() of dplyr is worth the effort, so I'd just leave it at that. Or use rawr's solution.
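
For comparison, here is a rough sketch of what the mutate() version might look like, assuming dplyr is installed. The variable pattern is a hypothetical name holding the same regular expression as the str_match() call above.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
               "TL515 .M63 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor",
               "DB89.F7 D4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

# Hypothetical helper variable: letters, digits, optional spaces, then a dot.
pattern <- "([A-Z]+)(\\d+)\\s*\\."

# str_match() is evaluated once per new column, since each mutate()
# expression has to produce a single vector.
df2 <- df %>%
  mutate(letter = str_match(Call_Num, pattern)[, 2],
         number = str_match(Call_Num, pattern)[, 3])
df2
```

Written this way the pattern is matched twice, which illustrates why the plain data.frame() version may be the simpler choice.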

You can use strapply from the gsubfn package:

library(gsubfn)

m <- strapply(as.character(df$Call_Num), '^([A-Z]+)(\\d+)', 
     ~ c(id = x, num = y), simplify = rbind)

X <- as.data.frame(m, stringsAsFactors = FALSE)

#   id  num
# 1 HV 5822
# 2 QE  511
# 3 TL  515
# 4  D  753
# 5 DB   89
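
If installing gsubfn is not an option, a similar result can probably be obtained with base R alone. The sketch below swaps strapply() for regexec() and regmatches(); it is not the gsubfn approach itself. The conversion of num to integer is an added assumption.

```r
call_num <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
              "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
              "TL515 .M63 Circulating Collection, 3rd Floor",
              "D753 .F4 Circulating Collection, 3rd Floor",
              "DB89.F7 D4 Circulating Collection, 3rd Floor")

# regexec() records the positions of the full match and each capture group;
# regmatches() turns them into one character vector per input string.
hits <- regmatches(call_num, regexec("^([A-Z]+)([0-9]+)", call_num))

# Stack the per-string vectors into a matrix:
# column 1 = full match, column 2 = letters, column 3 = digits.
m <- do.call(rbind, hits)

X <- data.frame(id = m[, 2],
                num = as.integer(m[, 3]),
                stringsAsFactors = FALSE)
X
```

One caveat: a string that does not match yields an empty vector, which rbind() silently drops, so the rows of X would no longer line up with the input.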