Question
I read up on regular expressions and Hadley Wickham's stringr and dplyr packages, but I can't figure out how to get this to work.
I have library circulation data in a data frame, with the call number stored as a character variable. I'd like to extract the initial capital letters into one new variable, and the digits between those letters and the period into a second new variable.
Call_Num
HV5822.H4 C47 Circulating Collection, 3rd Floor
QE511.4 .G53 1982 Circulating Collection, 3rd Floor
TL515 .M63 Circulating Collection, 3rd Floor
D753 .F4 Circulating Collection, 3rd Floor
DB89.F7 D4 Circulating Collection, 3rd Floor
Answer 1:
Using the stringi package, this would be one option. Since your targets sit at the beginning of the strings, stri_extract_first() works pretty well. The pattern [:alpha:]{1,} matches a run of one or more letters, so stri_extract_first() returns the first run of letters in each string. Likewise, you can get the first run of digits with stri_extract_first(x, regex = "\\d{1,}").
x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
"QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
"TL515 .M63 Circulating Collection, 3rd Floor",
"D753 .F4 Circulating Collection, 3rd Floor",
"DB89.F7 D4 Circulating Collection, 3rd Floor")
library(stringi)
data.frame(alpha = stri_extract_first(x, regex = "[:alpha:]{1,}"),
number = stri_extract_first(x, regex = "\\d{1,}"))
# alpha number
#1 HV 5822
#2 QE 511
#3 TL 515
#4 D 753
#5 DB 89
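Note that taking the first match does not by itself require the letters to sit at the very start of the string. As a minimal sketch of a stricter variant, the base R code below anchors the pattern with ^ and returns NA for any call number that does not begin with capitals followed by digits. The names call_num, pattern, ok, and parsed are illustrative and not part of the original answer.

```r
# Base R only; no packages required.
call_num <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
              "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
              "TL515 .M63 Circulating Collection, 3rd Floor",
              "D753 .F4 Circulating Collection, 3rd Floor",
              "DB89.F7 D4 Circulating Collection, 3rd Floor")

# Group 1: leading capital letters. Group 2: the digits right after them.
pattern <- "^([A-Z]+)([0-9]+).*$"

# sub() returns the input unchanged when nothing matches, so record
# which rows fit the pattern and turn the rest into NA.
ok <- grepl(pattern, call_num)

parsed <- data.frame(
  alpha  = ifelse(ok, sub(pattern, "\\1", call_num), NA_character_),
  number = ifelse(ok, sub(pattern, "\\2", call_num), NA_character_),
  stringsAsFactors = FALSE
)
parsed
```

Both columns come back as character vectors; wrap number in as.integer() if a numeric column is preferred.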
Answer 2:
What about:
rl <- read.table(header = TRUE, text = "Call_Num
'HV5822.H4 C47 Circulating Collection, 3rd Floor'
'QE511.4 .G53 1982 Circulating Collection, 3rd Floor'
'TL515 .M63 Circulating Collection, 3rd Floor'
'D753 .F4 Circulating Collection, 3rd Floor'
'DB89.F7 D4 Circulating Collection, 3rd Floor'",
stringsAsFactors = FALSE)
cbind(rl, read.table(text = gsub('([A-Z]+)([0-9]+).*', '\\1 \\2', rl$Call_Num)))
# Call_Num V1 V2
# 1 HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
# 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE 511
# 3 TL515 .M63 Circulating Collection, 3rd Floor TL 515
# 4 D753 .F4 Circulating Collection, 3rd Floor D 753
# 5 DB89.F7 D4 Circulating Collection, 3rd Floor DB 89
Answer 3:
If you want to use stringr, the solution would probably look something like this:
df <- data.frame(Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor", "QE511.4 .G53 1982 Circulating Collection, 3rd Floor", "TL515 .M63 Circulating Collection, 3rd Floor", "D753 .F4 Circulating Collection, 3rd Floor", "DB89.F7 D4 Circulating Collection, 3rd Floor"))
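library(stringr)
# str_match() returns a matrix: column 1 is the full match, columns 2 and 3 are the capture groups
matches <- str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")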
df2 <- data.frame(df, letter=matches[,2], number=matches[,3])
df2
## Call_Num letter number
## 1 HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
## 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE 511
## 3 TL515 .M63 Circulating Collection, 3rd Floor TL 515
## 4 D753 .F4 Circulating Collection, 3rd Floor D 753
## 5 DB89.F7 D4 Circulating Collection, 3rd Floor DB 89
I don't think that sticking the str_match() call into dplyr's mutate() is worth the effort, so I'd just leave it at that. Or use rawr's solution.
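For readers who do want the new columns created inside a dplyr pipeline, here is a minimal sketch of what that could look like. It assumes dplyr and stringr are installed. The anchored pattern and the as.integer() conversion are choices made for this illustration, not something the answer above prescribes.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
               "TL515 .M63 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor",
               "DB89.F7 D4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

# One pattern, two capture groups: leading capitals, then the digits that follow.
pattern <- "^([A-Z]+)(\\d+)"

df2 <- df %>%
  mutate(
    # Column 2 of the str_match() matrix holds the first capture group.
    letter = str_match(Call_Num, pattern)[, 2],
    # Column 3 holds the second group; converting to integer is optional.
    number = as.integer(str_match(Call_Num, pattern)[, 3])
  )
df2
```

Rows that do not match the pattern get NA in both new columns, because str_match() fills unmatched rows with NA.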
Answer 4:
You can use strapply from the gsubfn package:
library(gsubfn)
m <- strapply(as.character(df$Call_Num), '^([A-Z]+)(\\d+)',
~ c(id = x, num = y), simplify = rbind)
X <- as.data.frame(m, stringsAsFactors = FALSE)
# id num
# 1 HV 5822
# 2 QE 511
# 3 TL 515
# 4 D 753
# 5 DB 89
Source: https://stackoverflow.com/questions/31259619/create-new-variables-based-upon-specific-values