Question
I am working with a text document similar to the example below.
File <- c("Location Name Code and Label Frequency Percentage",
" During the past 30 days, on how many days did you carry a weapon",
"44-44 Q13 such as a gun, knife, or club on school property?",
" 1 0 days 1,610 94.5",
" 2 1 day 71 4.3",
" 3 2 or 3 days 6 0.4",
" 4 4 or 5 days 3 0.2",
" 5 6 or more days 12 0.7",
" Missing 48",
"45-45 Q14 During the past 12 months, on how many days did you carry a gun?",
" 1 0 days 1,602 91.3",
" 2 1 day 84 5.0",
" 3 2 or 3 days 17 1.2",
" 4 4 or 5 days 6 0.3",
" 5 6 or more days 38 2.2",
" Missing 3",
" During the past 30 days, on how many days did you not go to school",
"46-46 Q15 because you felt you would be unsafe at school or on your way to or",
" from school?", " 1 0 days 1,407 80.4",
" 2 1 day 180 10.9",
" 3 2 or 3 days 97 5.4",
" 4 4 or 5 days 31 1.8",
" 5 6 or more days 26 1.5",
" Missing 9",
" During the past 12 months, how many times has someone threatened",
"47-47 Q16 or injured you with a weapon such as a gun, knife, or club on school",
" property?", " 1 0 times 1,590 92.5",
" 2 1 time 93 5.7",
" 3 2 or 3 times 10 0.7",
" 4 4 or 5 times 9 0.4",
" 5 6 or 7 times 6 0.3",
" 6 8 or 9 times 0 0.0",
" 7 10 or 11 times 3 0.2",
" 8 12 or more times 2 0.1",
" Missing 37",
" 4",
"")
From this text I want to build a named vector like the one below:
Desired_Result <- c(
"q13: such as a gun, knife, or club on school property?" = "q13",
"q14: During the past 12 months, on how many days did you carry a gun?" = "q14",
"q15: because you felt you would be unsafe at school or on your way to or" = "q15",
"q16: or injured you with a weapon such as a gun, knife, or club on school" = "q16"
)
However, q13, q15 and q16 are incomplete: the rest of each question sits on the line above or below the line matched by the regular expression.
QUESTION:
How can I also select the lines above or below a regular-expression match when they meet a second regular-expression criterion, and then concatenate them in the right order?
I accomplished the Desired_Result above using the following code:
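library(stringr)  # str_trim(), str_extract(); also re-exports %>%
# Keep only the lines that contain a question code such as Q13
Qs_Lines <- grep("[a-zA-Z]*Q[0-9][0-9]?", File, perl = TRUE, value = TRUE)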
Qs_Lines <- str_trim(Qs_Lines)
Qs_Lines
# Extract Q ----
Qs <- Qs_Lines %>% str_extract("Q([0-9]){1,2}")
Qs
# Extract text after the Q[0-9][0-9]
Info_Lines <- str_extract(Qs_Lines, "[:blank:]+[a-zA-Z][a-zA-Z].*") %>% str_trim
Info_Lines
# Select lines before Qs if the sentence in Q lines is not complete
# Line_Before_Qs <- str_subset(File, "^\\s{18,19}[A-Z][a-z]") %>% str_trim()
# Line_Before_Qs <- Line_Before_Qs[1:100]
# Build one "label" = "id" entry per question
Final <- paste0("\"", tolower(Qs), ": ", Info_Lines, "\" = \"", tolower(Qs), "\"")
# Separate the entries with commas and wrap them in c( ... )
Final <- c("c(", paste(Final, collapse = ",\n"), ")")
# writeLines() prints the end result so it is easy to inspect
writeLines(Final)
Below I include two unsuccessful attempts. I think they can help in getting to the correct result.
Thanks a lot for your help
And the best in this New Year 2020
############# For loop with if #################
line_count <- length(File)
q_Line <- ""
before_q_Line <- ""
question <- ""
# For loop
for (i in 1:line_count){
if (str_detect(File[i], "\\d*-\\d*\\s*Q.\\s*") == TRUE | str_detect(File[i], "\\d*-\\d*\\s*QN.\\s*") == TRUE ) {
q_Line[i] <- File[i]
}
if(str_detect(File[i], pattern = "^\\s{18,19}[A-Z][a-z]") == TRUE){
before_q_Line[i] <- File[i]
}
}
question <- paste(before_q_Line, q_Line)
question
###############End of For loop with if ####################
Another try
############ for loop with if and while #############
for (i in 1:line_count){
if (str_detect(File[i], "\\d*-\\d*\\s*Q.\\s*") == TRUE ) {
q_line[i] <- File[i]
}
prior <- i-1
while(str_detect(File[prior], pattern = "^\\s{18,19}[A-Z][a-z]") == TRUE){
before_question [i]<- File[i-1]
}
question[i] <- str_glue(question[i], File[prior], sep = " ")
}
################ End of for loop with if and while ######################
Answer 1:
This works, starting only with File, and without any other dependencies. It uses grep to return indices rather than strings, so the neighbouring lines can be inspected: the previous line is prepended unless it contains the word "Missing", and the following line is appended unless it starts with a numeral.
Qs <- unlist(lapply(grep("[a-zA-Z]*Q[0-9][0-9]?", File, perl = TRUE),
function(x)
{
if(grepl(" +[0-9]", File[x + 1])) postfix <- "" else postfix <- File[x + 1]
if (grepl("Missing", File[x - 1])) prefix <- "" else prefix <- File[x - 1]
return(paste(prefix, File[x], postfix, sep = " "))
}))
Qs <- unlist(lapply(strsplit(Qs, "( )+"), function(x)
{
question <- gsub(" ", "", x[grep("Q[0-9][0-9]?", x)])
text <- paste(question,":", paste(x[nchar(x) > 6], collapse = " "))
names(text) <- question
return(text)
}))
Qs <- gsub(" +", " ", Qs)
This gives you a named vector, with the names being Q13 - Q16 and the text being the whole question. I think that's what you were looking for.
Qs
#> Q13
#> "Q13 : During the past 30 days, on how many days did you carry a weapon such as a gun, knife, or club on school property?"
#>
#> Q14
#> "Q14 : During the past 12 months, on how many days did you carry a gun?"
#>
#> Q15
#> "Q15 : During the past 30 days, on how many days did you not go to school because you felt you would be unsafe at school or on your way to or from school?"
#>
#> Q16
#> "Q16 : During the past 12 months, how many times has someone threatened or injured you with a weapon such as a gun, knife, or club on school property?"
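If the exact column spacing of the file cannot be relied on (for instance when runs of spaces have been collapsed), the same index-based idea can be sketched by classifying each line by its content instead. This is a minimal sketch, not the code from the answer above: `collect_questions()`, `sample_lines` and the rule for what counts as a continuation line are assumptions made for illustration. It walks upward and downward from every question line while the neighbouring lines look like plain text, then returns the result in the `"label" = "id"` shape the question asks for.

```r
# Hypothetical helper: gather each question line plus its continuation lines.
collect_questions <- function(lines) {
  trimmed   <- trimws(lines)
  # A question line looks like "44-44 Q13 <text>"; the group captures the code.
  q_pattern <- "^[0-9]+-[0-9]+\\s+(Q[0-9]+)\\s+"
  is_q      <- grepl(q_pattern, trimmed)

  # Assumption: a continuation line is non-empty, is not a question line, and
  # does not start with a digit (answer rows, page numbers), with "Missing",
  # or with the table header "Location Name".
  is_text <- nzchar(trimmed) & !is_q &
    !grepl("^([0-9]|Missing|Location\\s+Name)", trimmed)

  q_idx <- which(is_q)
  full_text <- vapply(q_idx, function(i) {
    # Extend the span upward and downward while neighbours are plain text.
    first <- i
    while (first > 1 && is_text[first - 1]) first <- first - 1
    last <- i
    while (last < length(lines) && is_text[last + 1]) last <- last + 1

    parts <- trimmed[first:last]
    # Drop the "44-44 Q13 " prefix from the question line itself.
    parts[i - first + 1] <- sub(q_pattern, "", parts[i - first + 1])
    paste(parts, collapse = " ")
  }, character(1))

  # Lower-case codes: "q13", "q14", ...
  ids <- tolower(sub(paste0(q_pattern, ".*"), "\\1", trimmed[q_idx]))
  # Names are the labels, values are the codes, as in Desired_Result.
  setNames(ids, paste0(ids, ": ", full_text))
}

# Shortened, whitespace-collapsed sample in the style of File (illustrative).
sample_lines <- c(
  "Location Name Code and Label Frequency Percentage",
  " During the past 30 days, on how many days did you carry a weapon",
  "44-44 Q13 such as a gun, knife, or club on school property?",
  " 1 0 days 1,610 94.5",
  " Missing 48",
  "45-45 Q14 During the past 12 months, on how many days did you carry a gun?",
  " 1 0 days 1,602 91.3",
  " Missing 3",
  " During the past 30 days, on how many days did you not go to school",
  "46-46 Q15 because you felt you would be unsafe at school or on your way to or",
  " from school?",
  " 1 0 days 1,407 80.4",
  " Missing 9",
  " 4",
  ""
)

result <- collect_questions(sample_lines)
writeLines(names(result))
```

Calling `collect_questions(File)` on the full vector should pick up Q16 in the same way. A label line that happens to start with a digit or with "Missing" would be skipped, so the `is_text` rule may need adjusting for other codebooks.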
Source: https://stackoverflow.com/questions/59536933/regular-expression-r-select-the-above-or-below-lines-of-a-regexp-selection-whil