Regular Expression R: Select the above or below lines of a regexp selection while meeting another regexp criteria

Question

I am working with a text document similar to the examples below.

File <- c("Location  Name                               Code and Label                            Frequency  Percentage", 
"                  During the past 30 days, on how many days did you carry a weapon", 
"44-44     Q13     such as a gun, knife, or club on school property?", 
"                  1                  0 days                                               1,610        94.5", 
"                  2                  1 day                                                   71         4.3", 
"                  3                  2 or 3 days                                              6         0.4", 
"                  4                  4 or 5 days                                              3         0.2", 
"                  5                  6 or more days                                          12         0.7", 
"                                     Missing                                                 48", 
"45-45     Q14     During the past 12 months, on how many days did you carry a gun?", 
"                  1                  0 days                                               1,602        91.3", 
"                  2                  1 day                                                   84         5.0", 
"                  3                  2 or 3 days                                             17         1.2", 
"                  4                  4 or 5 days                                              6         0.3", 
"                  5                  6 or more days                                          38         2.2", 
"                                     Missing                                                  3", 
"                  During the past 30 days, on how many days did you not go to school", 
"46-46     Q15     because you felt you would be unsafe at school or on your way to or", 
"                  from school?", "                  1                  0 days                                               1,407        80.4", 
"                  2                  1 day                                                  180        10.9", 
"                  3                  2 or 3 days                                             97         5.4", 
"                  4                  4 or 5 days                                             31         1.8", 
"                  5                  6 or more days                                          26         1.5", 
"                                     Missing                                                  9", 
"                  During the past 12 months, how many times has someone threatened", 
"47-47     Q16     or injured you with a weapon such as a gun, knife, or club on school", 
"                  property?", "                  1                  0 times                                              1,590        92.5", 
"                  2                  1 time                                                  93         5.7", 
"                  3                  2 or 3 times                                            10         0.7", 
"                  4                  4 or 5 times                                             9         0.4", 
"                  5                  6 or 7 times                                             6         0.3", 
"                  6                  8 or 9 times                                             0         0.0", 
"                  7                  10 or 11 times                                           3         0.2", 
"                  8                  12 or more times                                         2         0.1", 
"                                     Missing                                                 37", 
"                                                                                                             4", 
"")

From the text above, I want to produce a result like the one below:

Desired_Result <- c(
"q13: such as a gun, knife, or club on school property?" =  "q13",
"q14: During the past 12 months, on how many days did you carry a gun?" =  "q14",
"q15: because you felt you would be unsafe at school or on your way to or" =  "q15",
"q16: or injured you with a weapon such as a gun, knife, or club on school" =  "q16"
)

However, q13, q15 and q16 are incomplete: the rest of each question sits on the line above or below the line matched by the regular expression.

QUESTION:

How can I select the lines above or below a regular-expression match when they meet a second regular-expression criterion, and then concatenate them in the right order?

I produced the Desired_Result above with the following code:

library(stringr)  # provides str_trim(), str_extract(), str_subset() and the %>% pipe

Qs_Lines <- grep("[a-zA-Z]*Q[0-9][0-9]?", File, perl = TRUE, value = TRUE)
Qs_Lines <- str_trim(Qs_Lines)
Qs_Lines

# Extract Q ----
Qs <- Qs_Lines %>% str_extract("Q([0-9]){1,2}")
Qs

# Extract text after the Q[0-9][0-9]
Info_Lines <- str_extract(Qs_Lines, "[:blank:]+[a-zA-Z][a-zA-Z].*") %>% str_trim
Info_Lines

# Select lines before Qs if the sentence in Q lines is not complete

# Line_Before_Qs <-  str_subset(File, "^\\s{18,19}[A-Z][a-z]") %>% str_trim()
# Line_Before_Qs <-  Line_Before_Qs[1:100]


# Paste expression results and text
# paste0() has no sep argument, so none is passed
Final <- paste0("\"", tolower(Qs), ": ", Info_Lines, "\"", " = ", " \"", tolower(Qs), "\"", ",")

# Wrap the result in c( ... ) --------------------

Final <- c("c(", Final, ")")

# writeLines() displays the end result ----------------------

writeLines(
Final
)

Below are two unsuccessful attempts. I think they may help in getting to the correct result.

Thanks a lot for your help

And the best in this New Year 2020

############# For loop with if #################
line_count <- length(File)
q_Line <- ""
before_q_Line <- ""
question <- ""

# For loop
for (i in 1:line_count){

  if (str_detect(File[i], "\\d*-\\d*\\s*Q.\\s*") == TRUE  | str_detect(File[i], "\\d*-\\d*\\s*QN.\\s*") == TRUE ) {

    q_Line[i] <- File[i]
  }

  if(str_detect(File[i], pattern = "^\\s{18,19}[A-Z][a-z]") == TRUE){

    before_q_Line[i] <- File[i]
  } 
}

question <- paste(before_q_Line, q_Line)

question
###############End of For loop with if ####################

Another try

############ for loop with if and while #############
for (i in 1:line_count){

  if (str_detect(File[i], "\\d*-\\d*\\s*Q.\\s*") == TRUE ) {

    q_line[i] <- File[i]
  }

prior <- i-1
    while(str_detect(File[prior], pattern = "^\\s{18,19}[A-Z][a-z]") == TRUE){

      before_question [i]<- File[i-1]

    }

question[i] <- str_glue(question[i], File[prior], sep = " ")
}
################ End of for loop with if and while ######################

Answer 1:

This works starting from File alone, with no other dependencies. It uses grep to return indices rather than strings, so each match can optionally pull in the previous line (if it doesn't contain the word "Missing") and the following line (if it doesn't look like a response row, that is, spaces followed by a numeral).

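# grep() returns the indices of the Q lines; each neighbour is then checked.
Qs <- unlist(lapply(grep("[a-zA-Z]*Q[0-9][0-9]?", File, perl = TRUE),
function(x)
{
  # Skip the next line if it is a response row (spaces followed by a digit).
  if (grepl(" +[0-9]", File[x + 1])) postfix <- "" else postfix <- File[x + 1]
  # Skip the previous line if it is the "Missing" row of the preceding question.
  if (grepl("Missing", File[x - 1])) prefix <- "" else prefix <- File[x - 1]
  return(paste(prefix, File[x], postfix, sep = "  "))
}))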

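# Split each joined string on runs of double spaces and rebuild "Qxx : text".
Qs <- unlist(lapply(strsplit(Qs, "(  )+"), function(x)
{
  # The piece holding the question id, with stray spaces removed.
  question <- gsub(" ", "", x[grep("Q[0-9][0-9]?", x)])
  # Pieces longer than 6 characters are question text; shorter ones are ids and locations.
  text <- paste(question,":", paste(x[nchar(x) > 6], collapse = " "))
  names(text) <- question
  return(text)
}))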

Qs <- gsub(" +", " ", Qs)

This gives you a named vector, with the names being Q13 - Q16 and the text being the whole question. I think that's what you were looking for.
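
If the exact layout of Desired_Result from the question is needed (lower-case id, "q13: text" = "q13"), the named vector can be reshaped with base R. The following is only a minimal sketch, assuming the "Qxx : text" format produced by this answer; answer_vec is a hypothetical two-entry stand-in for Qs, and ids, texts, desired and entries are illustrative names.

```r
# Hypothetical stand-in for the named vector Qs built above (two entries only).
answer_vec <- c(
  Q13 = "Q13 : During the past 30 days, on how many days did you carry a weapon such as a gun, knife, or club on school property?",
  Q14 = "Q14 : During the past 12 months, on how many days did you carry a gun?"
)

ids   <- tolower(names(answer_vec))                  # lower-case ids taken from the names
texts <- unname(sub("^Q[0-9]+ : ", "", answer_vec))  # strip the leading "Qxx : " prefix

# Names carry "qxx: <question>", values carry the bare id.
desired <- setNames(ids, paste0(ids, ": ", texts))

# Print in the c("label" = "id", ...) layout used in the question,
# with a comma after every entry except the last.
entries <- sprintf("\"%s\" = \"%s\"", names(desired), desired)
commas  <- c(rep(",", length(entries) - 1), "")
writeLines(c("c(", paste0(entries, commas), ")"))
```

Leaving the comma off the last entry matters if the printed text is pasted back into R, because c() rejects an empty trailing argument.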

Qs
#> Q13 
#> "Q13 : During the past 30 days, on how many days did you carry a weapon such as a gun, knife, or club on school property?" 
#> 
#> Q14
#> "Q14 : During the past 12 months, on how many days did you carry a gun?" 
#>
#> Q15 
#> "Q15 : During the past 30 days, on how many days did you not go to school because you felt you would be unsafe at school or on your way to or from school?" 
#>
#> Q16 
#> "Q16 : During the past 12 months, how many times has someone threatened or injured you with a weapon such as a gun, knife, or club on school property?"
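
One caveat about the neighbour checks: the pattern " +[0-9]" is not anchored, so it matches a space followed by a digit anywhere in a line. A continuation line that happened to contain a number (for example "30 days") would be treated as a response row and dropped. Below is a minimal sketch of an anchored variant, assuming question text always starts after exactly 18 spaces, as it does in the sample. The names demo_lines, pad, is_text and questions are hypothetical, and the excerpt is rebuilt with strrep() so the padding is explicit.

```r
pad <- strrep(" ", 18)  # assumed combined width of the Location and Name columns

# Hypothetical excerpt in the same layout as File (response rows shortened).
demo_lines <- c(
  paste0(pad, "During the past 30 days, on how many days did you not go to school"),
  paste0("46-46     Q15     ", "because you felt you would be unsafe at school or on your way to or"),
  paste0(pad, "from school?"),
  paste0(pad, "1                  0 days        1,407        80.4"),
  paste0(strrep(" ", 37), "Missing        9"),
  paste0("45-45     Q14     ", "During the past 12 months, on how many days did you carry a gun?"),
  paste0(pad, "1                  0 days        1,602        91.3")
)

# Indices (not strings) of the lines that carry a question id.
q_idx <- grep("Q[0-9]{1,2}", demo_lines)

# TRUE only for an in-range line made of exactly 18 spaces followed by a letter.
is_text <- function(i) {
  i >= 1 && i <= length(demo_lines) && grepl("^ {18}[A-Za-z]", demo_lines[i])
}

questions <- vapply(q_idx, function(i) {
  parts <- c(
    if (is_text(i - 1)) demo_lines[i - 1],               # text line above, if any
    sub("^\\S+\\s+Q[0-9]{1,2}\\s*", "", demo_lines[i]),  # Q line without location and id
    if (is_text(i + 1)) demo_lines[i + 1]                # text line below, if any
  )
  paste(trimws(parts), collapse = " ")
}, character(1))

names(questions) <- regmatches(demo_lines[q_idx], regexpr("Q[0-9]{1,2}", demo_lines[q_idx]))
questions
```

Anchoring on the column position also accepts continuation lines that begin with a lower-case word such as "from school?", which the "^\\s{18,19}[A-Z][a-z]" pattern tried in the question would skip.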

Source: https://stackoverflow.com/questions/59536933/regular-expression-r-select-the-above-or-below-lines-of-a-regexp-selection-whil

Tags

regex

stringr