Parsing Interview Text

Question

I have a text file of a presidential debate. Eventually, I want to parse the text into a dataframe where each row is a statement, with one column with the speaker's name and another column with the statement. For example:

"Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

Would become:

  name         text
1 Bob Smith    Hi Steve. How are you doing?
2 Steve Brown  Hi Bob. I'm doing well!

Question: How do I split the statements from the names? I tried splitting on the colon:

data <- strsplit(data, split=":")

But then I get this:

"Bob Smith" "Hi Steve. How are you doing? Steve Brown" "Hi Bob. I'm doing well!"

When what I want is this:

"Bob Smith" "Hi Steve. How are you doing?" "Steve Brown" "Hi Bob. I'm doing well!"

Answer 1:

I doubt this will cover all of your parsing needs, but one way to answer your most immediate question with strsplit is to use lookaround assertions. These require Perl-compatible regex, so you need to pass perl=TRUE.

Here you instruct strsplit to split on either a colon followed by a space, or a space that has a punctuation character immediately before it and nothing but word characters or spaces between it and the next colon. \\pP matches punctuation characters and \\w matches word characters (letters, digits, and underscore).

data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"
strsplit(data,split="(: |(?<=\\pP) (?=[\\w ]+:))",perl=TRUE)
[[1]]
[1] "Bob Smith"                    "Hi Steve. How are you doing?" "Steve Brown"                 
[4] "Hi Bob. I'm doing well!"
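Since the end goal is a dataframe, the split result can be reshaped directly. The following is a minimal sketch that assumes the vector strictly alternates name, statement, name, statement, as it does for the sample string; the variable names parts and df are illustrative only. A statement that itself contains a colon followed by a space would break the alternation.

```r
data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

# Same split as above; [[1]] pulls the character vector out of the list
parts <- strsplit(data, split = "(: |(?<=\\pP) (?=[\\w ]+:))", perl = TRUE)[[1]]

# Assumption: odd positions hold names, even positions hold statements
df <- data.frame(
  name = parts[c(TRUE, FALSE)],   # logical index is recycled: elements 1, 3, ...
  text = parts[c(FALSE, TRUE)],   # elements 2, 4, ...
  stringsAsFactors = FALSE
)
df
```

It may be worth checking that length(parts) is even before building the dataframe, since an odd length means a name or statement was split in an unexpected place.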

Answer 2:

We can extract these with a regex using the stringr package, which gives you the speaker and statement columns directly. Note that the pattern below assumes single-word names and one-sentence statements ending in a period.

a <- "Bob: Hi Steve. Steve: Hi Bob."

library(stringr)

str_match_all(a, "([A-Za-z]*?): (.*?\\.)")
#> [[1]]
#>      [,1]             [,2]    [,3]       
#> [1,] "Bob: Hi Steve." "Bob"   "Hi Steve."
#> [2,] "Steve: Hi Bob." "Steve" "Hi Bob."
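For the string in the question, which has two-word names and several sentences per statement, the pattern needs to be more permissive. The sketch below is one possible adaptation, not a general parser: it assumes every speaker name is one or more capitalized words made of ASCII letters and separated by single spaces, and that a statement runs until the next such name followed by a colon, or until the end of the string. The names name_re, pattern, m, and df are illustrative only.

```r
library(stringr)

data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

# Assumption: a name is one or more capitalized words, e.g. "Bob Smith"
name_re <- "[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"

# Group 1: the name. Group 2: the statement, matched lazily up to the
# next " Name: " (checked with a lookahead) or the end of the string.
pattern <- paste0("(", name_re, "): (.*?)(?= ", name_re, ": |$)")

m <- str_match_all(data, pattern)[[1]]

# Column 1 is the full match; columns 2 and 3 are the capture groups
df <- data.frame(name = m[, 2], text = m[, 3], stringsAsFactors = FALSE)
df
```

A capitalized phrase followed by a colon inside a statement (for example "Note: ...") would be mistaken for a speaker, so for a real transcript it may be safer to match against a known list of speaker names.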

Source: https://stackoverflow.com/questions/60778339/parsing-interview-text

Tags

regex

tidyverse

stringr