Question
I am trying to tokenize a sentence as follows.
Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)
When I tokenize it with the code below (which uses stringr, purrr, and tidyr from the tidyverse rather than tidytext itself),
library(tidyverse)

AA <- df %>%
  mutate(tokens = str_extract_all(Section, "([^\\s]+)"),
         locations = str_locate_all(Section, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(tokens, locations)
it gives me tokens with the punctuation still attached, e.g. 'occurs,' and 'infusion.'.
How do I get the comma and the period as independent tokens, rather than as part of 'occurs,' and 'infusion.' respectively, using tidytext? My tokens should be:
If
an
infusion
reaction
occurs
,
interrupt
the
infusion
.
Answer 1:
Replace the punctuation marks beforehand, making sure to add a space before each replacement. Then split the sentence at the spaces.
include = c(".", ",")  # the symbols that should become separate tokens
mystr = Section        # copy data

# insert a space before each symbol so strsplit() will separate it
for (mypattern in include) {
  mystr = gsub(pattern = mypattern,
               replacement = paste0(" ", mypattern),
               x = mystr, fixed = TRUE)
}

lapply(strsplit(mystr, " "), function(V) data.frame(Tokens = V))
#[[1]]
# Tokens
#1 If
#2 an
#3 infusion
#4 reaction
#5 occurs
#6 ,
#7 interrupt
#8 the
#9 infusion
#10 .
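For reuse, the same replace-then-split idea can be wrapped in a small helper. This is a minimal sketch of my own; the name tokenize_keep_punct() is illustrative and not part of the original answer.

# Sketch: replace-then-split as a reusable function (illustrative name,
# not from the original answer)
tokenize_keep_punct <- function(x, include = c(".", ",")) {
  for (mypattern in include) {
    x <- gsub(mypattern, paste0(" ", mypattern), x, fixed = TRUE)
  }
  strsplit(x, " ", fixed = TRUE)
}

tokenize_keep_punct(Section)
# yields: "If" "an" "infusion" "reaction" "occurs" "," "interrupt" "the" "infusion" "."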
Answer 2:
Note that this approach increases the length of your string, so the start/end positions below refer to the padded string, not the original:
library(tidyverse)

df %>%
  mutate(Section = gsub("([,.])", ' \\1', Section),
         start = gregexpr("\\S+", Section),
         # note: end = start + match length, i.e. one past the token's last character
         end = list(attr(start[[1]], "match.length") + unlist(start)),
         Section = strsplit(Section, "\\s+")) %>%
  unnest()
Section start end
1 If 1 3
2 an 4 6
3 infusion 7 15
4 reaction 16 24
5 occurs 25 31
6 , 32 33
7 interrupt 34 43
8 the 44 47
9 infusion 48 56
10 . 57 58
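As a quick check (mine, not part of the original answer), the positions index the padded string and end is exclusive, so substring(padded, start, end - 1) recovers each token:

# Check: positions refer to the padded string; `end` is one past the last character
padded <- gsub("([,.])", " \\1", Section)
substring(padded, 25, 31 - 1)  # "occurs"    (row 5)
substring(padded, 34, 43 - 1)  # "interrupt" (row 7)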
Answer 3:
Here's a way to do it without replacing anything first. The trick is to use the [[:punct:]] wildcard, which matches any of:
!"#$%&'()*+,\-./:;<=>?@[\]^_`{|}~
The pattern is simply \\w+|[[:punct:]], which says: match consecutive word characters, or a single punctuation character. str_extract_all takes care of the rest, pulling each one out separately. If you only want to split out specific punctuation marks, you can use \\w+|[,.] or similar instead.
library(tidyverse)

AA <- df %>%
  mutate(tokens = str_extract_all(Section, "\\w+|[[:punct:]]"),
         locations = str_locate_all(Section, "\\w+|[[:punct:]]"),
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(tokens, locations)
AA
tokens start end
1 If 1 2
2 an 4 5
3 infusion 7 14
4 reaction 16 23
5 occurs 25 30
6 , 31 31
7 interrupt 33 41
8 the 43 45
9 infusion 47 54
10 . 55 55
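For completeness, here is the restricted variant mentioned above as a standalone call (my sketch, reusing the Section vector from the question). Only commas and periods become separate tokens; any other punctuation would simply be dropped by this pattern.

str_extract_all(Section, "\\w+|[,.]")
# yields: "If" "an" "infusion" "reaction" "occurs" "," "interrupt" "the" "infusion" "."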
Answer 4:
The function unnest_tokens() has a strip_punct argument, for tokenizers such as the word tokenizer.
library(tidyverse)
library(tidytext)
df %>%
  unnest_tokens(word, Section, strip_punct = FALSE)
#> # A tibble: 10 x 1
#> word
#> <chr>
#> 1 if
#> 2 an
#> 3 infusion
#> 4 reaction
#> 5 occurs
#> 6 ,
#> 7 interrupt
#> 8 the
#> 9 infusion
#> 10 .
Created on 2018-08-15 by the reprex package (v0.2.0).
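One caveat (my note, not part of the original answer): unnest_tokens() lowercases tokens by default, which is why the output shows "if" rather than "If". Passing to_lower = FALSE keeps the original case:

# unnest_tokens() lowercases by default (to_lower = TRUE);
# pass to_lower = FALSE to keep the original case, e.g. "If"
df %>%
  unnest_tokens(word, Section, strip_punct = FALSE, to_lower = FALSE)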
Source: https://stackoverflow.com/questions/51850625/tokenizing-issue