I am trying to tokenize a sentence as follows.
Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)
When I tokenize using tidytext and the code below,
AA <- df %>%
  mutate(tokens = str_extract_all(df$Section, "([^\\s]+)"),
         locations = str_locate_all(df$Section, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(tokens, locations)
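the punctuation stays attached to the preceding word, so I get 'occurs,' and 'infusion.' as tokens.
How do I get the comma and the period as independent tokens, rather than as part of 'occurs,' and 'infusion.' respectively, using tidytext? The comma and the period should each appear as a separate token.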
Insert a space before each punctuation mark you want to keep beforehand, then split the sentence at spaces.
include = c(".", ",") # The symbols that should be kept as tokens
mystr = Section # copy data
for (mypattern in include){
  # fixed = TRUE treats "." literally instead of as a regex wildcard
  mystr = gsub(pattern = mypattern,
               replacement = paste0(" ", mypattern),
               x = mystr, fixed = TRUE)
}
lapply(strsplit(mystr, " "), function(V) data.frame(Tokens = V))
# Tokens
#1 If
#2 an
#3 infusion
#4 reaction
#5 occurs
#6 ,
#7 interrupt
#8 the
#9 infusion
#10 .
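As a minimal sketch, the loop above can be wrapped in a small base R function. The name pad_and_split and its marks argument are hypothetical, introduced only for illustration. The second call shows a caveat of literal replacement: a decimal point is padded as well, so "5.5" is split into "5" and ".5".

```r
# Hypothetical helper (not from any package): pad each listed punctuation
# mark with a leading space, then split on runs of whitespace.
pad_and_split <- function(x, marks = c(".", ",")) {
  for (m in marks) {
    # fixed = TRUE matches the mark literally, not as a regex
    x <- gsub(m, paste0(" ", m), x, fixed = TRUE)
  }
  strsplit(trimws(x), "\\s+")
}

sentence_tokens <- pad_and_split(
  "If an infusion reaction occurs, interrupt the infusion."
)[[1]]

# Caveat: the decimal point is padded too, giving "5" and ".5"
decimal_tokens <- pad_and_split("Dose: 5.5 mg, once.")[[1]]
```

If decimal numbers matter in your text, a regex-based extraction that treats numbers as a unit is likely a better fit than literal padding.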
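Alternatively, insert the space with a single regular expression. Note that padding increases the length of your string, so the start and end positions below refer to the padded string rather than the original, and end is the position just past each token: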
df %>%
  mutate(Section = gsub("([,.])",' \\1',Section),
         start = gregexpr("\\S+",Section),
         end = list(attr(start[[1]],"match.length")+unlist(start)),
         Section = strsplit(Section,"\\s+")) %>%
  unnest(cols = c(Section, start, end)) # tidyr < 1.0.0: plain unnest()
Section start end
1 If 1 3
2 an 4 6
3 infusion 7 15
4 reaction 16 24
5 occurs 25 31
6 , 32 33
7 interrupt 34 43
8 the 44 47
9 infusion 48 56
10 . 57 58
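The positions above are measured in the padded string. If positions in the original string are needed, one possible base R sketch is to match tokens directly with gregexpr() and read them back with regmatches(), without modifying the input. The variable names are illustrative, and end here is inclusive (the last character of each token).

```r
x <- "If an infusion reaction occurs, interrupt the infusion."

# Match runs of word characters, or a single punctuation character
m <- gregexpr("\\w+|[[:punct:]]", x)

tokens <- regmatches(x, m)[[1]]
start  <- as.integer(m[[1]])                          # 1-based start of each match
end    <- start + attr(m[[1]], "match.length") - 1L   # inclusive end

result <- data.frame(tokens, start, end)
# The comma is reported at 31-31 and the final period at 55-55,
# i.e. their positions in the unmodified sentence.
```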
Here's a way to do it without replacing anything first. The trick is to use the [[:punct:]] character class, which matches any punctuation character.
The pattern is simply \\w+|[[:punct:]], which says: match a run of consecutive word characters, or a single punctuation character. str_extract_all takes care of the rest, pulling each match out separately. If you only want to split out specific punctuation marks, you can use \\w+|[,.] or similar instead.
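library(dplyr)
library(stringr)
library(purrr)
library(tidyr)

AA <- df %>% mutate(
    tokens = str_extract_all(Section, "\\w+|[[:punct:]]"),
    locations = str_locate_all(Section, "\\w+|[[:punct:]]"),
    locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(cols = c(tokens, locations)) # tidyr < 1.0.0: unnest(tokens, locations)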
tokens start end
1 If 1 2
2 an 4 5
3 infusion 7 14
4 reaction 16 23
5 occurs 25 30
6 , 31 31
7 interrupt 33 41
8 the 43 45
9 infusion 47 54
10 . 55 55
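To make the difference between the two patterns concrete, here is a small sketch on a made-up sentence (not from the question) that contains parentheses as well as a comma and a period. With [[:punct:]] every punctuation mark becomes a token; with [,.] the parentheses are simply never matched and so drop out of the result.

```r
library(stringr)

x <- "Stop (now), please."  # illustrative input, assumed for this example

# Every punctuation character becomes its own token
all_punct <- str_extract_all(x, "\\w+|[[:punct:]]")[[1]]
# "Stop" "(" "now" ")" "," "please" "."

# Only commas and periods are kept; other punctuation is skipped
some_punct <- str_extract_all(x, "\\w+|[,.]")[[1]]
# "Stop" "now" "," "please" "."
```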
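The function unnest_tokens() has a strip_punct argument for tokenizers such as the word tokenizer. Set it to FALSE to keep punctuation marks as separate tokens. Note that tokens are lowercased by default.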
library(dplyr)
library(tidytext)
df %>%
  unnest_tokens(word, Section, strip_punct = FALSE)
#> # A tibble: 10 x 1
#> word
#> <chr>
#> 1 if
#> 2 an
#> 3 infusion
#> 4 reaction
#> 5 occurs
#> 6 ,
#> 7 interrupt
#> 8 the
#> 9 infusion
#> 10 .
Created on 2018-08-15 by the reprex package (v0.2.0).
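The output above is lowercased ("if" rather than "If"). If the original casing should be preserved, a sketch along the following lines may help; it assumes the to_lower argument of unnest_tokens() is combined with strip_punct, and it rebuilds the example data frame so the snippet is self-contained.

```r
library(dplyr)
library(tidytext)

df <- data.frame(
  Section = "If an infusion reaction occurs, interrupt the infusion.",
  stringsAsFactors = FALSE
)

# strip_punct = FALSE keeps "," and "." as tokens;
# to_lower = FALSE leaves the casing of each token untouched
tokens_cased <- df %>%
  unnest_tokens(word, Section, strip_punct = FALSE, to_lower = FALSE)

tokens_cased$word
```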