Tokenizing issue

Submitted by 十年热恋 on 2020-01-25 10:47:05

Question


I am trying to tokenize a sentence as follows.

Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)

When I tokenize using tidytext and the code below,

AA <- df %>%
  mutate(tokens = str_extract_all(df$Section, "([^\\s]+)"),
         locations = str_locate_all(df$Section, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(tokens, locations) 

it gives me a result set as below (see image).

How do I get the comma and the period as independent tokens, rather than as part of 'occurs,' and 'infusion.' respectively, using tidytext? So my tokens should be

If
an
infusion
reaction
occurs
,
interrupt
the
infusion
.

Answer 1:


Replace them with something else beforehand, making sure to add a space before each replacement. Then split the string at spaces.

include = c(".", ",") #The symbols that should be included

mystr = Section  # copy data
for (mypattern in include){
    mystr = gsub(pattern = mypattern,
                 replacement = paste0(" ", mypattern),
                 x = mystr, fixed = TRUE)
}
lapply(strsplit(mystr, " "), function(V) data.frame(Tokens = V))
#[[1]]
#      Tokens
#1         If
#2         an
#3   infusion
#4   reaction
#5     occurs
#6          ,
#7  interrupt
#8        the
#9   infusion
#10         .
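The loop over include can also be collapsed into a single gsub() call using a character class — a minimal sketch of the same replace-then-split idea, assuming only "." and "," need isolating:

```r
Section <- c("If an infusion reaction occurs, interrupt the infusion.")

# Insert a space before each "." or "," in one pass, then split on whitespace
padded <- gsub("([.,])", " \\1", Section)
tokens <- strsplit(padded, "\\s+")[[1]]
tokens
# "If" "an" "infusion" "reaction" "occurs" "," "interrupt" "the" "infusion" "."
```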



Answer 2:


Note that inserting the spaces increases the length of your string, so the start/end positions below refer to the modified string:

df%>%
  mutate(Section =  gsub("([,.])",' \\1',Section),
  start = gregexpr("\\S+",Section),
  end = list(attr(start[[1]],"match.length")+unlist(start)),
  Section = strsplit(Section,"\\s+"))%>%
  unnest()

     Section start end
1         If     1   3
2         an     4   6
3   infusion     7  15
4   reaction    16  24
5     occurs    25  31
6          ,    32  33
7  interrupt    34  43
8        the    44  47
9   infusion    48  56
10         .    57  58
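Because the spaces are inserted before the positions are computed, the reported start/end values are offset relative to the original sentence — the comma is reported at 32 above, but sits at position 31 in the original. A quick check:

```r
s <- "If an infusion reaction occurs, interrupt the infusion."

# Position of the comma in the original vs. the space-padded string
regexpr(",", s, fixed = TRUE)[1]                         # 31 in the original
regexpr(",", gsub("([,.])", " \\1", s), fixed = TRUE)[1] # 32 after padding
```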



Answer 3:


Here's a way to do it without replacing anything first. The trick is to use the [[:punct:]] wildcard, which matches any of:

!"#$%&'()*+,\-./:;<=>?@[\]^_`{|}~

The pattern is simply \\w+|[[:punct:]], which says: match a run of consecutive word characters, or a single punctuation character. str_extract_all takes care of the rest, pulling each match out separately. If you only want to split out specific punctuation marks, you can use \\w+|[,.] or similar instead.

AA <- df %>% mutate(
     tokens = str_extract_all(Section, "\\w+|[[:punct:]]"),
     locations = str_locate_all(Section, "\\w+|[[:punct:]]"),
     locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(tokens, locations)

      tokens start end
1         If     1   2
2         an     4   5
3   infusion     7  14
4   reaction    16  23
5     occurs    25  30
6          ,    31  31
7  interrupt    33  41
8        the    43  45
9   infusion    47  54
10         .    55  55



Answer 4:


The function unnest_tokens() has a strip_punct argument for tokenizers such as the word tokenizer.

library(tidyverse)
library(tidytext)

df %>%
  unnest_tokens(word, Section, strip_punct = FALSE)
#> # A tibble: 10 x 1
#>    word     
#>    <chr>    
#>  1 if       
#>  2 an       
#>  3 infusion 
#>  4 reaction 
#>  5 occurs   
#>  6 ,        
#>  7 interrupt
#>  8 the      
#>  9 infusion 
#> 10 .

Created on 2018-08-15 by the reprex package (v0.2.0).
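For reference, unnest_tokens() delegates its word tokenization to the tokenizers package, so the same strip_punct behavior is available there directly — a sketch, assuming tokenizers is installed (tidytext already depends on it):

```r
library(tokenizers)

# tokenize_words() lowercases by default and, with strip_punct = FALSE,
# keeps punctuation marks as separate tokens
tokenize_words("If an infusion reaction occurs, interrupt the infusion.",
               strip_punct = FALSE)
# "if" "an" "infusion" "reaction" "occurs" "," "interrupt" "the" "infusion" "."
```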



Source: https://stackoverflow.com/questions/51850625/tokenizing-issue
