I am trying to use unnest_tokens with Spanish text. It works fine with unigrams, but it breaks the special characters in bigrams. The code works fine on Linux; I have added some info on my locale below.
library(tidytext)
library(dplyr)
df <- data_frame(
  text = "César Moreira Nuñez"
)
# works ok:
df %>%
unnest_tokens(word, text)
# # A tibble: 3 x 1
# word
# <chr>
# 1 césar
# 2 moreira
# 3 nuñez
# breaks é and ñ
df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2 )
# # A tibble: 2 x 1
# bigram
# <chr>
# 1 cã©sar moreira
# 2 moreira nuã±ez
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
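For completeness, base R's l10n_info() gives the same picture; on this Windows locale I'd expect it to report that the native encoding is not UTF-8 (codepage 1252):
l10n_info()
# returns a list of flags (MBCS, UTF-8, Latin-1) plus, on Windows, the active codepage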
It seems to happen when you change the token argument to "ngrams". I'm not sure why it does that, but here is a workaround using the qlcMatrix package:
library(qlcMatrix)
splitStrings(df$text, sep = ' ', bigrams = TRUE, boundary = FALSE, bigram.binder = ' ')$bigrams
#[1] "César Moreira" "Moreira Nuñez"
Digging into the source code for tidytext, it looks like words and ngrams are split using the tokenizers package. Those functions use different methods: tokenize_words uses stri_split, whereas tokenize_ngrams uses custom C++ code. I imagine the final step, switching between R and C++ data types, garbles the diacritics, although I can't explain precisely why.
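One way to probe that guess (a diagnostic sketch, not a fix) is to check what encoding R thinks the returned strings have, and whether the raw bytes are in fact valid UTF-8:
library(stringi)
out <- df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
Encoding(out$bigram)        # declared encoding: "UTF-8", "latin1" or "unknown"
stri_enc_mark(out$bigram)   # stringi's reading of the encoding mark
stri_enc_isutf8(out$bigram) # TRUE if the bytes themselves are valid UTF-8
If the bytes are valid UTF-8 but the mark comes back "unknown", the C++ boundary dropped the declared encoding rather than actually re-encoding the text.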
We have chatted with several people who have run into encoding issues before, with Polish and Estonian text. It's always a bit tricky because I can never reproduce the problem locally, and I can't reproduce yours either:
library(tidytext)
library(dplyr)
df <- data_frame(
  text = "César Moreira Nuñez"
)
df %>%
unnest_tokens(word, text)
#> # A tibble: 3 x 1
#> word
#> <chr>
#> 1 césar
#> 2 moreira
#> 3 nuñez
df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2 )
#> # A tibble: 2 x 1
#> bigram
#> <chr>
#> 1 césar moreira
#> 2 moreira nuñez
You say that your code works fine on Linux, and this aligns with others' experience as well. This seems to always be a Windows encoding issue. This isn't related to the code in the tidytext package, or even the tokenizers package; from what I've seen, I suspect this is related to the C libraries in stringi and how they act on Windows compared to other platforms. Because of this, you'll likely have the same problems with anything that depends on stringi (which is practically ALL of NLP in R).
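If that is the failure mode (and this is an assumption on my part, not something I've verified), the bytes coming back may be valid UTF-8 that merely lost their declared encoding, in which case re-marking them could repair the display:
out <- df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
# Only safe if the bytes are already UTF-8 and merely mislabeled;
# if the text was actually re-encoded, this will not undo the damage.
Encoding(out$bigram) <- "UTF-8"
out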
I don't know what the problem is, but I was able to reproduce it. I can also confirm that the following works on Windows:
library(corpus)
df %>% term_counts(ngrams = 2)
#> text term count
#> 1 1 césar moreira 1
#> 2 1 moreira nuñez 1
The result here is much like that of unnest_tokens, but it aggregates by term and does not retain the other variables in df. To get results like unnest_tokens gives you, join the result back to df using the text column, something like:
y <- df %>% term_counts(ngrams = 2)
cbind(df[y$text,], y)
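A tidyverse-flavored equivalent of that cbind(), assuming y$text is a factor whose levels index the rows of df:
library(dplyr)
bind_cols(df[as.integer(y$text), ], y[c("term", "count")])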
Source: https://stackoverflow.com/questions/47715807/does-tidytextunnest-tokens-works-with-spanish-characters