I am trying to use unnest_tokens with Spanish text. It works fine with unigrams, but it breaks the special characters in bigrams. The code works fine on Linux; I have added some info on my locale below.
library(tidytext)
library(dplyr)
df <- data_frame(
  text = "César Moreira Nuñez"
)
# works ok:
df %>%
unnest_tokens(word, text)
# # A tibble: 3 x 1
# word
# <chr>
# 1 césar
# 2 moreira
# 3 nuñez
# breaks é and ñ
df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2 )
# # A tibble: 2 x 1
# bigram
# <chr>
# 1 cã©sar moreira
# 2 moreira nuã±ez
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
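For completeness, base R's l10n_info() gives the same picture; on this Windows locale I'd expect it to report that the native encoding is not UTF-8 (codepage 1252):
l10n_info()
# returns a list of flags (MBCS, UTF-8, Latin-1) plus, on Windows, the active codepage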
It seems to happen when you change the token argument to "ngrams". I'm not sure why it does that, but here is a workaround using the qlcMatrix package:
library(qlcMatrix)
splitStrings(df$text, sep = ' ', bigrams = TRUE, boundary = FALSE, bigram.binder = ' ')$bigrams
#[1] "César Moreira" "Moreira Nuñez"
Digging into the source code for tidytext, it looks like words and ngrams are split using the tokenizers package. Those functions use different methods: tokenize_words uses stri_split, whereas tokenize_ngrams uses custom C++ code. I imagine the final step, switching between R and C++ data types, garbles the diacritics, although I can't explain precisely why.
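One way to probe that guess (a diagnostic sketch, not a fix) is to check what encoding R thinks the returned strings have, and whether the raw bytes are in fact valid UTF-8:
library(stringi)
out <- df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
Encoding(out$bigram)        # declared encoding: "UTF-8", "latin1" or "unknown"
stri_enc_mark(out$bigram)   # stringi's reading of the encoding mark
stri_enc_isutf8(out$bigram) # TRUE if the bytes themselves are valid UTF-8
If the bytes are valid UTF-8 but the mark comes back "unknown", the C++ boundary dropped the declared encoding rather than actually re-encoding the text.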
We have chatted with several people who have run into encoding issues before, with Polish and Estonian text. It's always a bit tricky because I can never reproduce the problem locally, and I can't reproduce yours either:
library(tidytext)
library(dplyr)
df <- data_frame(
  text = "César Moreira Nuñez"
)
df %>%
unnest_tokens(word, text)
#> # A tibble: 3 x 1
#> word
#> <chr>
#> 1 césar
#> 2 moreira
#> 3 nuñez
df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2 )
#> # A tibble: 2 x 1
#> bigram
#> <chr>
#> 1 césar moreira
#> 2 moreira nuñez
You say that your code works fine on Linux, and this aligns with others' experience as well. This seems to always be a Windows encoding issue. This isn't related to the code in the tidytext package, or even the tokenizers package; from what I've seen, I suspect this is related to the C libraries in stringi and how they act on Windows compared to other platforms. Because of this, you'll likely have the same problems with anything that depends on stringi (which is practically ALL of NLP in R).
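If that is the failure mode (and this is an assumption on my part, not something I've verified), the bytes coming back may be valid UTF-8 that merely lost their declared encoding, in which case re-marking them could repair the display:
out <- df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
# Only safe if the bytes are already UTF-8 and merely mislabeled;
# if the text was actually re-encoded, this will not undo the damage.
Encoding(out$bigram) <- "UTF-8"
out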
I don't know what the problem is, but I was able to reproduce it. I can also confirm that the following works on Windows:
library(corpus)
df %>% term_counts(ngrams = 2)
#> text term count
#> 1 1 césar moreira 1
#> 2 1 moreira nuñez 1
The result here is much like that of unnest_tokens, but it aggregates by term and does not retain the other variables in df. To get results like unnest_tokens gives you, join the result back to df using the text column, something like:
y <- df %>% term_counts(ngrams = 2)
cbind(df[y$text,], y)
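A tidyverse-flavored equivalent of that cbind(), assuming y$text is a factor whose levels index the rows of df:
library(dplyr)
bind_cols(df[as.integer(y$text), ], y[c("term", "count")])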
Source: https://stackoverflow.com/questions/47715807/does-tidytextunnest-tokens-works-with-spanish-characters