A lemmatizing function using a hash dictionary does not work with tm package in R


I see two problems here: 1) your custom function returns a list, while it should return a vector of strings; and 2) you are passing the wrong lemma_hashmap argument.
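To see the first problem concretely (a minimal illustration, not part of the original answer): text2vec::word_tokenizer returns a list of character vectors, one per document, whereas tm's transformations expect a character vector back.

library(text2vec)

texts <- c("Abadanowi abadańczykach OutOfVocabulary",
           "abadańczyk Abadan OutOfVocabulary")

str(word_tokenizer(texts))
# a list of two character vectors (one per document), not a character vector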

A quick workaround to fix the first problem is to use paste() and sapply() before returning the function result.

lemma_tokenizer = function(x, lemma_hashmap, 
                           tokenizer = text2vec::word_tokenizer) {
  tokens_list = tokenizer(x)
  for(i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    replacements = lemma_hashmap[[tokens]]
    ind = !is.na(replacements)
    tokens_list[[i]][ind] = replacements[ind]
  }

  # paste together, return a vector
  sapply(tokens_list, (function(i){paste(i, collapse = " ")}))
}

We can run the same example from your post.

texts = c("Abadanowi abadańczykach OutOfVocabulary", 
          "abadańczyk Abadan OutOfVocabulary")
lemma_tokenizer(texts, lemma_hm)
[1] "Abadan abadańczyk OutOfVocabulary" "abadańczyk Abadan OutOfVocabulary"

Now, we can use tm_map. Just make sure to pass lemma_hm (i.e., the variable) and not "lemma_hm" (a string) as the argument.

docs <- SimpleCorpus(VectorSource(texts))
out <- tm_map(docs, (function(x) {lemma_tokenizer(x, lemma_hashmap=lemma_hm)}))
out[[1]]$content
[1] "Abadan abadańczyk OutOfVocabulary"

For Polish lemmatization, please refer to this script https://github.com/MarcinKosinski/trigeR5/blob/master/R/lematyzacja.R, which uses the polmorfologik dictionary https://github.com/MarcinKosinski/trigeR5/tree/master/dicts (stop words are also included there).

Try quanteda's dictionary() function, after creating a dictionary that maps each variant (as a dictionary value) to its lemma (as a dictionary key). Below, it looks up your values and then pastes the tokens back into a text. (If you want tokens, you do not need the final paste() operation.)

txt <-  
    "Abadan  Abadanem
Abadan  Abadanie
Abadan  Abadanowi
Abadan  Abadanu
abadańczyk  abadańczycy
abadańczyk  abadańczykach
abadańczyk  abadańczykami"

list_temp <- strsplit(readLines(textConnection(txt)), "\\s+")
list_temp2 <- lapply(list_temp, "[", 2)
names(list_temp2) <- sapply(list_temp, "[", 1)

library("quanteda")
polish_lemma_dict <- dictionary(list_temp2)
# Dictionary object with 7 key entries.
# - Abadan:
#   - abadanem
# - Abadan:
#   - abadanie
# - Abadan:
#   - abadanowi
# - Abadan:
#   - abadanu
# - abadańczyk: 
#   - abadańczycy
# - abadańczyk:
#   - abadańczykach
# - abadańczyk:
#   - abadańczykami

texts <- c("Abadanowi abadańczykach OutOfVocabulary", 
           "abadańczyk Abadan OutOfVocabulary")

The texts can now be converted into tokens, and quanteda's tokens_lookup() function used to replace the dictionary values (inflected forms) with the dictionary keys (lemmas). In the last step, I've pasted the tokens back together, which you can skip if you want tokens rather than a full text.

require(magrittr)
texts %>%
    tokens() %>%
    tokens_lookup(dictionary = polish_lemma_dict, exclusive = FALSE, capkeys = FALSE) %>%
    as.character() %>%
    paste(collapse = " ")
# [1] "Abadan abadańczyk OutOfVocabulary abadańczyk Abadan OutOfVocabulary"

Here is the complete, imperfect code in which I used the answer. Credit to many people; I listed all sources at the bottom. It is very rough, I realise, but it does the job for me, i.e. I can use a txt lemma dictionary and my stop words to classify Polish texts. Thanks to Damiano Fantini, Dmitriy Selivanov and many others.

#----1. Set up. ----
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))


library(readtext)
library(tm)
library(proxy)
library(stringi)
library(stringr)
library(hashmap)
library(data.table)
library(text2vec)

# For reading n-grams
library(RWeka) #(*)
BigramTokenizer <- 
           function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3)) #(*)


#----2. Read data. ----
stopwordsPL <- as.vector(str_split(readLines("polish.stopwords.text",encoding = "UTF-8"), pattern = " ",simplify = T))


docs <- VCorpus(DirSource(pattern="txt"))
titles <- rownames(summary(docs))

docs <- tm_map(docs, removeWords, words=stopwordsPL)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, function(x) stri_trans_general(x, "Latin-ASCII"))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

# for English texts it would be simpler
# docs <- tm_map(docs, removeWords, stopwords("english")) #can add other words to remove
# docs <- tm_map(docs, stemDocument, "english")

#====3. Lemmatize ====
# # Dictionary from http://www.lexiconista.com/datasets/lemmatization/
# lemmadict_file = "lemmatization-pl.text"
# dt = fread(file= lemmadict_file, header = F, col.names = c("lemma", "word"), data.table=T, encoding="UTF-8")
# # I threw away Polish letters; maybe changing the locale would help.
# dt$lemma <- stri_trans_general(dt$lemma, "Latin-ASCII;lower")
# dt$word <- stri_trans_general(dt$word, "Latin-ASCII;lower")
# dt <- unique(dt)
# 
# # Creating hash dictionary
# lemma_hm = hashmap(dt$word, dt$lemma)
# 
# # Test if it works
# lemma_hm[["mnozyl"]]
# # [1] "mnozyc"
# 
# save_hashmap(lemma_hm, file="lemma_hm", overwrite = TRUE, compress = TRUE)

lemma_hm <- load_hashmap(file="lemma_hm")

lemma_tokenizer = function(x, lemma_hashmap, 
                           tokenizer = text2vec::word_tokenizer) {
  tokens_list = tokenizer(x)
  for(i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    replacements = lemma_hashmap[[tokens]]
    ind = !is.na(replacements)
    tokens_list[[i]][ind] = replacements[ind]
  }
  # paste together, return a vector
  sapply(tokens_list, (function(i){paste(i, collapse = " ")}))
}

docs <- tm_map(docs, (function(x) {lemma_tokenizer(x, lemma_hashmap=lemma_hm)}))
docs <- tm_map(docs, PlainTextDocument)

#====4. Create document term matrix====

docsTDM <-
  DocumentTermMatrix(docs, control = list(wordLengths = c(5, 25),tokenize = BigramTokenizer))  #  tokenize=LemmaTokenizer, tokenize = BigramTokenizer (*)


docsTDM$dimnames

#====5. Remove sparse and common words====

docsTDM <- removeSparseTerms(docsTDM, .90)

# https://stackoverflow.com/questions/25905144/removing-overly-common-words-occur-in-more-than-80-of-the-documents-in-r

removeCommonTerms <- function (x, pct) 
{
  stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), 
            is.numeric(pct), pct > 0, pct < 1)
  # work with a TermDocumentMatrix so that rows correspond to terms
  m <- if (inherits(x, "DocumentTermMatrix")) 
    t(x)
  else x
  # keep only terms that appear in fewer than pct of the documents
  t <- table(m$i) < m$ncol * (pct)
  termIndex <- as.numeric(names(t[t]))
  if (inherits(x, "DocumentTermMatrix")) 
    x[, termIndex]
  else x[termIndex, ]
}


docsTDM <-
  removeCommonTerms(docsTDM, .8) #remove terms that are in >=80% of the documents
docsTDM$dimnames


#====6. Cluster data (hclust). ====


docsdissim <- dist(as.matrix(docsTDM), method = "cosine")

docsdissim2 <- as.matrix(docsdissim)
dim(docsdissim2)

rownames(docsdissim2) <- titles
colnames(docsdissim2) <- titles

h <- hclust(docsdissim, method = "ward.D2")

plot(h, labels = titles, sub = "")

# Library hclust with p-values (pvclust)

library(pvclust)

h_pv <- pvclust(docsdissim2, method.hclust = "ward.D2", method.dist ="correlation")

plot(h_pv)

data.frame(cutree(tree = h_pv$hclust, k = 4))


# pvclust provides two types of p-values: the AU (Approximately Unbiased) p-value and the BP (Bootstrap Probability) value.
# The AU p-value, computed by multiscale bootstrap resampling, is a better approximation to an unbiased p-value
# than the BP value computed by ordinary bootstrap resampling.
# For clusters with AU p-value > 0.95 we can assume the cluster exists and may stably be
# observed if we increase the number of observations.
# (http://stat.sys.i.kyoto-u.ac.jp/prog/pvclust/)
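
# A sketch beyond the original script: pvclust's pvrect() and pvpick() can
# highlight and extract the clusters whose AU p-value exceeds a threshold.
plot(h_pv)
pvrect(h_pv, alpha = 0.95, pv = "au")            # rectangles around clusters with AU > 0.95
pvpick(h_pv, alpha = 0.95, pv = "au")$clusters   # list of titles in each such cluster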

#==== Literature:====
# Original article:
# http://www.rexamine.com/2014/06/text-mining-in-r-automatic-categorization-of-wikipedia-articles/

# Updates to make it work after some functions became obsolete:
# https://stackoverflow.com/questions/34423823/r-automatic-categorization-of-wikipedia-articles
# https://stackoverflow.com/questions/34372166/error-using-termdocumentmatrix-and-dist-functions-in-r
#
# Based on that:
# http://brazenly.blogspot.co.uk/2015/02/r-categorization-clustering-of.html
#
# Sparse terms:
# https://stackoverflow.com/questions/28763389/how-does-the-removesparseterms-in-r-work

# Lemmatizing function:
# https://stackoverflow.com/questions/46122591/a-lemmatizing-function-using-a-hash-dictionary-does-not-work-with-tm-package-in
# https://stackoverflow.com/questions/45762559/lemmatization-using-txt-file-with-lemmes-in-r/45790325#45790325