How to implement proximity rules in a tm dictionary for counting words?


Question


Objective

I would like to count the number of times the word "love" appears in a document, but only if it isn't preceded by the word "not", e.g. "I love films" would count as one appearance whilst "I do not love films" would not count as an appearance.

Question

How would one proceed using the tm package?

R Code

Below is some self-contained code which I would like to modify to do the above. (Note: it was written against the older tm API; in current tm versions you must wrap base transformations such as tolower in content_transformer(), pass the dictionary as a plain character vector because Dictionary() has been removed, and give DataframeSource() a data frame with doc_id and text columns.)

require(tm)

# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely people in the world.", 
          "I do not love the Red Hot Chilli Peppers but I do not hate them either. I think they are OK.\n",
          "I hate the `Red Hot Chilli Peppers`!")

# convert to data.frame
my.docs.df <- data.frame(docs = my.docs, row.names = c("positiveText", "neutralText", "negativeText"), stringsAsFactors = FALSE)

# convert to a corpus
my.corpus <- Corpus(DataframeSource(my.docs.df))

# Some standard preprocessing
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)

# construct dictionary
my.dictionary.terms <- tolower(c("love", "Hate"))
my.dictionary <- Dictionary(my.dictionary.terms)

# construct the term document matrix
my.tdm <- TermDocumentMatrix(my.corpus, control = list(dictionary = my.dictionary))
inspect(my.tdm)

# Terms  positiveText neutralText negativeText
# hate            0           1            1
# love            2           1            0

Further information

I am trying to reproduce the dictionary rules functionality from the commercial package WordStat. It is able to make use of dictionary rules i.e.

"hierarchical content analysis dictionaries or taxonomies composed of words, word patterns, phrases as well as proximity rules (such as NEAR, AFTER, BEFORE) for achieving precise measurement of concepts"

Also I noticed this interesting SO question: Open-source rule-based pattern matching / information extraction frameworks?


UPDATE 1: Based on @Ben's comment and post, I got this (although slightly different at the end, it is strongly inspired by his answer, so full credit to him)

require(data.table)
require(RWeka)

# uni- and bi-gram tokeniser (min = 1 keeps the single words, so plain 'love' is still counted)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

# get all 1-gram and 2-gram word counts
tdm <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))

# convert to data.table
dt <- as.data.table(as.data.frame(as.matrix(tdm)), keep.rownames=TRUE)
setkey(dt, rn)

# attempt at extracting, but it includes overlaps, i.e. words counted twice
dt[like(rn, "love")]
#            rn positiveText neutralText negativeText
# 1:     i love            1           0            0
# 2:       love            2           1            0
# 3: love peopl            1           0            0
# 4:   love the            1           1            0
# 5:  most love            1           0            0
# 6:   not love            0           1            0

Then I guess I would need to do some row subsetting and row subtraction, which would lead to something like

dt1 <- dt["love"]
#     rn positiveText neutralText negativeText
#1: love            2           1            0

dt2 <- dt[like(rn, "love") & like(rn, "not")]
#         rn positiveText neutralText negativeText
#1: not love            0           1            0

# somehow do something like
# DT <- dt1 - dt2
# but I can't work out how to code that; the required output would be
#     rn positiveText neutralText negativeText
#1: love            2           0            0

I don't know how to get that last line using data.table, but this approach would be akin to WordStat's 'NOT NEAR' dictionary function, e.g. in this case only count the word "love" if it doesn't appear within one word directly before or directly after the word "not".
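
One possible way to get that last line is a minimal sketch using data.table's set(), assuming dt1 and dt2 as built above (it subtracts dt2's per-document totals from dt1's counts):

# sketch: subtract the per-document totals of the 'not love' bigrams
# from the plain 'love' counts
doc.cols <- setdiff(names(dt1), "rn")
DT <- copy(dt1)
for (j in doc.cols) {
  set(DT, j = j, value = DT[[j]] - sum(dt2[[j]]))
}
DT
#      rn positiveText neutralText negativeText
# 1: love            2           0            0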

If we were to use an m-gram tokeniser, then it would be like saying we only count the word "love" if it doesn't appear within (m-1) words either side of the word "not".
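
For comparison, that window rule can also be written directly in base R, without any n-gram tokenising. Below is a sketch; count_not_near is a hypothetical helper, not a tm or WordStat function:

# sketch: count 'target' only when 'blocker' does not occur within
# w words on either side of it
count_not_near <- function(txt, target = "love", blocker = "not", w = 1) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", txt)), "\\s+")[[1]]
  words <- words[nzchar(words)]   # drop empty tokens from leading spaces
  hits  <- which(words == target)
  ok <- vapply(hits, function(i) {
    window <- words[max(1, i - w):min(length(words), i + w)]
    !blocker %in% window
  }, logical(1))
  sum(ok)
}

count_not_near("I love films")          # 1
count_not_near("I do not love films")   # 0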

Other approaches are most welcome!


Answer 1:


This is an interesting question about collocation extraction, which doesn't seem to be built into any packages (except this one, though it's not on CRAN or GitHub), despite how popular it is in corpus linguistics. I think this code will answer your question, but there might be a more general solution.

Here's your example (thanks for the easy-to-use example)

##############
require(tm)

# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely people in the world.", 
             "I do not `love` the Red Hot Chilli Peppers but I do not hate them either. I think they are OK.\n",
             "I hate the `Red Hot Chilli Peppers`!")

# convert to data.frame
my.docs.df <- data.frame(docs = my.docs, row.names = c("positiveText", "neutralText", "negativeText"), stringsAsFactors = FALSE)

# convert to a corpus
my.corpus <- Corpus(DataframeSource(my.docs.df))

# Some standard preprocessing
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, removePunctuation)
# 'not' is a stopword so let's not remove that
# my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)

# construct dictionary - not used in this case
# my.dictionary.terms <- tolower(c("love", "Hate"))
# my.dictionary <- Dictionary(my.dictionary.terms)

Here's my suggestion: make a term-document matrix of bigrams and subset it

# tokenizer for bigrams, passed on to the term-document matrix constructor
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))
inspect(txtTdmBi)

# find bigrams that have 'love' in them
love_bigrams <- txtTdmBi$dimnames$Terms[grep("love", txtTdmBi$dimnames$Terms)]

# keep only bigrams where 'love' is not the first word,
# to avoid counting 'love' twice and so we can subset
# based on the preceding word
require(Hmisc)
love_bigrams <- love_bigrams[sapply(love_bigrams, function(i) first.word(i)) != 'love']
# exclude the specific bigram 'not love'
love_bigrams <- love_bigrams[!love_bigrams == 'not love']

And here's the result: we get a count of 2 for 'love', which excludes the 'not love' bigram.

# inspect the results
inspect(txtTdmBi[love_bigrams])

A term-document matrix (2 terms, 3 documents)

Non-/sparse entries: 2/4
Sparsity           : 67%
Maximal term length: 9 
Weighting          : term frequency (tf)

           Docs
Terms       positiveText neutralText negativeText
  i love               1           0            0
  most love            1           0            0

# get counts of 'love' (excluding 'not love')
colSums(as.matrix(txtTdmBi[love_bigrams]))
positiveText  neutralText negativeText 
           2            0            0 
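
If you want to reuse this rule for other terms and blockers, the same subsetting logic can be wrapped in a small helper. This is a sketch, and count_excluding is a hypothetical name, not a tm function:

# sketch: per-document counts of 'target', keeping only bigrams where
# 'target' is the second word and the first word is not a blocker
count_excluding <- function(tdm_bi, target, blockers) {
  m <- as.matrix(tdm_bi)
  terms <- rownames(m)
  keep <- grepl(paste0("^\\S+ ", target, "$"), terms) &
          !(sub(" .*$", "", terms) %in% blockers)
  colSums(m[keep, , drop = FALSE])
}

count_excluding(txtTdmBi, "love", "not")
# positiveText  neutralText negativeText 
#            2            0            0 

Note that, like the subsetting above, this only counts occurrences of the target that have a preceding word, so a document-initial 'love' would be missed.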



Answer 2:


This sounds to me like polarity. While I'm not going to answer the exact question you asked, I may be answering your larger question about the polarity of sentences. I have implemented the polarity function in qdap version 1.2.0, which can do this, but saving all the intermediate material you're asking for would have slowed the function down too much.

library(qdap)

# split the documents into sentences (var1 holds the document names)
df <- sentSplit(matrix2df(my.docs.df), "docs")

# build a polarity frame from the terms actually found in the documents
pols <- list(positives = "love", negatives = "hate")
pols2 <- lapply(pols, function(x) term_match(df$docs, x, FALSE))
POLENV <- polarity_frame(positives = pols2[[1]], negatives = pols2[[2]])

# the polarity frame can also be applied to the tm corpus directly
out <- apply_as_df(my.corpus, polarity, polarity.frame = POLENV)
lview(my.corpus)

output <- with(df, polarity(docs, var1, polarity.frame = POLENV))
counts(output)[, 1:5]

## > counts(output)[, 1:5]
##           var1 wc   polarity pos.words neg.words
## 1 positiveText  7  0.3779645      love         -
## 2 positiveText  9  0.3333333    lovely         -
## 3  neutralText 16  0.0000000      love      hate
## 4  neutralText  5  0.0000000         -         -
## 5 negativeText  7 -0.3779645         -      hate

data.frame(scores(output))[, 1:4]

##           var1 total.sentences total.words ave.polarity
## 1 negativeText               1           7   -0.3779645
## 2  neutralText               2          21    0.0000000
## 3 positiveText               2          16    0.3556489


Source: https://stackoverflow.com/questions/17979104/how-to-implement-proximity-rules-in-tm-dictionary-for-counting-words
