Question
I am interested in finding similar content (text) based on paraphrasing. How do I do this? Are there any specific tools that can do this? In Python, preferably.
Answer 1:
I believe the tool you are looking for is Latent Semantic Analysis.
Given that my post is going to be quite lengthy, I'm not going to go into much detail explaining the theory behind it. If you think that it is indeed what you are looking for, then I recommend you look it up. A good place to start would be here:
http://staff.scm.uws.edu.au/~lapark/lt.pdf
In summary, LSA attempts to uncover the underlying / latent meaning of words and phrases, based on the assumption that similar words appear in similar documents. I'll be using R to demonstrate how it works.
I'm going to set up a function that retrieves similar documents based on their latent meaning:
# Setting up all the needed functions:
SemanticLink = function(text,expression,LSAS,n=length(text),Out="Text"){

  # Query Vector: fold the phrase into the LSA space and score every
  # document against it with the cosine measure
  LookupPhrase = function(phrase,LSAS){
    lsatm = as.textmatrix(LSAS)
    QV = function(phrase){
      q = query(phrase,rownames(lsatm))
      t(q)%*%LSAS$tk%*%diag(LSAS$sk)
    }
    q = QV(phrase)
    qd = 0
    for (i in 1:nrow(LSAS$dk)){
      qd[i] <- cosine(as.vector(q),as.vector(LSAS$dk[i,]))
    }
    qd
  }

  # Handling Synonyms: expand a word into its qdap synonym list
  Syns = function(word){
    wl = gsub("(.*[[:space:]].*)","",
         gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","",
         unlist(strsplit(PlainTextDocument(synonyms(word)),","))))
    wl = wl[wl!=""]
    return(wl)
  }

  # Expand the query expression with synonyms, stem it, and score the documents
  ex = unlist(strsplit(expression," "))
  for(i in seq(ex)){ex = c(ex,Syns(ex[i]))}
  ex = unique(wordStem(ex))
  cache = LookupPhrase(paste(ex,collapse=" "),LSAS)

  if(Out=="Text"){return(text[which(match(cache,sort(cache,decreasing=T)[1:n])!="NA")])}
  if(Out=="ValuesSorted"){return(sort(cache,decreasing=T)[1:n])}
  if(Out=="Index"){return(which(match(cache,sort(cache,decreasing=T)[1:n])!="NA"))}
  if(Out=="ValuesUnsorted"){return(cache)}
}
Note that we make use of synonyms here when assembling our query vector. This approach isn't perfect, because some of the synonyms in the qdap library are remote at best... This may interfere with your search query, so to achieve more accurate but less generalizable results, you can simply drop the synonyms bit and manually select all the relevant terms that make up your query vector.
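For instance, here is a minimal sketch of that manual route (not part of the original function): it reuses the fold-in step from LookupPhrase above, but takes a hand-picked vector of terms instead of expanding the expression with synonyms. The terms in my.terms are made up for illustration.

QueryVectorManual = function(terms, LSAS){
  lsatm = as.textmatrix(LSAS)
  terms = unique(wordStem(terms))                           # stem to match the term-document matrix
  q = query(paste(terms, collapse = " "), rownames(lsatm))  # no synonym expansion here
  t(q) %*% LSAS$tk %*% diag(LSAS$sk)                        # fold the query into the LSA space
}
# my.terms = c("support", "trade", "export")   # hypothetical hand-picked terms
# qv = QueryVectorManual(my.terms, lsaSpace)   # lsaSpace is built further down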
Let's try it out. I'll also be using the US Congress dataset from the package RTextTools:
library(tm)
library(RTextTools)
library(lsa)
library(data.table)
library(stringr)
library(qdap)
data(USCongress)
text = as.character(USCongress$text)
corp = Corpus(VectorSource(text))
parameters = list(minDocFreq = 1,
                  wordLengths = c(2,Inf),
                  tolower = TRUE,
                  stripWhitespace = TRUE,
                  removeNumbers = TRUE,
                  removePunctuation = TRUE,
                  stemming = TRUE,
                  stopwords = TRUE,
                  tokenize = NULL,
                  weighting = function(x) weightSMART(x,spec="ltn"))
tdm = TermDocumentMatrix(corp,control=parameters)
tdm.reduced = removeSparseTerms(tdm,0.999)
# setting up LSA space - this may take a little while...
td.mat = as.matrix(tdm.reduced)
td.mat.lsa = lw_bintf(td.mat)*gw_idf(td.mat) # you can experiment with weightings here
lsaSpace = lsa(td.mat.lsa,dims=dimcalc_raw()) # you don't have to keep all dimensions
lsa.tm = as.textmatrix(lsaSpace)
l = 50
exp = "support trade"
SemanticLink(text,exp,n=5,lsaSpace,Out="Text")
[1] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small businesses, and for other purposes."
[2] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel AJ."
[3] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the yacht EXCELLENCE III."
[4] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel M/V Adios."
[5] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small business, and for other purposes."
As you can see, while "support trade" may not appear verbatim in the results above, the function has retrieved a set of documents that are relevant to the query. The function is designed to retrieve documents with semantic linkages rather than exact matches.
We can also see how "close" these documents are to the query vector by plotting the cosine similarities:
plot(1:l,SemanticLink(text,exp,lsaSpace,n=l,Out="ValuesSorted"),
     type="b",pch=16,col="blue",
     main=paste("Query Vector Proximity",exp,sep=" "),
     xlab="observations",ylab="Cosine")
I don't have enough reputation yet to produce the plot though, sorry.
As you would see, the first 2 entries appear to be more strongly associated with the query vector than the rest (there are about 5 that are particularly relevant, though), even though reading through them you wouldn't have thought so. I would say that this is the effect of using synonyms to build the query vector. That aside, the graph shows us how many other documents are at least remotely similar to the query vector.
EDIT:
Just recently, I had to solve the same problem you are trying to solve, but the above function just wouldn't work well, simply because the data was atrocious: the texts were short, there wasn't much data, and not many topics were covered. So to find relevant entries in such situations, I've developed another function that is based purely on regular expressions.
Here it goes:
HLS.Extract = function(pattern,text=active.text){

  require(qdap)
  require(tm)
  require(RTextTools)

  p = unlist(strsplit(pattern," "))
  p = unique(wordStem(p))
  p = gsub("(.*)i$","\\1y",p)

  # Expand a word into its qdap synonym list
  Syns = function(word){
    wl = gsub("(.*[[:space:]].*)","",
         gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","",
         unlist(strsplit(PlainTextDocument(synonyms(word)),","))))
    wl = wl[wl!=""]
    return(wl)
  }

  # Cut a word back towards its root and turn it into a wildcard pattern
  trim = function(x){
    temp_L = nchar(x)
    if(temp_L < 5) {N = 0}
    if(temp_L > 4 && temp_L < 8) {N = 1}
    if(temp_L > 7 && temp_L < 10) {N = 2}
    if(temp_L > 9) {N = 3}
    x = substr(x,0,nchar(x)-N)
    x = gsub("(.*)","\\1\\\\\\w\\*",x)
    return(x)
  }

  # SINGLE WORD SCENARIO
  if(length(p)<2){

    # EXACT
    p = trim(p)
    ndx_exact = grep(p,text,ignore.case=T)
    text_exact = text[ndx_exact]

    # SEMANTIC
    p = unlist(strsplit(pattern," "))
    express = new.exp = list()
    express = c(p,Syns(p))
    p = unique(wordStem(express))
    temp_exp = unlist(strsplit(express," "))
    temp.p = double(length(seq(temp_exp)))
    for(j in seq(temp_exp)){
      temp_exp[j] = trim(temp_exp[j])
    }
    rgxp = paste(temp_exp,collapse="|")
    ndx_s = grep(paste(temp_exp,collapse="|"),text,ignore.case=T,perl=T)
    text_s = as.character(text[ndx_s])

    f.object = list("ExactIndex" = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText" = text_exact,
                    "SemanticText" = text_s)
  }

  # TWO OR MORE WORDS
  if(length(p)>1){

    require(combinat)

    # EXACT
    for(j in seq(p)){p[j] = trim(p[j])}
    fp = factorial(length(p))
    pmns = permn(length(p))
    tmat = matrix(0,fp,length(p))
    permut = double(fp)
    temp = double(length(p))
    for(i in 1:fp){
      tmat[i,] = pmns[[i]]
    }
    # Build every word-order permutation of the (trimmed) pattern
    for(i in 1:fp){
      for(j in seq(p)){
        temp[j] = paste(p[tmat[i,j]])
      }
      permut[i] = paste(temp,collapse=" ")
    }
    # Allow a few short filler words between the pattern words
    permut = gsub("[[:space:]]",
             "[[:space:]]+([[:space:]]*\\\\w{,3}[[:space:]]+)*(\\\\w*[[:space:]]+)?([[:space:]]*\\\\w{,3}[[:space:]]+)*",permut)
    ndx_exact = grep(paste(permut,collapse="|"),text)
    text_exact = as.character(text[ndx_exact])

    # SEMANTIC
    p = unlist(strsplit(pattern," "))
    express = list()
    charexp = permut = double(length(p))
    for(i in seq(p)){
      express[[i]] = c(p[i],Syns(p[i]))
      express[[i]] = unique(wordStem(express[[i]]))
      express[[i]] = gsub("(.*)i$","\\1y",express[[i]])
      for(j in seq(express[[i]])){
        express[[i]][j] = trim(express[[i]][j])
      }
      charexp[i] = paste(express[[i]],collapse="|")
    }
    charexp = gsub("(.*)","\\(\\1\\)",charexp)
    charexpX = double(length(p))
    for(i in 1:fp){
      for(j in seq(p)){
        temp[j] = paste(charexp[tmat[i,j]])
      }
      permut[i] = paste(temp,collapse=
        "[[:space:]]+([[:space:]]*\\w{,3}[[:space:]]+)*(\\w*[[:space:]]+)?([[:space:]]*\\w{,3}[[:space:]]+)*")
    }
    rgxp = paste(permut,collapse="|")
    ndx_s = grep(rgxp,text,ignore.case=T)
    text_s = as.character(text[ndx_s])

    temp.f = function(x){
      if(length(x)==0){x=0}
    }
    temp.f(ndx_exact); temp.f(ndx_s)
    temp.f(text_exact); temp.f(text_s)

    f.object = list("ExactIndex" = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText" = text_exact,
                    "SemanticText" = text_s,
                    "Synset" = express)
  }

  # Report the match counts, then hand back the result object
  cat(paste("Exact Matches:",length(ndx_exact),sep=""))
  cat(paste("\n"))
  cat(paste("Semantic Matches:",length(ndx_s),sep=""))

  return(f.object)
}
Trying it out:
HLS.Extract("buy house",
c("we bought a new house",
"I'm thinking about buying a new home",
"purchasing a brand new house"))[["SemanticText"]]
$SemanticText
[1] "I'm thinking about buying a new home" "purchasing a brand new house"
As you can see, the function is quite flexible. It would also pick up "home buying". It didn't pick up "we bought a new house" though, because "bought" is an irregular verb - it's the kind of thing that LSA would have picked up.
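One way to soften the irregular-verb issue (just a sketch, not part of the original answer, and it assumes the textstem package) is to lemmatize the pattern and the text before matching, so that "bought" should map back to "buy":

library(textstem)
lemmatize_words(c("bought","buying"))        # should both come back as "buy"
lemmatize_strings("we bought a new house")   # lemmatized text to feed into HLS.Extract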
So you may want to try both and see which one works better. Note that SemanticLink also requires a ton of memory, so with a particularly large corpus you may not be able to use it.
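If memory is the limiting factor, one thing worth trying (a sketch building on the "you don't have to keep all dimensions" comment above, not something from the original answer) is to keep only part of the LSA space:

# dimcalc_share() keeps the leading dimensions that account for a given share
# of the singular values (here 50%), giving a much smaller space than dimcalc_raw()
lsaSpace = lsa(td.mat.lsa, dims = dimcalc_share(share = 0.5))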
Cheers
Answer 2:
I recommend reading the answers to this question; the first two answers in particular are really good.
I can also recommend the Natural Language Toolkit (NLTK), though I haven't personally tried it.
Answer 3:
For similarity between news articles, you could extract keywords using part of speech tagging. NLTK provides a good POS tagger. Using nouns and noun phrases as keywords, represent each news article as a keyword vector.
Then use cosine similarity or some such text similarity measure to quantify similarity.
Further enhancements include handling synonyms, word stemming, handling adjectives if required, using TF-IDF as keyword weights in the vector, etc.
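Since the rest of this thread uses R, here is a minimal sketch of that pipeline in R instead of NLTK (the three example articles are invented, and the POS-based keyword filtering step is left out): build TF-IDF vectors with tm and compare them with lsa::cosine.

library(tm)
library(lsa)

articles = c("Stocks rally as trade talks resume",
             "Trade negotiations lift the stock market",
             "City council approves new stadium downtown")
corp = Corpus(VectorSource(articles))
tdm  = TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                               stopwords = TRUE,
                                               stemming = TRUE,
                                               weighting = weightTfIdf))
m = as.matrix(tdm)
cosine(m[,1], m[,2])   # articles 1 and 2 share the "stock"/"trade" terms
cosine(m[,1], m[,3])   # articles 1 and 3 share no terms, so this is 0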
Source: https://stackoverflow.com/questions/21206048/find-similar-texts-based-on-paraphrase-detection