Question:
I have a bunch of texts about programming in Markdown format. There is a build process that can convert those texts into Word/HTML and also perform simple validation, such as spell checking or checking whether a document has the required header structure. I would like to extend that build code to also check for copy-pasted or similar chunks across all the texts.
Is there any existing Java/Groovy library that can help me with that analysis?
My first idea was to use PMD's CopyPasteDetector, but it is heavily oriented towards analysing real code, and I don't see how I can use it to analyse plain text.
Answer 1:
You might want to try Dude, my own quick-and-dirty duplication detector for text files. Besides giving you a quick estimate of how much is shared between two text files, it can also detect copying across a set of files and draw a nice graph of the sharing relations.
Answer 2:
I ended up using CPD and Groovy after all. Here is the code, in case someone is interested:
import net.sourceforge.pmd.cpd.Tokens
import net.sourceforge.pmd.cpd.TokenEntry
import net.sourceforge.pmd.cpd.Tokenizer
import net.sourceforge.pmd.cpd.CPDNullListener
import net.sourceforge.pmd.cpd.MatchAlgorithm
import net.sourceforge.pmd.cpd.SourceCode
import net.sourceforge.pmd.cpd.SourceCode.StringCodeLoader
import net.sourceforge.pmd.cpd.SimpleRenderer
// Prepare empty token data.
TokenEntry.clearImages()
def tokens = new Tokens()
// List all source files with text.
def source = new TreeMap<String, SourceCode>()
new File('.').eachFile { file ->
    if (file.isFile() && file.name.endsWith('.txt')) {
        def analyzedText = file.text
        def sourceCode = new SourceCode(new StringCodeLoader(analyzedText, file.name))
        source.put(sourceCode.fileName, sourceCode)
        analyzedText.eachLine { line, lineNumber ->
            line.split('[\\W\\s\\t\\f]+').each { token ->
                token = token.trim()
                if (token) {
                    tokens.add(new TokenEntry(token, sourceCode.fileName, lineNumber + 1))
                }
            }
        }
        tokens.add(TokenEntry.getEOF())
    }
}
// Run matching algorithm.
def maxTokenChain = 15
def matchAlgorithm = new MatchAlgorithm(source, tokens, maxTokenChain, new CPDNullListener())
matchAlgorithm.findMatches()
// Produce report.
matchAlgorithm.matches().each { match ->
    println " ========================================"
    match.iterator().each { mark ->
        println " DUPLICATION ERROR: <${mark.tokenSrcID}:${mark.beginLine}> [DUPLICATION] Found a ${match.lineCount} line (${match.tokenCount} tokens) duplication!"
    }
    def indentedTextSlice = ""
    match.sourceCodeSlice.eachLine { line ->
        indentedTextSlice += " $line\n"
    }
    println " ----------------------------------------"
    println indentedTextSlice
    println " ========================================"
}
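If the goal is to fail the build rather than just print a report, a minimal follow-up sketch (my own addition, assuming matchAlgorithm.matches() returns a fresh iterator on each call) could look like this:

// Sketch: turn the duplication report into a build failure.
// Assumes matches() can be iterated again after the report loop above.
def duplicationCount = matchAlgorithm.matches().size()
if (duplicationCount > 0) {
    throw new IllegalStateException(
        "Found ${duplicationCount} duplicated text block(s) of at least ${maxTokenChain} tokens")
}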
Answer 3:
You can start with a simple implementation of the Longest Common Substring (LCS) algorithm for two strings. There are Java implementations of it available online.
Next, you can look at suffix arrays and at genetics and string algorithms.
See also Longest Common Substring in a big text.
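For illustration, here is a minimal Groovy sketch of the classic dynamic-programming Longest Common Substring for two strings (the method name and the sample inputs are mine, not from the answer):

// O(n*m) dynamic programming: table[i][j] holds the length of the common
// substring ending at a[i-1] and b[j-1]; the longest such run is tracked as we go.
String longestCommonSubstring(String a, String b) {
    int[][] table = new int[a.length() + 1][b.length() + 1]
    int best = 0
    int endInA = 0
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) {
                table[i][j] = table[i - 1][j - 1] + 1
                if (table[i][j] > best) {
                    best = table[i][j]
                    endInA = i
                }
            }
        }
    }
    return a.substring(endInA - best, endInA)
}

assert longestCommonSubstring('copy-pasted text', 'another pasted text') == 'pasted text'

This is only practical for pairwise comparison of short texts; for many or large documents the suffix-array approaches mentioned above scale much better.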
Source: https://stackoverflow.com/questions/17504560/detecting-copied-or-similar-text-blocks