Similar code detector

Posted by 二次信任 on 2019-12-31 10:36:37

Question


I'm searching for a tool that can compare source code for similarity.

We have a very trivial system right now that produces a huge number of false positives, and the real positives easily get buried among them.

My requirements are:

  • a reasonably small number of false positives
  • a good detection rate (yes, these two work against each other)
  • ideally with a more complex output than just a single value
  • usable for C (C99) and C++ (C++03 and optimally C++11)
  • still maintained
  • usable for comparing two source files against each other
  • usable in non-interactive mode

EDIT:

To avoid confusion, the following two code snippets are equivalent and should be detected as such:

for (int i = 0; i < 10; i++) { bla; }

int i = 0; while (i < 10) { bla; i++; }

The same here:

int x = 10; y = x + 5;

int a = 10; y = a + 5;


Answer 1:


I've used MOSS in the past (http://theory.stanford.edu/~aiken/moss/) to detect plagiarized code. Since it compares the structure of the code rather than its raw text, it should detect situations like the ones you presented above. The tool is language-aware, so comments are not considered in the analysis, and it goes a long way toward detecting code that has been modified through simple search-and-replace of variable and/or function names.
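To illustrate why a token-level comparison survives renaming and comment changes, here is a minimal sketch. It is not MOSS's actual implementation. The helper names (`normalize`, `kgrams`, `similarity`), the abbreviated keyword list, and the choice of `k=4` are assumptions made for this example. It strips comments, collapses every non-keyword identifier to a placeholder, and compares the sets of token k-grams.

```python
import re

# Abbreviated C keyword list (an assumption for this sketch); every other
# identifier is collapsed to the placeholder "ID".
C_KEYWORDS = {
    "int", "char", "float", "double", "void", "for", "while", "if",
    "else", "return", "do", "break", "continue", "struct",
}

# Identifiers, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def strip_comments(source):
    """Remove /* ... */ and // ... comments (string literals are not handled)."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", " ", source)


def normalize(source):
    """Tokenize the source and replace non-keyword identifiers with "ID"."""
    tokens = []
    for tok in TOKEN_RE.findall(strip_comments(source)):
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in C_KEYWORDS:
            tokens.append("ID")
        else:
            tokens.append(tok)
    return tokens


def kgrams(tokens, k=4):
    """Return the set of all runs of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a, b, k=4):
    """Jaccard similarity of the two sources' token k-gram sets."""
    ga, gb = kgrams(normalize(a), k), kgrams(normalize(b), k)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


# The renamed-variable pair from the question normalizes to the same tokens.
print(similarity("int x = 10; y = x + 5;", "int a = 10; y = a + 5;"))  # 1.0
```

A purely token-based comparison like this one scores the `for`/`while` pair from the question below 1.0, because the two loops produce different token sequences. Catching that kind of rewrite needs a deeper structural comparison.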

Note: I used the tool a few years ago when I taught computer science in grad school, and it worked wonderfully in detecting code that had been yanked from the internet. Here is a well-documented account of a similar application: http://fie2012.org/sites/fie2012.org/history/fie99/papers/1110.pdf

If you google "measure software similarity", you should find a few more useful hits: http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/detectiontools_sourcecode.html




Answer 2:


In computer science terminology, your problem may be stated as source code plagiarism detection. A good start would be to read this article on Dr. Dobb's: "Detecting Source-Code Plagiarism". It lists algorithms for detecting plagiarism in source code.

Note: What you have asked for is indeed a tough computing problem :)




Answer 3:


Maybe the Copy/Paste Detector (CPD) from PMD?




Answer 4:


You could try Duplo. It finds common lines. It has some ability to ignore whitespace changes, but it doesn't detect code with renamed variables, so it is more of a cleanup aid than a help in detecting plagiarism.
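The trade-off described above can be shown with a small line-based sketch. This is not Duplo's code. The helper names `normalized_lines` and `common_lines` are hypothetical, and stripping all whitespace is a deliberately crude normalization chosen for brevity.

```python
import re


def normalized_lines(source):
    """Strip all whitespace from each line and drop lines that end up empty."""
    lines = []
    for line in source.splitlines():
        compact = re.sub(r"\s+", "", line)
        if compact:
            lines.append(compact)
    return lines


def common_lines(a, b):
    """Return the normalized lines of a that also occur in b, in a's order."""
    seen_in_b = set(normalized_lines(b))
    return [line for line in normalized_lines(a) if line in seen_in_b]


original = "int x = 10;\ny = x + 5;\n"
reindented = "int  x=10;\n\n    y = x+5;\n"
renamed = "int a = 10;\ny = a + 5;\n"

print(common_lines(original, reindented))  # ['intx=10;', 'y=x+5;']
print(common_lines(original, renamed))     # []
```

Whitespace changes leave every line matching, while a single variable rename removes all matches. That is why a line-based tool is better suited to finding copy-pasted code within one code base than to detecting plagiarism.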




Answer 5:


I've started using JPlag (https://github.com/jplag/jplag) to check code similarity and compare students' work in Java and plain-text files. It works well at detecting identical code structure and variable substitution.




Answer 6:


(This response is late, but the question's relevance never goes away.)

I faced a similar problem and wrote a web-based application.

https://jefferey-cave.gitlab.io/miss/

I was teaching in JavaScript and Python, so those are the languages it handles. It does not handle C/C++ (currently). I'd be curious to see how the JavaScript interpreter handles C.

It is available on GitLab.


The problem I faced was that it was illegal to submit student code across international boundaries (so MOSS was forbidden), so I needed something that would run locally. The implementation runs purely client-side in the browser.

I found it more useful in determining group dynamics in the classroom (who is working/studying with whom).

It has some fun live graphics, so it was useful to show to an Undergrad class after they submitted their first assignment. There was always a high degree of similarity in the first assignment, so no harm in demonstrating it live (with the submission names anonymized).

I always tell the story of the student I thought was (grossly and blatantly) cheating. Their work showed remarkable similarity to another student's highly unusual answer. Comparing the student's work with the rest of the class showed no significant similarity to anyone else's. This led to a deeper investigation of the submission ... it turned out there had been a tutorial, and the style showed through, but the work was unique.

Nothing happened, and those students never knew how close they came.



Source: https://stackoverflow.com/questions/10912349/similar-code-detector
