I need a tool to find duplicate or similar blocks of text in a single text file or a set of text files

甜味超标 2021-02-06 13:49

I want to automate moving duplicate or similar C code into functions.

This must work under Linux.

6 Answers
  • 2021-02-06 13:53

    See CloneDR, a tool for finding exact-copy and near-miss (copy-paste-edit) clones in source code. It uses full language parsers so that it can find clones according to the language structure, minimizing false positives, and so that it is completely independent of how the code is commented or formatted, thereby maximizing true detection. CloneDR will find clones even when the cloned block has changed variables or inserted statements or blocks of code.

    It has language front ends for C, C++, COBOL, C#, Java, PHP and a number of other languages.

    You can see sample clone detection reports at the website.

  • 2021-02-06 14:00

    Simian (noted earlier) is a good tool for this. I have been using CloneDetective on my project and it works great. CloneDetective is free, so it can't hurt to give it a try.

  • 2021-02-06 14:01

    A subset of your problem: Detecting duplicate code:

    Try: PMD

    Duplicate code can be hard to find, especially in a large project. But PMD's Copy/Paste Detector (CPD) can find it for you! CPD has been through three major incarnations:

    • First we wrote it using a variant of Michael Wise's Greedy String Tiling algorithm (our variant is described here)
    • Then it was completely rewritten by Brian Ewins using the Burrows-Wheeler transform
    • Finally, it was rewritten by Steve Hawkins to use the Karp-Rabin string matching algorithm.
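
To make the last bullet concrete, here is a minimal sketch of Karp-Rabin-style duplicate detection over a token stream. It is not CPD's actual implementation; the function name `find_duplicate_windows`, the whitespace tokenization in the usage note, and the `window`, `base` and `mod` parameters are all assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicate_windows(tokens, window=4, base=257, mod=(1 << 61) - 1):
    """Return sorted groups of start indices whose `window`-token runs match.

    A rolling hash lets each window's hash be derived from the previous one
    in constant time. Candidates that share a hash are then compared directly
    so that hash collisions are never reported as duplicates.
    """
    n = len(tokens)
    if window <= 0 or n < window:
        return []

    # Map each distinct token to a small positive integer (illustrative encoding).
    ids = {}
    codes = [ids.setdefault(tok, len(ids) + 1) for tok in tokens]

    high = pow(base, window - 1, mod)  # weight of the token leaving the window
    h = 0
    for c in codes[:window]:
        h = (h * base + c) % mod

    buckets = defaultdict(list)
    buckets[h].append(0)
    for i in range(1, n - window + 1):
        # Remove codes[i - 1], shift by one position, append the new token.
        h = ((h - codes[i - 1] * high) * base + codes[i + window - 1]) % mod
        buckets[h].append(i)

    groups = []
    for starts in buckets.values():
        if len(starts) < 2:
            continue
        # Verification step: split the bucket by actual window content.
        by_content = defaultdict(list)
        for s in starts:
            by_content[tuple(tokens[s:s + window])].append(s)
        groups.extend(g for g in by_content.values() if len(g) > 1)
    return sorted(groups)
```

For example, `find_duplicate_windows("a = b + c ; x = 1 ; a = b + c ;".split())` groups the start positions of the repeated `a = b + c ;` statement. A real tool would use a proper lexer and merge overlapping windows into maximal duplicated regions.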

    ...

    Note that CPD works with Java, JSP, C, C++, Fortran and PHP code.

  • 2021-02-06 14:10

    https://github.com/hudayou/fib

    Tool to find identical code blocks in a file or directory.

  • 2021-02-06 14:15

    Be aware that you can't just compare lines of text. You will have to parse the code; that way, you can also detect segments that are semantically equivalent but use differently named identifiers.

    For example, given two functions that are equivalent but use different identifiers, a text search will not see them as identical, but a parser can.
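
A rough sketch of that idea, assuming a regex tokenizer in place of a real parser: identifiers are renamed to positional placeholders before comparison. The names `C_KEYWORDS`, `TOKEN_RE` and `normalize` are hypothetical, the keyword set is deliberately incomplete, and comments, string literals and preprocessor directives are not handled.

```python
import re

# Small, incomplete set of C keywords (an assumption for this sketch).
C_KEYWORDS = {
    "int", "char", "float", "double", "void", "return", "if", "else",
    "for", "while", "do", "break", "continue", "struct", "sizeof",
}

# Identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(code):
    """Tokenize `code` and rename identifiers to ID0, ID1, ... by first use."""
    names = {}
    out = []
    for tok in TOKEN_RE.findall(code):
        if re.match(r"[A-Za-z_]", tok) and tok not in C_KEYWORDS:
            tok = names.setdefault(tok, f"ID{len(names)}")
        out.append(tok)
    return out


f1 = "int sum(int a, int b) { return a + b; }"
f2 = "int add(int x, int y) { return x + y; }"
f3 = "int sub(int x, int y) { return x - y; }"

print(f1 == f2)                        # plain text comparison
print(normalize(f1) == normalize(f2))  # comparison after renaming identifiers
print(normalize(f1) == normalize(f3))  # a different operator still differs
```

Here `f1` and `f2` differ as text but produce the same normalized token list, while `f3` does not. This only approximates what a parser-based tool does, since it knows nothing about scopes, types or statement structure.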

    Also note that writing a C++ parser is not a trivial task, even when given the grammar. I suggest following the advice of others and seeking out an existing tool for this. Also search for refactoring tools.

  • 2021-02-06 14:16

    You'll want to take a look at Simian. It's free for noncommercial projects. Try something like:

    # Find all C source files and identify similar or duplicate code.
    # The patterns are quoted so the shell does not expand them before Simian sees them.
    simian "-includes=**/*.c" "-excludes=**/*_test.c"
    