Suffix tree library for C++ with simple examples of how to use it

Posted by 时光毁灭记忆、已成空白 on 2019-12-04 11:52:25
Konrad Rudolph

Take a look at the SeqAn library, which offers documented, high-performance implementations of various search algorithms and data structures.

For instance, the suffix array class can be used as a drop-in replacement for suffix trees.
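To illustrate why a suffix array can stand in for a suffix tree in substring queries, here is a minimal, self-contained sketch. It does not use SeqAn. The helper names `build_suffix_array` and `count_occurrences` are hypothetical, and the construction is a naive sort chosen for brevity rather than one of the efficient algorithms a library would use.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive construction (illustration only): sort the suffix start positions
// by the suffixes they denote. Worst case O(n^2 log n).
std::vector<std::size_t> build_suffix_array(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Count occurrences of `pattern` with two binary searches: all suffixes that
// start with the pattern form one contiguous range of the suffix array.
std::size_t count_occurrences(const std::string& text,
                              const std::vector<std::size_t>& sa,
                              const std::string& pattern) {
    // First suffix whose leading |pattern| characters are not less than pattern.
    auto lo = std::lower_bound(sa.begin(), sa.end(), pattern,
        [&text](std::size_t pos, const std::string& p) {
            return text.compare(pos, p.size(), p) < 0;
        });
    // First suffix whose leading |pattern| characters are greater than pattern.
    auto hi = std::upper_bound(lo, sa.end(), pattern,
        [&text](const std::string& p, std::size_t pos) {
            return text.compare(pos, p.size(), p) > 0;
        });
    return static_cast<std::size_t>(hi - lo);
}
```

For the text "mississippi", this sketch produces the suffix array 10 7 4 1 0 9 8 6 3 5 2 and counts 2 occurrences of "ssi". Each query costs O(m log n) character comparisons for a pattern of length m.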

Apart from that, your problem sounds inherently complex, and I’m not sure how much you can speed it up. Phrased generally, it is a multiple alignment problem, which is NP-hard. You can probably transform it into something more tractable, since you’re only interested in exact submatches, but it’s still complex.

You might want to have a look at the implementations made for the Pizza&Chili project. They do not include suffix trees, but they do provide suffix arrays and various compressed indexes. The plain (non-compressed) suffix array should be ideal for your purposes, even though it is not a suffix tree.

(You will find downloadable code under the "Index Collection" link.)
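As a rough illustration of an exact-submatch query on a plain suffix array, the sketch below finds the longest common substring of two strings. It is not Pizza&Chili code. The function name `longest_common_substring` is hypothetical, the suffix array is built with a naive sort, and the sketch assumes that the separator byte occurs in neither input.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Longest common substring of a and b using a plain suffix array built over
// a + separator + b. Assumption: `separator` occurs in neither a nor b.
std::string longest_common_substring(const std::string& a, const std::string& b,
                                     char separator = '\x01') {
    const std::string text = a + separator + b;

    // Naive suffix array (illustration only).
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&text](std::size_t x, std::size_t y) {
        return text.compare(x, std::string::npos, text, y, std::string::npos) < 0;
    });

    std::size_t best_len = 0, best_pos = 0;
    for (std::size_t i = 1; i < sa.size(); ++i) {
        const std::size_t p = sa[i - 1], q = sa[i];
        // Only neighbouring suffixes that start in different inputs matter.
        const bool p_in_a = p <= a.size();
        const bool q_in_a = q <= a.size();
        if (p_in_a == q_in_a) continue;

        // Length of the common prefix of the two suffixes. It cannot run
        // across the separator, because the suffix from b never contains it.
        std::size_t len = 0;
        while (p + len < text.size() && q + len < text.size() &&
               text[p + len] == text[q + len]) {
            ++len;
        }
        if (len > best_len) {
            best_len = len;
            best_pos = p;
        }
    }
    return text.substr(best_pos, best_len);
}
```

Called with "banana" and "ananas", the sketch returns "anana". The same idea extends to more than two strings by using one distinct separator per string, at the cost of more bookkeeping.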

SDSL is very mature, with implementations of suffix tree, suffix array, wavelet tree, and many other structures in C++.

"The Succinct Data Structure Library (SDSL) is a powerful and flexible C++11 library implementing succinct data structures. In total, the library contains the highlights of 40 research publications. Succinct data structures can represent an object (such as a bitvector or a tree) in space close to the information-theoretic lower bound of the object while supporting operations of the original object efficiently. The theoretical time complexity of an operation performed on the classical data structure and the equivalent succinct data structure are (most of the time) identical."

A list of structures implemented in SDSL can be found here.

An example that computes the average LCP (longest common prefix) value and the number of runs in the BWT using a compressed suffix tree (taken from the SDSL sources, file text-statistics.cpp):

#include <sdsl/suffix_trees.hpp>
#include <iostream>

using namespace std;
using namespace sdsl;

// Compressed suffix tree (CST) type with default template parameters.
typedef cst_sct3<> cst_t;
typedef cst_t::char_type char_type;

int main(int argc, char* argv[])
{
    if (argc < 2) {
        cout << "Usage: "<< argv[0] << " file" << endl;
        cout << "(1) Generates the CST of file." << endl;
        cout << "(2) Calculates the avg LCP value and the runs in the BWT." << endl;
        return 1;
    }
    cst_t cst;
    // Build the CST from the file; the last argument (1) means the input
    // is read as a sequence of bytes.
    construct(cst, argv[1], 1);

    long double runs = 1;
    long double avg_lcp = 0;
    if (cst.csa.size()) {
        // Scan the BWT once: count the runs of equal characters and sum the LCP values.
        char_type prev_bwt = cst.csa.bwt[0];
        for (uint64_t i=1; i<cst.csa.size(); ++i) {
            char_type bwt = cst.csa.bwt[i];
            if (prev_bwt != bwt) {
                runs += 1.0;
            }
            prev_bwt = bwt;
            avg_lcp += cst.lcp[i];
        }
        avg_lcp /= cst.csa.size();
        // Print the k-th order empirical entropy H_k for k = 0..5.
        for (size_t k=0; k<=5; k++) {
            cout << "H_" << k << ": " << Hk(cst,k).first << endl;
        }
        cout << "avg LCP: " << avg_lcp << endl;
        cout << "runs in BWT: " << runs << endl;
    }
}