Question
I'm looking for a suffix tree library with linear-time construction. All I have found is PATL, but PATL has no documentation and I can't make sense of any of its examples. Is there a suffix tree library for C++ that has decent documentation?
PATL home: http://code.google.com/p/patl/
EDIT:
Motivation: I need to process a large number of strings, find the frequent common substrings, and report when more than n occurrences of any substring occur within t seconds. I implemented a tree with a counter in each node (actually not a counter but a std::vector of visit times, since, as I said, I need the timing), but it is very slow.
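The reporting condition above ("more than n occurrences within t seconds") can be checked with a sliding window over the stored visit times. This is only a minimal sketch under assumptions not stated in the question: the name `exceeds_rate` is hypothetical, timestamps are doubles in seconds, they are already sorted in ascending order, and the window is inclusive at both ends.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true if more than n of the given timestamps
// fall inside some window of t seconds (inclusive at both ends).
// Assumes `times` is sorted in ascending order.
bool exceeds_rate(const std::vector<double>& times, std::size_t n, double t) {
    std::size_t left = 0;
    for (std::size_t right = 0; right < times.size(); ++right) {
        // Shrink the window until it spans at most t seconds.
        while (times[right] - times[left] > t) {
            ++left;
        }
        // The window [left, right] holds right - left + 1 occurrences.
        if (right - left + 1 > n) {
            return true;
        }
    }
    return false;
}
```

Each timestamp enters and leaves the window at most once, so the check is linear in the number of recorded visits for that substring.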
So I thought of batching a certain amount of messages (say, 30 seconds' worth of data) by concatenating them, with some random filler between the strings so that no substring spans more than one string, and then building a suffix tree on the resulting string.
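A minimal sketch of the batching step, with one deliberate substitution: instead of random filler it uses a single separator character that is assumed never to appear in any message (`'\x01'` here is an arbitrary choice). The names `Bulk`, `concatenate`, and `message_index` are hypothetical. Recording where each message starts makes it possible to map a match position in the big string back to its message, and from there to that message's timestamp.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for a batch of messages.
struct Bulk {
    std::string text;                 // all messages, each followed by the separator
    std::vector<std::size_t> starts;  // starts[i] = offset of message i in text
};

// Join the messages; the separator is assumed not to occur in any message,
// so no repeated substring without the separator can span two messages.
Bulk concatenate(const std::vector<std::string>& messages, char separator = '\x01') {
    Bulk bulk;
    for (const std::string& m : messages) {
        bulk.starts.push_back(bulk.text.size());
        bulk.text += m;
        bulk.text += separator;
    }
    return bulk;
}

// Map an offset in bulk.text back to the index of the message containing it.
// Precondition: pos < bulk.text.size(). An offset that lands on a separator
// is attributed to the message just before it.
std::size_t message_index(const Bulk& bulk, std::size_t pos) {
    auto it = std::upper_bound(bulk.starts.begin(), bulk.starts.end(), pos);
    return static_cast<std::size_t>(it - bulk.starts.begin()) - 1;
}
```

Any index built over `bulk.text` will still contain substrings that include the separator; those would have to be filtered out when reporting frequent substrings.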
Answer 1:
Take a look at the SeqAn library which offers high-performance implementations of various search algorithms and data structures with documentation.
For instance, the suffix array class can be used as a drop-in replacement for suffix trees.
Apart from that, your problem sounds inherently complex, and I'm not sure how much you can speed it up. Phrased generally, it is a multiple alignment problem, which is NP-hard. You can probably transform it into something more tractable, since you are only interested in exact submatches, but it is still complex.
Answer 2:
You might want to have a look at the implementations made for the Pizza&Chili project. They do not have suffix trees, but suffix arrays and various compressed indexes. The plain (non-compressed) suffix array should be ideal for your purposes, even though it is not a suffix tree.
(You will find downloadable code under the "Index Collection" link.)
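To illustrate why a plain suffix array can stand in for a suffix tree when counting substring occurrences, here is a minimal standard-library sketch. It is not the Pizza&Chili or SeqAn code, and the function names are hypothetical. The construction simply sorts the suffixes, so it is not linear time; it only shows the idea that every suffix beginning with a given pattern sits in one contiguous range of the sorted array, which binary search can locate.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort all suffix start positions lexicographically.
// Illustration only; real libraries use much faster constructions.
std::vector<std::size_t> build_suffix_array(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Count occurrences of `pattern` in `text` using its suffix array `sa`.
// Only the first pattern.size() characters of each suffix are compared.
std::size_t count_occurrences(const std::string& text,
                              const std::vector<std::size_t>& sa,
                              const std::string& pattern) {
    auto lo = std::lower_bound(sa.begin(), sa.end(), pattern,
        [&text](std::size_t pos, const std::string& p) {
            return text.compare(pos, p.size(), p) < 0;  // suffix prefix < pattern
        });
    auto hi = std::upper_bound(lo, sa.end(), pattern,
        [&text](const std::string& p, std::size_t pos) {
            return text.compare(pos, p.size(), p) > 0;  // pattern < suffix prefix
        });
    return static_cast<std::size_t>(hi - lo);
}
```

For the text "banana", the suffixes starting with "ana" begin at positions 3 and 1, which are adjacent in the sorted order, so the count is 2. The entries between `lo` and `hi` are the match positions themselves, which is what would be needed to look up the time of each occurrence.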
Answer 3:
SDSL is very mature, with implementations of suffix tree, suffix array, wavelet tree, and many other structures in C++.
"The Succinct Data Structure Library (SDSL) is a powerful and flexible C++11 library implementing succinct data structures. In total, the library contains the highlights of 40 research publications. Succinct data structures can represent an object (such as a bitvector or a tree) in space close to the information-theoretic lower bound of the object while supporting operations of the original object efficiently. The theoretical time complexity of an operation performed on the classical data structure and the equivalent succinct data structure are (most of the time) identical."
A list of structures implemented in SDSL can be found here.
An example that computes the average LCP (longest common prefix) value using a suffix tree (from the SDSL sources, file text-statistics.cpp):
#include <sdsl/suffix_trees.hpp>
#include <iostream>

using namespace std;
using namespace sdsl;

typedef cst_sct3<> cst_t;
typedef cst_t::char_type char_type;

int main(int argc, char* argv[])
{
    if (argc < 2) {
        cout << "Usage: " << argv[0] << " file" << endl;
        cout << "(1) Generates the CST of file." << endl;
        cout << "(2) Calculates the avg LCP value and the runs in the BWT." << endl;
        return 1;
    }

    // Build the compressed suffix tree (CST) from the file, one byte per symbol.
    cst_t cst;
    construct(cst, argv[1], 1);

    long double runs = 1;
    long double avg_lcp = 0;
    if (cst.csa.size()) {
        // Scan the BWT once: count runs of equal characters and sum the LCP values.
        char_type prev_bwt = cst.csa.bwt[0];
        for (uint64_t i = 1; i < cst.csa.size(); ++i) {
            char_type bwt = cst.csa.bwt[i];
            if (prev_bwt != bwt) {
                runs += 1.0;
            }
            prev_bwt = bwt;
            avg_lcp += cst.lcp[i];
        }
        avg_lcp /= cst.csa.size();

        // Print the k-th order empirical entropy for k = 0..5.
        for (size_t k = 0; k <= 5; k++) {
            cout << "H_" << k << ": " << Hk(cst, k).first << endl;
        }
        cout << "avg LCP: " << avg_lcp << endl;
        cout << "runs in BWT: " << runs << endl;
    }
}
Source: https://stackoverflow.com/questions/9684034/suffix-tree-library-for-c-with-simple-examples-how-to-use-it