How to differentiate two very long strings in C++?

Question

I would like to compute the Levenshtein distance between two strings whose lengths are very large.

Edit 2:
As Bobah pointed out, the title was misleading, so I have updated the title of the question.
The initial title was "How to declare a 100000x100000 2-D integer array in C++?"
The original content was:
Is there any way to declare int x[100000][100000] in C++?
When I declare it globally, the compiler produces error: size of array ‘x’ is too large.
One method could be to use map< pair< int , int > , int > mymap.
But allocating and deallocating takes more time.
Is there any other way, such as using vector<int> myvec?

Answer 1:

For memory blocks that large, the best approach is dynamic allocation using the operating system's facilities for adding virtual memory to the process.

However, look how large a block you are trying to allocate:

 40 000 000 000 bytes

I take my previous advice back. For a block that large, the best approach is to analyze the problem and figure out a way to use less memory.
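For reference, the figure above can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name full_matrix_bytes is made up for illustration, and the 4-byte cell size assumes sizeof(int) == 4, which is common but not guaranteed by the standard.

```cpp
#include <cstdint>

// Hypothetical helper (not from any library): bytes needed for a full
// rows x cols table with `cell_bytes` bytes per cell.
// 64-bit arithmetic is used so the product does not overflow for these sizes.
constexpr std::uint64_t full_matrix_bytes(std::uint64_t rows, std::uint64_t cols,
                                          std::uint64_t cell_bytes) {
    return rows * cols * cell_bytes;
}

// Assumption: sizeof(int) == 4.
static_assert(full_matrix_bytes(100000, 100000, 4) == 40000000000ULL,
              "100000 x 100000 ints at 4 bytes each is 4e10 bytes");
```

That is roughly 37 GiB for a single array, far beyond what a static or global declaration can provide.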

Answer 2:

The edit distance matrix can be filled one row at a time. Remembering the previous row is enough to compute the current row. This observation reduces space usage from quadratic to linear. Makes sense?
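A minimal sketch of that idea, assuming plain byte-wise comparison of std::string contents and unit cost for insertion, deletion and substitution. The function name levenshtein is my own choice, not something from a library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance that keeps only two rows of the table.
// Space: O(min(|a|, |b|)); time is still O(|a| * |b|).
std::size_t levenshtein(const std::string& a_in, const std::string& b_in) {
    // Let `b` be the shorter string so that the rows are as short as possible.
    const std::string& a = a_in.size() >= b_in.size() ? a_in : b_in;
    const std::string& b = a_in.size() >= b_in.size() ? b_in : a_in;
    assert(b.size() <= a.size());

    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    // Row 0: turning the empty prefix of `a` into b[0..j) takes j insertions.
    std::iota(prev.begin(), prev.end(), std::size_t{0});

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // turning a[0..i) into the empty string takes i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t substitution =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            const std::size_t deletion = prev[j] + 1;
            const std::size_t insertion = curr[j - 1] + 1;
            curr[j] = std::min({substitution, deletion, insertion});
        }
        std::swap(prev, curr);  // the row just computed becomes the previous row
    }
    return prev[b.size()];
}
```

For two strings of length 100000 this needs two rows of 100001 entries each instead of the full table. The running time does not improve, though: the inner loop still executes about 10^10 times.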

Answer 3:

Your question is very interesting, but the title is misleading.

This is what you need in terms of data model (y - first string, x - second string, * - distance matrix).

      y <-- first string (scrolls from top down)
      y
  x  x  x  x  x  x  x  x  <- second string (scrolls from left to right)
      y *  *  *
      y *  *  *
      y *  *  * <-- distance matrix (a donut) scrolls together with strings
                    and grows/shrinks when needed, as explained below
      y

Keep two relatively long (but still << N, the full string length) character buffers and a relatively small (<< buffer size) rectangular distance matrix, starting out square.

Make the matrix a donut: a two-dimensional ring buffer (you can use the one from Boost, or just std::deque).
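To make the donut concrete, here is a rough sketch of a fixed-size two-dimensional ring buffer. Note that it deliberately uses neither Boost nor std::deque: it is a flat std::vector with two wrap-around offsets, which I picked only because it is short. The class name Donut and all member names are hypothetical, and growing or shrinking is not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size 2-D ring buffer ("donut"). Illustration only.
class Donut {
public:
    Donut(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), cells_(rows * cols, 0) {
        assert(rows > 0 && cols > 0);
    }

    // Logical (r, c) is mapped to physical storage through the two offsets,
    // so rotating never moves any cell in memory.
    int& at(std::size_t r, std::size_t c) {
        assert(r < rows_ && c < cols_);
        return cells_[((r + row0_) % rows_) * cols_ + (c + col0_) % cols_];
    }

    // Scroll down by one: the old logical row 0 becomes the new last row,
    // ready to be overwritten with freshly computed distances.
    void rotate_rows() { row0_ = (row0_ + 1) % rows_; }

    // Scroll right by one: the old logical column 0 becomes the new last column.
    void rotate_cols() { col0_ = (col0_ + 1) % cols_; }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    std::size_t row0_ = 0, col0_ = 0;
    std::vector<int> cells_;  // allocated once, in the constructor
};
```

Rotation only changes an offset, so scrolling costs O(1) plus the work of recomputing the one row or column that was recycled.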

When the string fragments currently covered by the matrix match 100%, shift both buffers by one and rotate the donut around both axes, recalculating one new row and one new column of the distance matrix.

When the match is below 100% and below the configured threshold, grow both dimensions of the donut without dropping any rows or columns, and keep growing until either the match rises above the threshold or you reach the maximum donut size. When the match ratio reaches the threshold from below, scroll the donut, discarding the heads of the x and y buffers and aligning them at the same time (only X needs to move by 1 when the distance matrix shows that X[i] does not exist in Y but X[i+1, i+m] matches Y[j, j+m-1]).

As a result you will have a simple yet very efficient heuristic diff engine with a deterministic, bounded memory footprint, and all memory can be pre-allocated at startup so that no dynamic allocation slows it down at runtime.

Apache v2 license, in case you decide to go for it.

Source: https://stackoverflow.com/questions/26202686/how-to-differentiate-two-very-long-strings-in-c

Tags

string

algorithm

memory-management

string-matching

information-theory