Question
I am reading about the (apparently) well-known problem of finding the longest common substring of a set of strings, and have been following these two videos, which explain how to solve it using suffix arrays (note that this question doesn't require you to watch them):
https://youtu.be/Ic80xQFWevc
https://youtu.be/DTLjHSToxmo
The first step is that we start by concatenating all the source strings into one big one, separating each with a 'unique' sentinel, where the ASCII code of each sentinel is less than that of any character that may occur in any string. So we could have the individual strings
abca
bcad
daca
and concatenate them to give
abca#bcad$daca%
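The concatenation step can be sketched as follows. This is only an illustration, not code from the videos: the function name join_with_char_sentinels and the choice of '#' as the first sentinel are assumptions made to match the example above.

```python
def join_with_char_sentinels(strings, first_sentinel="#"):
    """Join strings, ending each with its own sentinel character.

    Hypothetical helper: sentinels are consecutive code points starting
    at first_sentinel ('#' = 35, '$' = 36, '%' = 37, ...).
    Assumes at least one non-empty input string.
    """
    base = ord(first_sentinel)
    lowest = min(ord(ch) for s in strings for ch in s)
    # Every sentinel must sort below every character of the input strings.
    if base + len(strings) > lowest:
        raise ValueError("not enough sentinel characters below the alphabet")
    return "".join(s + chr(base + i) for i, s in enumerate(strings))


joined = join_with_char_sentinels(["abca", "bcad", "daca"])
print(joined)  # abca#bcad$daca%
```

The ValueError branch is where the character-based scheme breaks down once there are more strings than spare code points below the alphabet.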
Now, there are only a limited number of possible sentinels, which leads to problems if we have a large number of strings. Indeed, someone has pointed this out on the first linked video, the response to which was
Correct, the solution is to map your alphabet to the natural numbers and shift up by the number of sentinels you need. This allows you to always have sentinels between the values say [1,N] and your alphabet above that. This trick makes the suffix array scaleable, but you need to undo the shift the decode the true value stored in the suffix array.
I don't understand what the answer means.
I know I could post my question on the video, but I am not guaranteed a (timely) response and the audience here is far wider, so I am asking here: could someone please explain what this answer means and how to implement it?
Answer 1:
I am not sure how to explain it better or differently than the quoted comment does, but maybe an example will help. Note that I am not using the true ASCII codes here, as I do not want to show an example with ~100 source strings. Instead, we will just assume A=1, B=2, C=3, etc.
Thus, your source strings abca, bcad, daca would translate to [1,2,3,1], [2,3,1,4], [4,1,3,1], but in order to fit in the three sentinels, you have to shift all those values up by 3, i.e. 1 to 3 are now sentinels and A=4, B=5, etc.; the joined "string" (actually, it is a list of integers now) is [4,5,6,4, 1, 5,6,4,7, 2, 7,4,6,4, 3]. You can then translate those back to characters (defda...), do the algorithm, and then translate back, undoing the shift.
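A minimal sketch of the shift described here, in Python. The function names encode_with_sentinels and decode are made up for illustration, and ranking the alphabet from 1 in sorted order is an assumption that happens to reproduce A=1, B=2, ... for this example.

```python
def encode_with_sentinels(strings):
    """Turn the strings into one integer list with sentinels 1..k.

    Hypothetical helper: k = len(strings); alphabet symbols are ranked
    from 1 in sorted order and then shifted up by k.
    """
    k = len(strings)
    alphabet = sorted(set("".join(strings)))
    rank = {ch: i + 1 for i, ch in enumerate(alphabet)}
    text = []
    for idx, s in enumerate(strings):
        text.extend(rank[ch] + k for ch in s)  # shifted alphabet values
        text.append(idx + 1)                   # sentinel for string idx
    return text, alphabet


def decode(values, alphabet, k):
    """Undo the shift; sentinel values (<= k) are skipped."""
    return "".join(alphabet[v - k - 1] for v in values if v > k)


strings = ["abca", "bcad", "daca"]
text, alphabet = encode_with_sentinels(strings)
print(text)  # [4, 5, 6, 4, 1, 5, 6, 4, 7, 2, 7, 4, 6, 4, 3]
print(decode(text[5:8], alphabet, len(strings)))  # bca
```

Because the "string" is now a list of integers, the number of sentinels is no longer limited by the character set; the suffix array would be built over the integer list directly.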
However, I would argue that instead of shifting the integers, we could just as well use negative numbers for the sentinels and then work directly on the list of integers instead of converting those back to characters (which is not possible for negative numbers): [1,2,3,1, -1, 2,3,1,4, -2, 4,1,3,1, -3]
(Note: I have not watched the video and do not know how this specific algorithm works; it could be that negative numbers are a problem, e.g. in case this is using some sort of "shortest path" algorithm.)
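The negative-sentinel variant could look like the sketch below. It is not the algorithm from the videos: the suffix array is built by naively sorting list slices, which is only suitable for small inputs, and the names encode_with_negative_sentinels and naive_suffix_array are assumptions, as is restricting the input to lowercase ASCII letters.

```python
def encode_with_negative_sentinels(strings):
    """Map a=1, b=2, ... and end each string with -1, -2, -3, ...

    Hypothetical helper; assumes lowercase ASCII letters only.
    """
    text = []
    for idx, s in enumerate(strings):
        text.extend(ord(ch) - ord("a") + 1 for ch in s)
        text.append(-(idx + 1))  # unique negative sentinel per string
    return text


def naive_suffix_array(text):
    """Sort suffix start positions by comparing the suffixes themselves.

    Python compares lists lexicographically, so negative sentinels
    sort before every alphabet value without any shifting.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


text = encode_with_negative_sentinels(["abca", "bcad", "daca"])
sa = naive_suffix_array(text)
print(text)    # [1, 2, 3, 1, -1, 2, 3, 1, 4, -2, 4, 1, 3, 1, -3]
print(sa[:3])  # [14, 9, 4]
```

The first three entries of the suffix array are the positions of the sentinels, which is the property the sentinel trick relies on: suffixes that start at a sentinel sort ahead of everything else.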
Source: https://stackoverflow.com/questions/57708774/longest-common-substring-via-suffix-array-uses-of-sentinel