Longest common substring via suffix array: uses of sentinel

Posted by 我只是一个虾纸丫 on 2019-12-13 20:24:51

Question


I am reading about the (apparently) well-known problem of finding the longest common substring of a set of strings, and have been following these two videos, which explain how to solve it using suffix arrays (note that this question doesn't require you to watch them):

https://youtu.be/Ic80xQFWevc

https://youtu.be/DTLjHSToxmo

The first step is that we start by concatenating all the source strings into one big one, separating each with a 'unique' sentinel, where the ASCII code of each sentinel is less than that of any character that may occur in any string. So we could have the individual strings

abca
bcad
daca

and concatenate them to give

abca#bcad$daca%

Now, there are only a limited number of possible sentinels, which leads to problems if we have a large number of strings. Indeed, someone has pointed this out on the first linked video, the response to which was

Correct, the solution is to map your alphabet to the natural numbers and shift up by the number of sentinels you need. This allows you to always have sentinels between the values, say, [1,N] and your alphabet above that. This trick makes the suffix array scalable, but you need to undo the shift to decode the true value stored in the suffix array.

I don't understand what the answer means.

I know I could post my question on the video, but I am not guaranteed a (timely) response and the audience here is far wider, so I am asking here: could someone please explain what this answer means and how to implement it?


Answer 1:


I am not sure how to explain it better or differently than the quoted comment does, but maybe an example will help. Note that I am not using the true ASCII codes here, as I do not want to show an example with ~100 source strings. Instead, we will just assume A=1, B=2, C=3, etc.

Thus, your source strings abca, bcad, daca would translate to [1,2,3,1], [2,3,1,4], [4,1,3,1]. In order to fit in the three sentinels, you have to shift all those values up by 3, i.e. the values 1 to 3 are now sentinels and A=4, B=5, etc. The joined "string" (actually, it is a list of integers now) is [4,5,6,4, 1, 5,6,4,7, 2, 7,4,6,4, 3]. You can then translate those integers back to characters (defda...), run the algorithm, and finally translate the result back to the original alphabet by undoing the shift.
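The shift described above can be sketched in Python as follows. This is only a minimal illustration, not the code from the videos: the function names `encode_with_sentinels` and `decode` are made up for this example, and the alphabet is assumed to be the sorted set of characters that actually occur in the input (which gives a=1, b=2, c=3, d=4 here).

```python
def encode_with_sentinels(strings):
    """Join strings into one list of integers with unique sentinels 1..N."""
    shift = len(strings)  # one sentinel per string, so shift the alphabet by N
    # Assumption: the alphabet is the sorted set of characters in the input.
    alphabet = sorted(set("".join(strings)))
    rank = {ch: i + 1 for i, ch in enumerate(alphabet)}  # a=1, b=2, ...
    text = []
    for sentinel, s in enumerate(strings, start=1):
        text.extend(rank[ch] + shift for ch in s)  # shifted characters
        text.append(sentinel)                      # sentinel in [1, N]
    return text, alphabet, shift


def decode(values, alphabet, shift):
    """Undo the shift for a run of non-sentinel values."""
    return "".join(alphabet[v - shift - 1] for v in values)


text, alphabet, shift = encode_with_sentinels(["abca", "bcad", "daca"])
print(text)                                 # the joined integer "string"
print(decode(text[5:8], alphabet, shift))   # decode one slice back to characters
```

Because every sentinel is smaller than every shifted character, any number of strings can be joined this way without running out of sentinel symbols.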

However, I would argue that instead of shifting the integers, we could just as well use negative numbers for the sentinels and work directly on the list of integers, without converting it back to characters (which is not possible for negative numbers): [1,2,3,1, -1, 2,3,1,4, -2, 4,1,3,1, -3]. (Note: I have not watched the video and do not know how this specific algorithm works; negative numbers could be a problem, e.g. if it uses some sort of "shortest path" algorithm.)
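As a rough sketch of the negative-sentinel variant, the following Python builds the integer list and a suffix array over it. The suffix array here is a deliberately naive comparison sort, which is an assumption made for illustration and not the construction used in the videos; the helper names are hypothetical.

```python
def join_with_negative_sentinels(strings):
    """Join strings as integers (a=1, b=2, ...) separated by -1, -2, ..."""
    text = []
    for i, s in enumerate(strings, start=1):
        # Assumption: input consists of lowercase letters a-z only.
        text.extend(ord(ch) - ord("a") + 1 for ch in s)
        text.append(-i)  # unique sentinel, smaller than every character
    return text


def naive_suffix_array(text):
    """Sort suffix start positions by comparing the suffixes directly.

    This is slow (it copies and compares whole suffixes) and is only
    meant to show that integer lists with negative values sort fine.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


text = join_with_negative_sentinels(["abca", "bcad", "daca"])
sa = naive_suffix_array(text)
print(text)
print(sa[:3])  # the suffixes starting at sentinels come first
```

A comparison-based construction like this one does not care about the sign of the values. A construction that uses symbol values as array indices (counting or radix sort buckets) would need non-negative values, which is one case where the shifted encoding is the safer choice.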



Source: https://stackoverflow.com/questions/57708774/longest-common-substring-via-suffix-array-uses-of-sentinel
