longest-substring

Optimisation ideas - Longest common substring

蓝咒 submitted on 2019-12-11 19:24:56
Question: I have a program that is supposed to find the longest common substring of a number of strings. It does, but when the strings are very long (more than 8000 characters), it runs slowly (1.5 seconds). Is there any way to optimise it? The program is this: //#include "stdafx.h" #include <iostream> #include <string> #include <vector> #include <cassert> using namespace std; const unsigned short MAX_STRINGS = 10; const unsigned int MAX_SIZE=10000; vector<string> strings; unsigned int len;
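One common way to speed up a longest-common-substring search over several strings is to binary-search the answer length, since a common substring of length L implies one of length L-1. The Python sketch below is not the asker's C++ program; the function names are hypothetical, and it trades memory (sets of substrings) for simplicity. A rolling hash or a suffix structure would be the usual next step for very long inputs.

```python
def common_substrings_of_length(strings, length):
    """Return the substrings of exactly `length` characters found in every string."""
    common = None
    for s in strings:
        seen = {s[i:i + length] for i in range(len(s) - length + 1)}
        common = seen if common is None else common & seen
        if not common:
            return set()  # no candidate survives, stop early
    return common


def longest_common_substring(strings):
    """Binary-search the longest length for which a common substring exists."""
    if not strings:
        return ""
    lo, hi = 0, min(len(s) for s in strings)
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        found = common_substrings_of_length(strings, mid)
        if found:
            best = min(found)  # pick one deterministically
            lo = mid
        else:
            hi = mid - 1
    return best


print(longest_common_substring(["abcdefg", "xxcdefy", "zcdez"]))
```

Each feasibility check scans every string once, so the number of passes grows with the logarithm of the shortest string's length rather than with the length itself.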

longest common subsequence function does not work for all examples

橙三吉。 submitted on 2019-12-11 18:59:59
Question: EDIT: UP. The code does not work properly with the strings below: "1 11 23 1 18 9 15 23 5" "11 1 18 1 20 5 11 1" EDIT: I noticed that if I change 20 to 40 in the second string, the function works properly. For the strings "12 4 55 11 8 43 22 90 5 88 15" "15 66 4 36 43 22 78 88 32" it works properly. Where is the problem? Here is my code: int[][] tabelka = new int[linia1.length()+1][linia2.length()+1]; for (int i = 0; i<linia1.length(); i++) { for (j = 0; j<linia2.length(); j++) { if ( linia1
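The snippet above is cut off before the comparison, so the actual bug cannot be confirmed from it. One thing worth checking, purely as an assumption, is whether the table is filled per character instead of per number, since the inputs are space-separated numbers. The Python sketch below (hypothetical function name, not the asker's Java) runs the standard longest-common-subsequence table over whole tokens and then walks back to recover the sequence.

```python
def lcs_tokens(line1, line2):
    """Longest common subsequence of two lines, compared token by token."""
    a, b = line1.split(), line2.split()
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one optimal sequence.
    result = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


print(lcs_tokens("12 4 55 11 8 43 22 90 5 88 15", "15 66 4 36 43 22 78 88 32"))
```

Note that the table indices are offset by one relative to the token indices, which is a frequent source of errors in this kind of loop.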

longest common substring between 2 HUGE files - out of memory: java heap space

…衆ロ難τιáo~ submitted on 2019-12-08 12:21:11
Question: I'm completely brain-fried after this. I need to find the longest common substring between 2 files, a small one and a HUGE one. I don't even know where to begin the search. Here's what I have so far: import java.io.BufferedReader; import java.io.FileReader; import java.io.IOException; public class MyString { public static void main (String[] args) throws IOException { BufferedReader br = new BufferedReader(new FileReader("MobyDick.txt")); BufferedReader br2 = new BufferedReader(new

Longest common substring via suffix array: do we really need unique sentinels?

烂漫一生 submitted on 2019-12-08 04:40:59
Question: I am reading about LCP arrays and their use, in conjunction with suffix arrays, in solving the "longest common substring" problem. This video states that the sentinels used to separate individual strings must be unique, and not be contained in any of the strings themselves. Unless I am mistaken, the reason for this is so that when we construct the LCP array (by comparing how many characters adjacent suffixes have in common) we don't count the sentinel value in the case where two sentinels happen
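To make the role of the sentinels concrete, here is a minimal Python sketch for two strings. It is an illustration under stated assumptions, not the video's implementation: the suffix array is built by naive sorting (fine for a demonstration, far too slow for large inputs), characters are encoded as integers shifted by 2, and the values 0 and 1 serve as the two distinct sentinels so they cannot collide with any character. All names are hypothetical.

```python
def longest_common_substring_sa(a, b):
    """LCS of two strings via a naive suffix array and adjacent-suffix LCPs."""
    SENTINEL_A, SENTINEL_B = 0, 1  # distinct, and below every encoded character
    text = ([ord(c) + 2 for c in a] + [SENTINEL_A]
            + [ord(c) + 2 for c in b] + [SENTINEL_B])
    n = len(text)

    def owner(pos):
        # 0 if the suffix starts in `a` (or at its sentinel), 1 otherwise
        return 0 if pos <= len(a) else 1

    # Naive construction: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(n), key=lambda i: text[i:])

    best_len, best_start = 0, 0
    for prev, cur in zip(suffix_array, suffix_array[1:]):
        if owner(prev) == owner(cur):
            continue  # both suffixes come from the same input string
        k = 0
        while prev + k < n and cur + k < n and text[prev + k] == text[cur + k]:
            k += 1
        if k > best_len:
            best_len, best_start = k, prev
    return "".join(chr(v - 2) for v in text[best_start:best_start + best_len])


print(longest_common_substring_sa("banana", "ananas"))
```

Because each sentinel value occurs exactly once in the combined sequence, two different suffixes can never agree on a sentinel at the same offset, so no common prefix extends across a string boundary.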

Longest Common Substring with wrong character tolerance

纵然是瞬间 submitted on 2019-12-07 11:38:55
Question: I have a script I found on here that works well when looking for the Longest Common Substring. However, I need it to tolerate some incorrect/missing characters. I would like to be able to either input a percentage of similarity required, or perhaps specify the number of missing/wrong characters allowable. For example, I want to find this string: big yellow school bus inside of this string: they rode the bigyellow schook bus that afternoon This is the code I'm currently using: function longest

r which rows have longest partial string match between two vectors

纵然是瞬间 submitted on 2019-12-06 17:40:42
I have two vectors that contain the names of towns, both of which are in different formats, and I need to match the names of water districts (water) to their respective census data (towns). Essentially for each row in water, I need to know the best match in towns, since most of them contain similar words such as city. One other problem I see is that words are capitalized in one data set and are not capitalized in another. Here is my example data: towns= c("Acalanes Ridge CDP, Contra Costa County", "Bellflower city, Los Angeles County", "Arvin city, Kern County", "Alturas city, Modoc County")

Java implementation for longest common substring of n strings

半腔热情 submitted on 2019-12-05 15:04:58
I need to find the longest common substring of n strings and use the result in my project. Is there any existing implementation/library in Java which already does this? Thanks for your replies in advance. What about concurrent-trees? It is a small (~100 KB) library available in Maven Central. The algorithm uses a combination of radix and suffix trees, which is known to have linear time complexity (Wikipedia). public static String getLongestCommonSubstring(Collection<String> strings) { LCSubstringSolver solver = new LCSubstringSolver(new DefaultCharSequenceNodeFactory()); for (String s:

Longest repeated (k times) substring

白昼怎懂夜的黑 submitted on 2019-12-05 14:27:55
Question: I know this is a somewhat beaten topic, but I have reached the limit of help I can get from what's already been answered. This is for the Rosalind project problem LREP. I'm trying to find the longest k-peated substring in a string and I've been provided the suffix tree, which is nice. I know that I need to annotate the suffix table with the number of descendant leaves from each node, then find nodes with >=k descendants, and finally find the deepest of those nodes. Theory-wise I'm set. I've
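The same idea (a substring occurring at least k times is a shared prefix of at least k suffixes) can be sketched without the provided suffix tree. The Python below deliberately swaps in a naively sorted suffix array instead of the tree annotation described above: in sorted order, any k suffixes sharing a prefix sit in a contiguous window, so it compares the first and last suffix of each window of size k. The function name is hypothetical, occurrences may overlap, and the construction is too slow for large inputs.

```python
def longest_k_repeated_substring(s, k):
    """Longest substring of `s` occurring at least `k` times (overlaps allowed)."""
    if k < 1 or k > len(s):
        return ""
    # Naive suffix array: start positions sorted by the suffix text.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    best = ""
    for i in range(len(suffix_array) - k + 1):
        first = s[suffix_array[i]:]
        last = s[suffix_array[i + k - 1]:]
        # The common prefix of the window's first and last suffix is shared
        # by every suffix in between, because the suffixes are sorted.
        n = 0
        while n < len(first) and n < len(last) and first[n] == last[n]:
            n += 1
        if n > len(best):
            best = first[:n]
    return best


print(longest_k_repeated_substring("banana", 2))
```

With a suffix tree, the window of k sorted suffixes corresponds to an internal node with at least k descendant leaves, and the common-prefix length corresponds to that node's string depth.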

Longest Common Substring with wrong character tolerance

折月煮酒 submitted on 2019-12-05 12:01:08
I have a script I found on here that works well when looking for the Longest Common Substring. However, I need it to tolerate some incorrect/missing characters. I would like to be able to either input a percentage of similarity required, or perhaps specify the number of missing/wrong characters allowable. For example, I want to find this string: big yellow school bus inside of this string: they rode the bigyellow schook bus that afternoon This is the code I'm currently using: function longest_common_substring($words) { $words = array_map('strtolower', array_map('trim', $words)); $sort_by_strlen =
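One way to express "a number of missing/wrong characters allowable" is approximate substring matching by edit distance, where a match may start anywhere in the longer text. The Python sketch below is a swapped-in technique, not a modification of the PHP script above; the function names are hypothetical, and it counts insertions, deletions, and substitutions as one error each.

```python
def best_approximate_match(pattern, text):
    """Return (errors, end): the fewest edits needed to match `pattern` against
    any substring of `text`, and the text index where that match ends."""
    # Row 0 is all zeros: a match may begin at any position in the text for free.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        cur = [i]  # matching i pattern characters against nothing costs i
        for j, tc in enumerate(text, start=1):
            cost = 0 if pc == tc else 1
            cur.append(min(prev[j - 1] + cost,  # match or substitute
                           prev[j] + 1,         # pattern character missing in text
                           cur[j - 1] + 1))     # extra character in text
        prev = cur
    errors = min(prev)
    return errors, prev.index(errors)


def contains_within(pattern, text, max_errors):
    """True if `pattern` occurs in `text` with at most `max_errors` edits."""
    return best_approximate_match(pattern, text)[0] <= max_errors


text = "they rode the bigyellow schook bus that afternoon"
print(best_approximate_match("big yellow school bus", text)[0])
```

A percentage threshold can be derived from the error count, for example by comparing it with the pattern length; which normalisation fits best depends on the data.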

R - Longest common substring

眉间皱痕 submitted on 2019-11-28 21:53:39
Does anyone know of an R package that solves the longest common substring problem? I am looking for something fast that could work on vectors. Check out the "Rlibstree" package on omegahat: http://www.omegahat.org/Rlibstree/. This uses http://www.icir.org/christian/libstree/. You should look at the LCS function of the qualV package. It is implemented in C, and therefore quite efficient. The question here is not totally clear on the intended application of the solution to the longest common substring problem. A common application that I encounter is matching between names in different datasets. The