I have a list of approx. 500,000 strings, each approx. 100 characters long. Given a search term, I want to identify all strings in the list that contain the search term. At
Have you tried loading your strings into a List<string> and then using the LINQ extension Contains method?
var myList = new List<string>();
//Code to load your list goes here...
var searchTerm = "find this";
// Note: true only when an element equals searchTerm exactly (this is not a substring test)
var match = myList.Contains(searchTerm);
// Extension method: must be declared inside a static class
public static bool ContainsFast<T>(this IList<T> list, T item)
{
    return list.IndexOf(item) >= 0;
}
Based on the tests I ran, this variation of Contains was about 33% faster on my machine.
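Note that both List<string>.Contains and the IndexOf variation above look for an element that equals the search term, whereas the question asks for strings that contain it. A minimal sketch of a substring scan follows; the class and method names (SubstringSearch, FindContaining) are made up for illustration, and ordinal comparison is my assumption about what the asker wants.

```csharp
using System;
using System.Collections.Generic;

public static class SubstringSearch
{
    // Hypothetical helper: returns every element that contains searchTerm as a substring.
    public static List<string> FindContaining(IList<string> list, string searchTerm)
    {
        var matches = new List<string>();
        foreach (var s in list)
        {
            // Ordinal comparison compares raw char values and ignores culture rules,
            // which is usually the cheapest option for plain substring matching.
            if (s.IndexOf(searchTerm, StringComparison.Ordinal) >= 0)
                matches.Add(s);
        }
        return matches;
    }
}
```

Called as SubstringSearch.FindContaining(myList, searchTerm), this returns the matching strings themselves rather than a single bool.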
According to these benchmarks, the fastest way to check if a string occurs in a string is the following:
for (int x = 0; x < ss.Length; x++)
    for (int y = 0; y < sf.Length; y++)
        // Adds 1 to c[y] if ss[x] contains sf[y] at least once
        c[y] += ((ss[x].Length - ss[x].Replace(sf[y], String.Empty).Length) / sf[y].Length > 0 ? 1 : 0);
Thus, you could:
Obviously you'd have to adapt them to your List<string> (or whatever data structure you're using).
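As a rough sketch of such an adaptation, the Replace-based length trick from the loop above can be wrapped in a small helper that works on one string at a time. The names ReplaceCount, CountOccurrences and ContainsTerm are hypothetical, and the empty-term guard is my addition.

```csharp
using System;

public static class ReplaceCount
{
    // Counts non-overlapping occurrences of term in text:
    // removing every occurrence shortens text by (occurrences * term.Length).
    public static int CountOccurrences(string text, string term)
    {
        // Guard added for this sketch: an empty term would make Replace throw.
        if (string.IsNullOrEmpty(term)) return 0;
        return (text.Length - text.Replace(term, string.Empty).Length) / term.Length;
    }

    // True if term occurs at least once in text.
    public static bool ContainsTerm(string text, string term)
    {
        return CountOccurrences(text, term) > 0;
    }
}
```

With a List<string>, something like list.FindAll(s => ReplaceCount.ContainsTerm(s, searchTerm)) would then collect the matches.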
You should try using the Dictionary class. For exact-key lookups it's much faster than a List because it finds keys by hash instead of scanning every element.
Dictionary<String, String> ldapDocument = new Dictionary<String, String>();
//load your list here
//Sample -> ldapDocument.Add("014548787","014548787");
var match = ldapDocument.ContainsKey(stringToMatch);
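ContainsKey only matches whole keys, so on its own it won't find a term inside a longer string. If an indexed lookup is still wanted for substring search, one possible approach (my suggestion, not something from the answer above) is a trigram index: map every 3-character window to the strings that contain it, intersect those lists for the search term, then verify the candidates. The TrigramIndex class below is a hypothetical sketch and assumes ordinal, case-sensitive matching.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical class: a simple in-memory trigram index for substring search.
public sealed class TrigramIndex
{
    private readonly IReadOnlyList<string> _items;
    private readonly Dictionary<string, List<int>> _postings = new Dictionary<string, List<int>>();

    public TrigramIndex(IReadOnlyList<string> items)
    {
        _items = items;
        for (int i = 0; i < items.Count; i++)
        {
            // Record each distinct 3-character window of the string once.
            var seen = new HashSet<string>();
            for (int j = 0; j + 3 <= items[i].Length; j++)
            {
                var gram = items[i].Substring(j, 3);
                if (!seen.Add(gram)) continue;
                if (!_postings.TryGetValue(gram, out var ids))
                    _postings[gram] = ids = new List<int>();
                ids.Add(i);
            }
        }
    }

    public List<string> Search(string term)
    {
        // Terms shorter than 3 characters contain no trigram, so fall back to a full scan.
        if (term.Length < 3)
            return _items.Where(s => s.Contains(term)).ToList();

        // Intersect the posting lists of every trigram in the term.
        HashSet<int> candidates = null;
        for (int j = 0; j + 3 <= term.Length; j++)
        {
            if (!_postings.TryGetValue(term.Substring(j, 3), out var ids))
                return new List<string>(); // a trigram that occurs nowhere rules out any match
            if (candidates == null) candidates = new HashSet<int>(ids);
            else candidates.IntersectWith(ids);
        }

        // Candidates share all trigrams but may not contain the term contiguously, so verify.
        return candidates.OrderBy(i => i)
                         .Select(i => _items[i])
                         .Where(s => s.Contains(term))
                         .ToList();
    }
}
```

The index costs extra memory and build time up front, so it only pays off when the same list is searched many times.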
I've heard good things about Lucene.NET when it comes to performing quick full-text searches. They've done the work to figure out the fastest data structures and such to use. I'd suggest giving that a shot.
Otherwise, you might just try something like this:
var matches = list.AsParallel().Where(s => s.Contains(searchTerm)).ToList();
But it probably won't get you down to 100ms.
Have you tried the following?
list.FindAll(x => x.Contains("YourTerm")); // FindAll already returns a List<string>, so ToList() is not needed
For some reason, list.AsParallel().Where(...) is slower than list.FindAll(...) on my PC.
list.AsParallel().Where(x => x.Contains("YourTerm")).ToList();
Hope this will help you.
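Since the timings reported in this thread differ from machine to machine, it may be worth measuring both variants on your own data. Below is a minimal timing sketch; SearchBenchmark, MakeData, Time and Compare are hypothetical names, and the random lowercase test data is only a placeholder for your real strings.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class SearchBenchmark
{
    // Builds placeholder data: `count` random lowercase strings of `length` characters.
    public static List<string> MakeData(int count, int length, int seed = 42)
    {
        var rng = new Random(seed);
        var list = new List<string>(count);
        var buffer = new char[length];
        for (int i = 0; i < count; i++)
        {
            for (int j = 0; j < length; j++)
                buffer[j] = (char)('a' + rng.Next(26));
            list.Add(new string(buffer));
        }
        return list;
    }

    // Runs the search once and returns the match count and elapsed milliseconds.
    public static (int Count, long Ms) Time(Func<List<string>> search)
    {
        var sw = Stopwatch.StartNew();
        var result = search();
        sw.Stop();
        return (result.Count, sw.ElapsedMilliseconds);
    }

    // Times the sequential FindAll variant against the PLINQ variant and prints both.
    public static void Compare(List<string> list, string term)
    {
        var sequential = Time(() => list.FindAll(x => x.Contains(term)));
        var parallel = Time(() => list.AsParallel().Where(x => x.Contains(term)).ToList());
        Console.WriteLine("FindAll:    " + sequential.Count + " matches in " + sequential.Ms + " ms");
        Console.WriteLine("AsParallel: " + parallel.Count + " matches in " + parallel.Ms + " ms");
    }
}
```

For example, SearchBenchmark.Compare(SearchBenchmark.MakeData(500000, 100), "abc") approximates the sizes in the question. A single run includes JIT warm-up, so repeat the call a few times before trusting the numbers.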