In developing search for a site I am building, I decided to go the cheap and quick way and use Microsoft SQL Server's Full Text Search engine instead of something more robust l
Well, here's the hacked-together version I made using the algorithm I described above. I don't think it is all that great. It uses three (count 'em, three!) loops, an array, and two lists. But, well, it is better than nothing. I also hardcoded the maximum length instead of turning it into a parameter.
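// Requires: using System; using System.Collections.Generic; using System.Linq;
private static string FindRelevantSnippets(string infoText, string[] searchTerms)
{
    // Record where every search term occurs in the text.
    List<int> termLocations = new List<int>();
    foreach (string term in searchTerms)
    {
        if (string.IsNullOrEmpty(term)) continue; // an empty term would match at every position
        int termStart = infoText.IndexOf(term);
        while (termStart >= 0) // >= 0 so a match at index 0 is counted
        {
            termLocations.Add(termStart);
            termStart = infoText.IndexOf(term, termStart + 1);
        }
    }
    // No matches: fall back to the start of the text.
    if (termLocations.Count == 0)
    {
        if (infoText.Length > 250)
            return infoText.Substring(0, 250);
        return infoText;
    }
    // Locations were collected term by term; put them in document order.
    termLocations.Sort();
    List<int> termDistances = new List<int>();
    for (int i = 0; i < termLocations.Count; i++)
    {
        if (i == 0)
            termDistances.Add(0); // the first match has no predecessor
        else
            termDistances.Add(termLocations[i] - termLocations[i - 1]);
    }
    // Find the run of up to five consecutive matches covering the fewest characters.
    int window = Math.Min(5, termLocations.Count);
    int smallestSum = int.MaxValue;
    int smallestSumIndex = 0;
    for (int i = 0; i + window <= termDistances.Count; i++)
    {
        int sum = termDistances.Skip(i + 1).Take(window - 1).Sum();
        if (sum < smallestSum)
        {
            smallestSum = sum;
            smallestSumIndex = i;
        }
    }
    // Start a little before that run and cap the snippet at 250 characters.
    int start = Math.Max(termLocations[smallestSumIndex] - 128, 0);
    int len = Math.Min(250, infoText.Length - start);
    return infoText.Substring(start, len);
}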
Some improvements I could think of would be to return multiple "snippets" with a shorter length that add up to the longer length -- this way multiple parts of the document can be sampled.
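To make the multiple-snippets idea concrete, here is a rough sketch of how it might look. Everything in it is an assumption for illustration: the `SnippetSketch` class, the `FindSnippets` name, the `totalLength` and `snippetCount` parameters, and the choice to match case-insensitively are all hypothetical. It also simply takes the first few non-overlapping matches in document order rather than ranking them by density, so treat it as a starting point and not a finished solution.

```csharp
using System;
using System.Collections.Generic;

public static class SnippetSketch
{
    // Hypothetical helper: returns up to snippetCount excerpts, each at most
    // totalLength / snippetCount characters long, each containing the start of a match.
    public static List<string> FindSnippets(
        string text, string[] terms, int totalLength = 250, int snippetCount = 3)
    {
        int each = Math.Max(1, totalLength / snippetCount);

        // Gather every match position, then put them in document order.
        var hits = new List<int>();
        foreach (string term in terms)
        {
            if (string.IsNullOrEmpty(term)) continue;
            int at = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
            while (at >= 0)
            {
                hits.Add(at);
                at = text.IndexOf(term, at + 1, StringComparison.OrdinalIgnoreCase);
            }
        }
        hits.Sort();

        var snippets = new List<string>();
        int nextFree = 0; // first character not yet covered by a snippet
        foreach (int hit in hits)
        {
            if (snippets.Count == snippetCount) break;
            if (hit < nextFree) continue; // already inside the previous snippet

            // Leave a little leading context, but never overlap the previous snippet.
            int start = Math.Max(nextFree, hit - each / 4);
            int len = Math.Min(each, text.Length - start);
            snippets.Add(text.Substring(start, len));
            nextFree = start + len;
        }

        // No matches at all: fall back to the beginning of the text.
        if (snippets.Count == 0)
            snippets.Add(text.Substring(0, Math.Min(totalLength, text.Length)));

        return snippets;
    }
}
```

Calling `SnippetSketch.FindSnippets(infoText, searchTerms, 250, 3)` would give up to three excerpts of at most 83 characters each, which could then be joined with an ellipsis for display. Making the length a parameter also takes care of the hardcoded 250.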