C# Finding relevant document snippets for search result display

野性不改 2021-02-04 12:07

In developing search for a site I am building, I decided to go the cheap and quick way and use Microsoft SQL Server's Full Text Search engine instead of something more robust l

8 Answers
  •  抹茶落季
    2021-02-04 13:12

    Well, here's the hacked-together version I made using the algorithm I described above. I don't think it is all that great: it uses three (count 'em, three!) loops, an array, and two lists. But it is better than nothing. I also hardcoded the maximum length instead of turning it into a parameter.

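    // Requires: using System; using System.Collections.Generic; using System.Linq;
    private static string FindRelevantSnippets(string infoText, string[] searchTerms)
    {
        // Record the start index of every occurrence of every search term.
        List<int> termLocations = new List<int>();
        foreach (string term in searchTerms)
        {
            if (string.IsNullOrEmpty(term))
                continue; // an empty term matches at every position, so skip it

            int termStart = infoText.IndexOf(term);
            while (termStart >= 0) // IndexOf returns -1 once there are no more matches
            {
                termLocations.Add(termStart);
                termStart = infoText.IndexOf(term, termStart + 1);
            }
        }

        // No matches at all: fall back to the beginning of the text.
        if (termLocations.Count == 0)
        {
            if (infoText.Length > 250)
                return infoText.Substring(0, 250);
            else
                return infoText;
        }

        termLocations.Sort();

        // Distance from each match to the one before it (0 for the first match).
        List<int> termDistances = new List<int>();
        for (int i = 0; i < termLocations.Count; i++)
        {
            if (i == 0)
            {
                termDistances.Add(0);
                continue;
            }
            termDistances.Add(termLocations[i] - termLocations[i - 1]);
        }

        // Find the run of up to five consecutive distances with the smallest sum,
        // i.e. the place where the matches are packed most tightly.
        int smallestSum = int.MaxValue;
        int smallestSumIndex = 0;
        for (int i = 0; i < termDistances.Count; i++)
        {
            int sum = termDistances.Skip(i).Take(5).Sum();
            if (sum < smallestSum)
            {
                smallestSum = sum;
                smallestSumIndex = i;
            }
        }

        // Start a little before the chosen match and take up to the hardcoded
        // maximum of 250 characters, clipped to the end of the text.
        int start = Math.Max(termLocations[smallestSumIndex] - 128, 0);
        int len = Math.Min(250, infoText.Length - start);
        return infoText.Substring(start, len);
    }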
    

    One improvement I can think of would be to return multiple shorter "snippets" whose lengths add up to the maximum length, so that several parts of the document can be sampled.
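
    The multiple-snippet idea could look roughly like the sketch below. This is only an illustration under stated assumptions: the `SnippetSketch` class, the `FindSnippets` name, the `maxSnippets` and `snippetLength` parameters, and the use of ordinal comparison are all hypothetical and not part of the code above. It also simply takes the first few non-overlapping match regions instead of ranking them by how densely the terms cluster.

```csharp
using System;
using System.Collections.Generic;

public static class SnippetSketch
{
    // Hypothetical helper: returns up to maxSnippets excerpts, each at most
    // snippetLength characters long and each containing the start of a match.
    public static List<string> FindSnippets(
        string infoText, string[] searchTerms, int maxSnippets = 3, int snippetLength = 80)
    {
        // Collect the start index of every occurrence of every term.
        var locations = new List<int>();
        foreach (string term in searchTerms)
        {
            if (string.IsNullOrEmpty(term))
                continue; // an empty term would match everywhere

            int pos = infoText.IndexOf(term, StringComparison.Ordinal);
            while (pos >= 0)
            {
                locations.Add(pos);
                pos = infoText.IndexOf(term, pos + 1, StringComparison.Ordinal);
            }
        }
        locations.Sort();

        var snippets = new List<string>();
        int lastEnd = -1; // exclusive end index of the previous snippet
        foreach (int loc in locations)
        {
            if (snippets.Count == maxSnippets)
                break;
            if (loc < lastEnd)
                continue; // this match is already inside the previous snippet

            // Center the window on the match, without overlapping the previous one.
            int start = Math.Max(loc - snippetLength / 2, Math.Max(lastEnd, 0));
            int len = Math.Min(snippetLength, infoText.Length - start);
            snippets.Add(infoText.Substring(start, len));
            lastEnd = start + len;
        }

        // No matches: fall back to the beginning of the text.
        if (snippets.Count == 0)
            snippets.Add(infoText.Substring(0, Math.Min(snippetLength, infoText.Length)));

        return snippets;
    }
}
```

    A caller could join the results with an ellipsis, for example `string.Join(" ... ", SnippetSketch.FindSnippets(text, terms))`, and choose `maxSnippets * snippetLength` to stay within the same overall budget as the single 250-character excerpt.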
