C# Finding relevant document snippets for search result display

野性不改 2021-02-04 12:07

In developing search for a site I am building, I decided to go the cheap and quick way and use Microsoft SQL Server's Full Text Search engine instead of something more robust l

8 answers
  •  庸人自扰
    2021-02-04 12:52

    Wrote a function to do this just now. You want to pass in:

    Inputs:

    Document text
    This is the full text of the document you're taking a snippet from. Most likely you will want to strip out any BBCode/HTML from this document.

    Original query
    The string the user entered as their search

    Snippet length
    Length of the snippet you wish to display.

    Return Value:

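    Start index in the document text to take the snippet from. To get the snippet, simply do documentText.Substring(returnValue, snippetLength). Note that the function trims the document text before scanning, so the index refers to the trimmed text, and a document shorter than the snippet length should be returned whole rather than passed to Substring. Returning an index has the advantage that you know whether the snippet is taken from the start, middle or end, so you can add some decoration such as "..." at the snippet start/end if you wish.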
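
    As a rough sketch of that decoration step, something like the following could work. The class and method names (SnippetFormatter, FormatSnippet) are made up for illustration and are not part of the function in this answer; the helper assumes the start index came from a function that trims the document first, so it trims too.

```csharp
using System;

static class SnippetFormatter
{
    // Hypothetical helper: cuts the snippet out of the document and adds "..."
    // on whichever side was truncated.
    public static string FormatSnippet(string documentText, int start, int snippetLength)
    {
        // Apply the same Trim() as the start-point function so the index lines up.
        documentText = documentText.Trim();

        // Whole document fits: nothing to cut, nothing to decorate.
        if (documentText.Length <= snippetLength) return documentText;

        // Clamp defensively so Substring cannot run past the end of the text.
        start = Math.Max(0, Math.Min(start, documentText.Length - snippetLength));

        var snippet = documentText.Substring(start, snippetLength);
        var prefix = start > 0 ? "..." : "";
        var suffix = start + snippetLength < documentText.Length ? "..." : "";
        return prefix + snippet + suffix;
    }
}
```

    You would call it as SnippetFormatter.FormatSnippet(documentText, returnValue, snippetLength) in place of the bare Substring call.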

    Performance

    A resolution of 1 finds the best snippet but moves the window along one character at a time. Set this value higher (the code below uses 5) to speed up execution at the cost of some accuracy.

    Tweaks

    You can work out the score however you want. In this example I've used Math.Pow(wordLength, 2) to favour longer words.

    // Requires: using System; using System.Collections.Generic; using System.Linq;
    private static int GetSnippetStartPoint(string documentText, string originalQuery, int snippetLength)
    {
        // Normalise document text
        documentText = documentText.Trim();
        if (string.IsNullOrWhiteSpace(documentText)) return 0;
    
        // Return 0 if entire doc fits in snippet
        if (documentText.Length <= snippetLength) return 0;
    
        // Break query down into words
        var wordsInQuery = new HashSet<string>();
        {
            var queryWords = originalQuery.Split(' ');
            foreach (var word in queryWords)
            {
                var normalisedWord = word.Trim().ToLower();
                if (string.IsNullOrWhiteSpace(normalisedWord)) continue;
                if (wordsInQuery.Contains(normalisedWord)) continue;
                wordsInQuery.Add(normalisedWord);
            }
        }
    
        // Slide a window over the document and remember the start of the highest-scoring one
        var windowStart = 0;
        double maxScore = 0;
        var maxWindowStart = 0;
    
        // Higher number less accurate but faster
        const int resolution = 5;
    
        while (true)
        {
            var text = documentText.Substring(windowStart, snippetLength);
    
            // Get score of this chunk
            // This isn't perfect: because the window moves in steps of resolution, the first and last words may be cut off.
            // Could probably be improved by iterating over words rather than characters.
            var words = text.Split(' ').Select(c => c.Trim().ToLower());
            double score = 0;
            foreach (var word in words)
            {
                if (wordsInQuery.Contains(word))
                {
                    // The longer the word, the more important.
                    // Can simply replace with score += 1 for simpler model.
                    score += Math.Pow(word.Length, 2);
                }                   
            }
            if (score > maxScore)
            {
                maxScore = score;
                maxWindowStart = windowStart;
            }
    
            // Setup next iteration
            windowStart += resolution;
    
            // Stop once the window would run past the end of the document
            // (a window ending exactly at the document end is still valid)
            if (windowStart + snippetLength > documentText.Length)
            {
                break;
            }
        }
    
        return maxWindowStart;
    }
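
    The comment in the code above suggests iterating over words instead of characters. A minimal sketch of that idea follows, assuming words are separated by single spaces as in the original and that the query has already been reduced to a set of lower-cased words. The names WordAlignedSnippet and GetWordAlignedStart are hypothetical; this is one possible way to do it, not a drop-in replacement.

```csharp
using System;
using System.Collections.Generic;

static class WordAlignedSnippet
{
    // Hypothetical variant: windows start on word boundaries, and only words that
    // fit completely inside the window are scored.
    public static int GetWordAlignedStart(string text, ISet<string> queryWords, int snippetLength)
    {
        text = text.Trim();
        if (text.Length <= snippetLength) return 0;

        // Record start, end (exclusive) and score for every space-separated word.
        var words = new List<(int Start, int End, double Score)>();
        var i = 0;
        while (i < text.Length)
        {
            if (text[i] == ' ') { i++; continue; }
            var start = i;
            while (i < text.Length && text[i] != ' ') i++;
            var word = text.Substring(start, i - start).ToLower();
            var score = queryWords.Contains(word) ? Math.Pow(word.Length, 2) : 0;
            words.Add((start, i, score));
        }

        // Two-pointer sweep: 'left' is the first word in the window,
        // 'right' is the first word that no longer fits.
        double best = 0, sum = 0;
        int bestStart = 0, right = 0;
        for (var left = 0; left < words.Count; left++)
        {
            if (right < left) { right = left; sum = 0; }
            var windowEnd = words[left].Start + snippetLength;
            while (right < words.Count && words[right].End <= windowEnd)
            {
                sum += words[right].Score;
                right++;
            }
            if (sum > best) { best = sum; bestStart = words[left].Start; }

            // Drop the leftmost word before the window moves on.
            if (right > left) sum -= words[left].Score;
        }

        // Keep Substring(bestStart, snippetLength) inside the text.
        return Math.Min(bestStart, text.Length - snippetLength);
    }
}
```

    Because every word is visited a bounded number of times, this does not need a resolution setting, though it has not been benchmarked against the character-stepping version.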
    

    There is a lot more you could add to this. For example, instead of comparing exact words only, you might also compare their SOUNDEX codes and weight a Soundex match lower than an exact match.
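
    A rough sketch of what Soundex-weighted scoring might look like is below. Everything in it is an assumption for illustration: the PhoneticScoring class, the ScoreWord helper, the 0.5 weight, and the use of a hand-rolled American Soundex (SQL Server's SOUNDEX function may differ in edge cases).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class PhoneticScoring
{
    // Simplified American Soundex: first letter plus up to three digits, zero-padded.
    public static string Soundex(string word)
    {
        const string codes = "01230120022455012623010202"; // digit for each of a..z
        var sb = new StringBuilder();
        var lastCode = '0';
        foreach (var ch in word.ToLowerInvariant())
        {
            if (ch < 'a' || ch > 'z') continue; // ignore non-letters
            var code = codes[ch - 'a'];
            if (sb.Length == 0)
            {
                sb.Append(char.ToUpperInvariant(ch));
                lastCode = code;
                continue;
            }
            // 'h' and 'w' do not separate letters with the same code; vowels do.
            if (ch == 'h' || ch == 'w') continue;
            if (code != '0' && code != lastCode) sb.Append(code);
            lastCode = code;
            if (sb.Length == 4) break;
        }
        return sb.Length == 0 ? "" : sb.ToString().PadRight(4, '0');
    }

    // Hypothetical scoring: full weight for an exact match,
    // reduced weight when only the Soundex codes match.
    public static double ScoreWord(string word, ISet<string> queryWords,
                                   ISet<string> querySoundex, double phoneticWeight = 0.5)
    {
        var baseScore = Math.Pow(word.Length, 2);
        if (queryWords.Contains(word)) return baseScore;

        var code = Soundex(word);
        if (code.Length > 0 && querySoundex.Contains(code)) return baseScore * phoneticWeight;
        return 0;
    }
}
```

    To use it you would precompute the Soundex code of each query word once, then call ScoreWord inside the scoring loop in place of the exact-match check.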
