Find a keyword in text when the keyword matches certain conditions - C#

孤城傲影 2020-12-20 05:52

I'm looking for a nice way to do the following:

I have an article which has HTML tags in it like anchors and paragraphs and so on.
I also have keyword which i

1 Answer
  • 2020-12-20 06:40

    I managed to get it done!

    Many thanks to this post, which helped me a lot with the XPath expression: http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/beae72d6-844f-4a9b-ad56-82869d685037/

    My task was to add X keyword links to the article, using a table of keywords and URLs in my database.
    Once a keyword has been matched, it is not searched for again; the code moves on to the next keyword in the text.
    A keyword can consist of more than one word, which is why I added Replace(" ", "\s+").
    I also had to give precedence to the longest keywords. That is, if I had
    "good day" and "good" as two different keywords, "good day" always wins.

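    The two rules above (longest keyword first, spaces widened to \s+) can be tried in isolation. The following is only a minimal sketch: the class name, the sample keywords and the sample text are made up for illustration, and it assumes the keywords contain no regex metacharacters.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordPatternDemo
{
    // Builds a pattern in the style described above: optional leading word
    // characters, the keyword with each space widened to \s+, and \b on both ends.
    // Assumption: the keyword contains no regex metacharacters.
    public static string BuildPattern(string keyword)
    {
        return @"\b\w*" + keyword.Trim().Replace(" ", @"\s+") + @"\b";
    }

    public static void Main()
    {
        string[] keywords = { "good", "good day" };              // hypothetical sample data
        string text = "Have a good   day, and a good evening.";  // note the run of spaces

        // Longest keyword first, so "good day" is tried before "good".
        foreach (string keyword in keywords.OrderByDescending(k => k.Length))
        {
            Match match = Regex.Match(text, BuildPattern(keyword));
            Console.WriteLine("{0} -> {1}", keyword, match.Success ? match.Value : "(no match)");
        }
    }
}
```

    Because each space becomes \s+, "good day" still matches when the article has several spaces or a line break between the two words.
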
    This is my solution:

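    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    // The enclosing class declaration is missing from the original post; "ArticleLinker" is a placeholder name.
    public static class ArticleLinker
    {
        public static string AddLinksToArticle(string article, int linksToAdd)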
        {
            try
            {
                // load keywords and URLs
                var dt = new DAL().GetArticleLinks();
    
                // sort by keyword length, longest first
                IEnumerable<ArticlesRow> sortedArticles = dt.OrderBy(row => row.keyword, new StringLengthComparer());
    
                // iterate the sorted rows, replacing each keyword with an anchor
                foreach (var item in sortedArticles)
                {
                    article = FindAndReplaceKeywordWithAnchor(article, item.keyword, item.url, ref linksToAdd);
                    if (linksToAdd == 0)
                    {
                        break;
                    }
                }
    
                return article;
            }
            catch (Exception ex)
            {
                Utils.LogErrorAdmin(ex);
                return null;
            }
        }
    
        private static string FindAndReplaceKeywordWithAnchor(string article, string keyword, string url, ref int linksToAdd)
        {
            // parse the article as HTML
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(article);
    
            // \b  - word boundaries around the whole match
            // \w* - allows word characters directly before the keyword
            // \s+ - stands in for each space, so any run of whitespace between words matches
            // Regex.Escape keeps keywords such as "C++" from being read as regex syntax;
            // it also escapes spaces as "\ ", which is why that sequence is what gets replaced.
            string pattern = @"\b\w*" + Regex.Escape(keyword.Trim()).Replace(@"\ ", @"\s+") + @"\b";
            Regex regex = new Regex(pattern);
    
            // select every text node that is not already inside an anchor element
            var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]") ?? new HtmlAgilityPack.HtmlNodeCollection(null);
            foreach (var node in nodes)
            {
                // test with the regex itself, so linksToAdd only drops when a replacement really happens
                if (regex.IsMatch(node.InnerHtml))
                {
                    node.InnerHtml = regex.Replace(node.InnerHtml, "<a href=\"" + url + "\">" + keyword + "</a>", 1); // replace only the first occurrence
                    linksToAdd--;
                    break;
                }
            }
    
            return doc.DocumentNode.OuterHtml;
        }
    }
    
    public class StringLengthComparer : IComparer<string>
    {
        public int Compare(string x, string y)
        {
            return y.Length.CompareTo(x.Length);
        }
    }
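
    To see what the XPath expression //text()[not(ancestor::a)] selects, here is a small sketch. It swaps HtmlAgilityPack for the built-in System.Xml classes so that it runs without extra packages, which means it only accepts well-formed markup; the class name and the sample fragment are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class TextNodeXPathDemo
{
    // Returns the values of all text nodes that are not inside an <a> element.
    public static List<string> TextOutsideAnchors(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws on malformed markup, unlike HtmlAgilityPack

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//text()[not(ancestor::a)]"))
        {
            result.Add(node.Value);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample: one keyword in plain text, one inside an existing link.
        string sample = "<div><p>good day <a href=\"/x\">good link</a> bye</p></div>";
        foreach (string text in TextOutsideAnchors(sample))
        {
            Console.WriteLine("[" + text + "]");
        }
    }
}
```

    The text inside the existing anchor is never returned, which is what keeps the replacement from nesting a link inside another link, and from linking "good" again after "good day" has already been wrapped in an anchor.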
    

    Hope it will help someone in the future.
