Highlight a list of words using a regular expression in C#

鱼传尺愫 2021-01-13 15:56

I have some site content that contains abbreviations. I have a list of recognised abbreviations for the site, along with their explanations. I want to create a regular expre

5 answers
  • 2021-01-13 16:32

    I doubt this will perform better than plain string.Replace calls, so if performance is critical, measure it (after refactoring a bit to use a compiled regex). The regex version looks like this:

    var abbrsWithPipes = "(abbr1|abbr2)";
    var regex = new Regex(abbrsWithPipes);
    return regex.Replace(html, m => GetReplaceForAbbr(m.Value));
    

    You need to implement GetReplaceForAbbr, which receives the specific abbr being matched.
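
    As a rough illustration of what GetReplaceForAbbr could look like, here is a minimal sketch. The Explanations dictionary, its sample entries, the AbbrHighlighter class name, the <abbr> markup and the IgnoreCase option are all assumptions made for the example; they are not part of the answer above. The dictionary uses a case-insensitive comparer so that a match such as "ABBR1" still finds its entry, and the regex is compiled once and reused.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class AbbrHighlighter
{
    // Hypothetical sample data; a real site would load this from its own store.
    private static readonly Dictionary<string, string> Explanations =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "abbr1", "First explanation" },
            { "abbr2", "Second explanation" }
        };

    // Compiled once and reused across calls (IgnoreCase is an assumption).
    private static readonly Regex AbbrRegex =
        new Regex("(abbr1|abbr2)", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Looks up the matched abbreviation and wraps it in markup,
    // HTML-encoding both values before they go into the output.
    public static string GetReplaceForAbbr(string abbr)
    {
        string explanation = Explanations[abbr];
        return string.Format("<abbr title=\"{0}\">{1}</abbr>",
            WebUtility.HtmlEncode(explanation), WebUtility.HtmlEncode(abbr));
    }

    // Replaces every match in a single pass over the input.
    public static string Highlight(string html)
    {
        return AbbrRegex.Replace(html, m => GetReplaceForAbbr(m.Value));
    }
}
```

    Calling AbbrHighlighter.Highlight(html) then returns the content with each recognised abbreviation wrapped in markup. Like the pattern in the answer, this sketch has no word-boundary anchors, so it also matches inside longer words.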

  • 2021-01-13 16:35

    First you would need to Regex.Escape() all the input strings.

    Then you can look for them in the string, and iteratively replace them with the markup you have in mind:

    string abbr       = "memb";
    string word       = "Member";
    // Verbatim string: in a regular C# string literal, "\b" is a backspace character, not a word boundary.
    string pattern    = String.Format(@"\b{0}\b", Regex.Escape(abbr));
    string substitute = String.Format("[a title=\"{0}\"]{1}[/a]", word, abbr);
    string output     = Regex.Replace(input, pattern, substitute);
    

    EDIT: I asked if a simple String.Replace() wouldn't be enough - but I can see why regex is desirable: you can use it to enforce "whole word" replacements only by making a pattern that uses word boundary anchors.

    You can go as far as building a single pattern from all your escaped input strings, like this:

    \b(?:{abbr_1}|{abbr_2}|{abbr_3}|{abbr_n})\b
    

    and then using a match evaluator to find the right replacement. This way you can avoid iterating the input string more than once.
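
    The pattern-building step described above could be sketched roughly as follows. The AbbrPattern class and its Build and FindAll methods are hypothetical names chosen for the example, and FindAll exists only to make the whole-word behaviour easy to check.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AbbrPattern
{
    // Builds \b(?:a|b|...)\b from the given abbreviations, escaping each one
    // so that regex metacharacters in an abbreviation are matched literally.
    public static string Build(IEnumerable<string> abbreviations)
    {
        IEnumerable<string> escaped = abbreviations.Select(a => Regex.Escape(a));
        return @"\b(?:" + string.Join("|", escaped) + @")\b";
    }

    // Returns the whole-word matches found in the input, ignoring case.
    public static List<string> FindAll(string input, IEnumerable<string> abbreviations)
    {
        return Regex.Matches(input, Build(abbreviations), RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

    One caveat with this approach: \b only matches between a word character and a non-word character, so an abbreviation that itself begins or ends with punctuation may not match as expected and would need different anchors.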

  • 2021-01-13 16:36

    For anyone interested, here is my final solution. It is for a .NET user control. It uses a single pattern with a match evaluator, as suggested by Tomalak, so there is no foreach loop. It's an elegant solution, and it gives me the correct output for the sample input while preserving correct casing for matched strings.

    public partial class Abbreviations : System.Web.UI.UserControl
    {
        private Dictionary<String, String> dictionary = DataHelper.GetAbbreviations();
    
        protected void Page_Load(object sender, EventArgs e)
        {
            string input = "This is just a little test of the memb. And another memb, but not amemba to see if it gets picked up. Deb of course should also be caught here.deb!";
    
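            // Escape each key so that regex metacharacters in an abbreviation are matched literally.
            var regex = "\\b(?:" + String.Join("|", dictionary.Keys.Select(k => Regex.Escape(k)).ToArray()) + ")\\b";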
    
            MatchEvaluator myEvaluator = new MatchEvaluator(GetExplanationMarkup);
    
            input = Regex.Replace(input, regex, myEvaluator, RegexOptions.IgnoreCase);
    
            litContent.Text = input;
        }
    
        private string GetExplanationMarkup(Match m)
        {
            // Assumes the dictionary keys are stored in lower case.
            return string.Format("<b title='{0}'>{1}</b>", dictionary[m.Value.ToLowerInvariant()], m.Value);
        }
    }
    

    The output looks like this (below). Note that it only matches full words, and that the casing is preserved from the original string:

    This is just a little test of the <b title='Member'>memb</b>. And another <b title='Member'>memb</b>, but not amemba to see if it gets picked up. <b title='Debut'>Deb</b> of course should also be caught here.<b title='Debut'>deb</b>!
    
  • 2021-01-13 16:36

    I'm doing pretty much exactly what you're looking for in my application, and this works for me. The parameter str is your content:

    public static string GetGlossaryString(string str)
    {
        // This collection would contain your abbreviations. You could make it a
        // Dictionary of abbreviation/full-term pairs and use those in the loop below.
        List<string> glossaryWords = GetGlossaryItems();

        // Quick and dirty way to also match the first and last word in the content.
        str = string.Format(" {0} ", str);

        foreach (string word in glossaryWords)
            // Escape the word so regex metacharacters in it are matched literally.
            str = Regex.Replace(str, "([\\W])(" + Regex.Escape(word) + ")([\\W])", "$1<span class='glossaryItem'>$2</span>$3", RegexOptions.IgnoreCase);

        return str.Trim();
    }
    
  • 2021-01-13 16:40

    Not sure how well this will scale to a big word list, but I think it should give the output you want (although in your question the 'result' seems identical to 'content')?

    Anyway, let me know if this is what you're after

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;
    
    namespace ConsoleApplication1
    {
        class Program
        {
            static void Main(string[] args)
            {
                var input = @"This is just a little test of the memb to see if it gets picked up. 
    Deb of course should also be caught here.";
                var dictionary = new Dictionary<string,string>
                {
                    {"memb", "Member"}
                    ,{"deb","Debut"}
                };
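                // Escape each key and anchor on word boundaries so that only whole words match.
                var regex = @"\b(?:" + String.Join("|", dictionary.Keys.Select(k => Regex.Escape(k)).ToArray()) + @")\b";
                // Replace every match in a single pass with its explanation from the dictionary
                // (the keys are lower case, so the matched text is lower-cased for the lookup).
                input = Regex.Replace(input
                   , regex  /*@"\b(?:memb|deb)\b"*/
                   , m => dictionary[m.Value.ToLowerInvariant()]
                   , RegexOptions.IgnoreCase);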
                Console.Write (input);
                Console.ReadLine();
            }
        }
    }
    