Forum tags. What is the best way to implement them?

蓝咒 提交于 2019-12-01 12:39:29

While using "only" regex is probably possible using balancing groups, it's pretty heavy voodoo, and it's intrinsecally "fragile". What I propose is using regexes to find open/close tags (without trying to associate the close with the open), mark and collect them in a collection (a stack probably) and then parse them "by hand" (with a foreach). In this way you have the best of both world: the searching of tags by regex and the handling of them (and of wrongly written ones) by hand.

class TagMatch
{
    public string Tag { get; set; }
    public Capture Capture { get; set; }
    public readonly List<string> Substrings = new List<string>();
}

static void Main(string[] args)
{
    var rx = new Regex(@"(?<OPEN>\[[A-Za-z]+?\])|(?<CLOSE>\[/[A-Za-z]+?\])|(?<TEXT>[^\[]+|\[)");
    var str = "Lorem [AA]ipsum [BB]dolor sit [/BB]amet, [ consectetur ][/AA]adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";
    var matches = rx.Matches(str);

    var recurse = new Stack<TagMatch>();
    recurse.Push(new TagMatch { Tag = String.Empty });

    foreach (Match match in matches)
    {
        var text = match.Groups["TEXT"];

        TagMatch last;

        if (text.Success)
        {
            last = recurse.Peek();
            last.Substrings.Add(text.Value);
            continue;
        }

        var open = match.Groups["OPEN"];

        string tag;

        if (open.Success)
        {
            tag = open.Value.Substring(1, open.Value.Length - 2);
            recurse.Push(new TagMatch { Tag = tag, Capture = open.Captures[0] });
            continue;
        }

        var close = match.Groups["CLOSE"];

        tag = close.Value.Substring(2, close.Value.Length - 3);

        last = recurse.Peek();

        if (last.Tag == tag)
        {
            recurse.Pop();

            var lastLast = recurse.Peek();
            lastLast.Substrings.Add("**" + last.Tag + "**");
            lastLast.Substrings.AddRange(last.Substrings);
            lastLast.Substrings.Add("**/" + last.Tag + "**");
        }
        else
        {
            throw new Exception();
        }
    }

    if (recurse.Count != 1)
    {
        throw new Exception();
    }

    var sb = new StringBuilder();
    foreach (var str2 in recurse.Pop().Substrings)
    {
        sb.Append(str2);
    }

    var str3 = sb.ToString();
}

This is an example. It's case sensitive (but it's easy to solve this problem). It doesn't handle "unpaired" tags, because there are various ways to handle them. Where you find a "throw new Exception" you'll have to add your handling. Clearly this isn't a "drop in" solution. It's only an example. By this logic, I won't respond to questions like "the compiler tells me I need a namespace" or "the compiler can't find Regex". BUT I will be more-than-happy to respond to "advanced" questions, like how could unpaired tags be matched, or how could you add support for [AAA=bbb] tags

(2nd BIG EDIT)

Bwahahahah! I DID know groupings were the way to do it!

// Some classes

class BaseTagMatch {
    public Capture Capture;

    public override string ToString()
    {
        return String.Format("{1}: {2} [{0}]", GetType(), Capture.Index, Capture.Value.ToString());
    }
}

class BeginTag : BaseTagMatch
{
    public int Index;
    public Capture Options;
    public EndTag EndTag;
}

class EndTag : BaseTagMatch {
    public int Index;
    public BeginTag BeginTag;
}

class Text : BaseTagMatch
{
}

class UnmatchedTag : BaseTagMatch
{
}

// The code

var pattern =
    @"(?# line 01) ^" +
    @"(?# line 02) (" +
    // Non [ Text
    @"(?# line 03)   (?>(?<TEXT>[^\[]+))" +
    @"(?# line 04)   |" +
    // Immediately closed tag [a/]
    @"(?# line 05)   (?>\[  (?<TAG>  [A-Z]+  )  \x20*  =?  \x20*  (?<TAG_OPTION>(  (?<=  =  \x20*)  (  (?!  \x20*  /\])  [^\[\]\r\n]  )*  )?  )  (?<BEGIN_INNER_TEXT>)(?<END_INNER_TEXT>)  \x20*  /\]  )" +
    @"(?# line 06)   |" +
    // Matched open tag [a]
    @"(?# line 07)   \[  (?<TAG>  (?<OPEN>  [A-Z]+  )  )  \x20* =?  \x20* (?<TAG_OPTION>(  (?<=  =  \x20*)  (  (?!  \x20*  \])  [^\[\]\r\n]  )*  )?  )  \x20*  \]  (?<BEGIN_INNER_TEXT>)" +
    @"(?# line 08)   |" +
    // Matched close tag [/a]
    @"(?# line 09)   (?>(?<END_INNER_TEXT>)  \[/  \k<OPEN>  \x20*  \]  (?<-OPEN>))" +
    @"(?# line 10)   |" +
    // Unmatched open tag [a]
    @"(?# line 11)   (?>(?<UNMATCHED_TAG>  \[  [A-Z]+  \x20* =?  \x20* (  (?<=  =  \x20*)  (  (?!  \x20*  \])  [^\[\]\r\n]  )*  )?  \x20*  \]  )  )" +
    @"(?# line 12)   |" +
    // Unmatched close tag [/a]
    @"(?# line 13)   (?>(?<UNMATCHED_TAG>  \[/  [A-Z]+  \x20*  \]  )  )" +
    @"(?# line 14)   |" +
    // Single [ of Text (unmatched by other patterns)
    @"(?# line 15)   (?>(?<TEXT>\[))" +
    @"(?# line 16) )*" +
    @"(?# line 17) (?(OPEN)(?!))" +
    @"(?# line 18) $";

var rx = new Regex(pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

var match = rx.Match("[div=c:max max]asdf[p = 1   ] a [p=2] [b  =  p/pp   /] [q/] \n[a]sd [/z]  [ [/p]f[/p]asdffds[/DIV] [p][/p]");

////var tags = match.Groups["TAG"].Captures.OfType<Capture>().ToArray();
////var tagoptions = match.Groups["TAG_OPTION"].Captures.OfType<Capture>().ToArray();
////var begininnertext = match.Groups["BEGIN_INNER_TEXT"].Captures.OfType<Capture>().ToArray();
////var endinnertext = match.Groups["END_INNER_TEXT"].Captures.OfType<Capture>().ToArray();
////var text = match.Groups["TEXT"].Captures.OfType<Capture>().ToArray();
////var unmatchedtag = match.Groups["UNMATCHED_TAG"].Captures.OfType<Capture>().ToArray();

var tags = match.Groups["TAG"].Captures.OfType<Capture>().Select((p, ix) => new BeginTag { Capture = p, Index = ix, Options = match.Groups["TAG_OPTION"].Captures[ix] }).ToList();

Func<Capture, int, EndTag> func = (p, ix) =>
{
    var temp = new EndTag { Capture = p, Index = ix, BeginTag = tags[ix] };
    tags[ix].EndTag = temp;
    return temp;
};

var endTags = match.Groups["END_INNER_TEXT"].Captures.OfType<Capture>().Select((p, ix) => func(p, ix));
var text = match.Groups["TEXT"].Captures.OfType<Capture>().Select((p, ix) => new Text { Capture = p });
var unmatchedTags = match.Groups["UNMATCHED_TAG"].Captures.OfType<Capture>().Select((p, ix) => new UnmatchedTag { Capture = p });

// Here you have all the tags and the inner text neatly ordered and ready to be recomposed in a StringBuilder.
var allTags = tags.Cast<BaseTagMatch>().Union(endTags).Union(text).Union(unmatchedTags).ToList();
allTags.Sort((p, q) => p.Capture.Index - q.Capture.Index);

foreach (var el in allTags)
{
    var type = el.GetType();

    if (type == typeof(BeginTag))
    {

    }
    else if (type == typeof(EndTag))
    {

    }
    else if (type == typeof(UnmatchedTag))
    {

    }
    else
    {
        // Text
    }
}

Case insensitive tag matching, ignores tags not correctly closed, supports immediately closed tags ([BR/]). And someone told it wasn't possible with Regex.... Bwahahahahah!

TAG, TAGOPTION, BEGIN_INNER_TEXT and END_INNER_TEXT are matched (they always have the same number of elements). TEXT and UNMATCHED_TAG AREN'T matched! TAG and TAG_OPTION are auto-explicative (both are stripped of useless spaces). BEGIN_INNER_TEXT and END_INNER_TEXT captures are always empty, but you can use their Index property to see where the tags begin/end. UNMATCHED_TAG contains the tags that have been opened but not closed, or closed but not opponed. It doesn't contain tags that are wrong in format (for example [123]).

In the end I take the TAG, END_INNER_TEXT (to see where the tags end), TEXT and UNMATCHED_TAG and sort them by index. Then you can take the allTags, put it in a foreach and for each element test its type. Easy :-) :-)

As a small note, the Regex is RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase. The first two are to make it easier to write and to read, the third one is semanthical. It makes [A] match with [/a].

Necessary readings:

http://www.codeproject.com/KB/recipes/Nested_RegEx_explained.aspx http://www.codeproject.com/KB/recipes/RegEx_Balanced_Grouping.aspx

I'm not sure where regex is going to benefit you. It'd be very basic, but you could just replace [quote] with <div class="quote"> and [/quote] with </div>. The same could be said for all of the other bbcode-style tags you want to allow.

In other words, literally translate them in to the html you want them to represent.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!