Generating the Shortest Regex Dynamically from a source List of Strings

一整个雨季 2021-01-04 10:13

I have a set of SKUs (stock keeping units), each of which is a string, and I'd like to generate a single regex that matches all of them.

So, for example, if I have SKUs

3 Answers
  • 2021-01-04 11:12

    This is what I finally worked out:

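    // Requires: using System; using System.Collections.Generic; using System.Linq;
    var skus = new[] { "BATPAG003", "BATTWLP03", "BATTWLP04", "BATTWSP04", "SPIFATB01" };
    
    // Declared first (as null) because generate and regexify call each other recursively.
    Func<IEnumerable<IGrouping<string, string>>, IEnumerable<string>> regexify = null;
    
    // For each prefix length n, group the strings by their first n characters, keeping the
    // remainders as the group elements. Skip any n where no two strings share a prefix.
    Func<IEnumerable<string>, IEnumerable<string>> generate =
        xs =>
            from n in Enumerable.Range(2, 20) // prefix lengths 2 through 21
            let g = xs.GroupBy(x => new String(x.Take(n).ToArray()), x => new String(x.Skip(n).ToArray()))
            where g.Count() != xs.Count()
            from r in regexify(g)
            select r;
    
    regexify = gxs =>
    {
        if (!gxs.Any())
        {
            // Base case: no groups left, so there is nothing to append.
            return new [] { "" };
        }
        else
        {
            // Every candidate regex for the groups after the first one.
            var rs = regexify(gxs.Skip(1)).ToArray();
            return
                from f in gxs.Take(1)
                // Candidates for the first group's remainders: the flat a|b|c join, plus
                // (when the group has several members) every factored form from generate.
                from z in new [] { String.Join("|", f) }.Concat(f.Count() > 1 ? generate(f) : Enumerable.Empty<string>())
                from r in rs
                // Prefix + remainders (parenthesized if the group has several members) + the rest.
                select f.Key + (f.Count() == 1 ? z : $"({z})") + (r != "" ? "|" + r : "");
        }
    };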
    

    Then using this query:

    generate(skus).OrderBy(x => x.Length).ThenBy(x => x);
    

    ...I got this result:

    BAT(PAG003|TW(LP0(3|4)|SP04))|SPIFATB01 
    BAT(PAG003|TWLP0(3|4)|TWSP04)|SPIFATB01 
    BA(TPAG003|TTW(LP0(3|4)|SP04))|SPIFATB01 
    BAT(PAG003|TW(LP(03|04)|SP04))|SPIFATB01 
    BAT(PAG003|TW(LP03|LP04|SP04))|SPIFATB01 
    BAT(PAG003|TWLP(03|04)|TWSP04)|SPIFATB01 
    BATPAG003|BATTW(LP0(3|4)|SP04)|SPIFATB01 
    BA(TPAG003|TT(WLP0(3|4)|WSP04))|SPIFATB01 
    BA(TPAG003|TTW(LP(03|04)|SP04))|SPIFATB01 
    BA(TPAG003|TTW(LP03|LP04|SP04))|SPIFATB01 
    BA(TPAG003|TTWLP0(3|4)|TTWSP04)|SPIFATB01 
    BAT(PAG003|TWL(P0(3|4))|TWSP04)|SPIFATB01 
    BAT(PAG003|TWL(P03|P04)|TWSP04)|SPIFATB01 
    BATPAG003|BATT(WLP0(3|4)|WSP04)|SPIFATB01 
    BATPAG003|BATTW(LP(03|04)|SP04)|SPIFATB01 
    BATPAG003|BATTW(LP03|LP04|SP04)|SPIFATB01 
    BA(TPAG003|TT(WLP(03|04)|WSP04))|SPIFATB01 
    BA(TPAG003|TTWLP(03|04)|TTWSP04)|SPIFATB01 
    BAT(PAG003|TWLP03|TWLP04|TWSP04)|SPIFATB01 
    BATPAG003|BATT(WLP(03|04)|WSP04)|SPIFATB01 
    BA(TPAG003|TT(WL(P0(3|4))|WSP04))|SPIFATB01 
    BA(TPAG003|TT(WL(P03|P04)|WSP04))|SPIFATB01 
    BA(TPAG003|TT(WLP03|WLP04|WSP04))|SPIFATB01 
    BA(TPAG003|TTWL(P0(3|4))|TTWSP04)|SPIFATB01 
    BA(TPAG003|TTWL(P03|P04)|TTWSP04)|SPIFATB01 
    BATPAG003|BATT(WL(P0(3|4))|WSP04)|SPIFATB01 
    BATPAG003|BATT(WL(P03|P04)|WSP04)|SPIFATB01 
    BATPAG003|BATT(WLP03|WLP04|WSP04)|SPIFATB01 
    BATPAG003|BATTWLP0(3|4)|BATTWSP04|SPIFATB01 
    BATPAG003|BATTWLP(03|04)|BATTWSP04|SPIFATB01 
    BA(TPAG003|TTWLP03|TTWLP04|TTWSP04)|SPIFATB01 
    BATPAG003|BATTWL(P0(3|4))|BATTWSP04|SPIFATB01 
    BATPAG003|BATTWL(P03|P04)|BATTWSP04|SPIFATB01 
    

    The only problem with my approach was computation time. Some of my source lists have nearly 100 SKUs, and some of the runs took longer than I cared to wait, so I had to break the lists into smaller chunks and then concatenate the resulting regexes manually.
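
    Because the chunked patterns were joined by hand, it may be worth checking the combined pattern against the source list. The following is a minimal sketch, not part of the original answer: `RegexCheck` and `MatchesExactly` are hypothetical names, and the pattern is anchored with `^(?:...)$` on the assumption that whole-string matches are wanted (an unanchored alternation would also match substrings).

    ```csharp
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class RegexCheck
    {
        // Hypothetical helper: returns true when the anchored pattern matches every
        // string in accept and none of the strings in reject.
        public static bool MatchesExactly(string pattern, string[] accept, string[] reject)
        {
            // Wrap the pattern in a non-capturing group so that ^ and $ apply to the
            // whole alternation rather than only its first and last branches.
            var regex = new Regex("^(?:" + pattern + ")$");
            return accept.All(s => regex.IsMatch(s)) && !reject.Any(s => regex.IsMatch(s));
        }
    }
    ```

    Calling `RegexCheck.MatchesExactly(pattern, skus, nearMisses)` with a few deliberately wrong SKUs in `nearMisses` (for example `"BATTWLP05"`) gives a quick sanity check before a pattern is put to use.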

  • 2021-01-04 11:15

    This works if every SKU ID has the same length.

    // ...
    string regexStr = Calculate(skus);
    // ...
    
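    // Requires: using System.Collections.Generic; using System.Linq;
    public static string Calculate(IEnumerable<string> rest) {
        // All strings have the same length, so checking the first one is enough.
        if (rest.First().Length > 0) {
            // Group by first character, then recurse on the remainders of each group.
            string[] groups = rest.GroupBy(r => r[0])
                .Select(g => g.Key + Calculate(g.Select(e => e.Substring(1))))
                .ToArray();
            // Parenthesize only when there is more than one alternative.
            return groups.Length > 1 ? "(" + string.Join("|", groups) + ")" : groups[0];
        } else {
            // Every string is exhausted, so there is nothing left to emit.
            return string.Empty;
        }
    }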
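
    If the SKUs can differ in length, the same recursion can be extended so that a string ending partway through a group makes the rest of that group optional. This is only a sketch under that assumption, not the answer's own code: `SkuRegex.Build` is a hypothetical name, and the use of non-capturing groups and `Regex.Escape` (for SKUs containing regex metacharacters) is an added design choice.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class SkuRegex
    {
        // Hypothetical helper: builds a prefix-factored pattern for strings of any length.
        public static string Build(IEnumerable<string> items)
        {
            var list = items.Distinct().ToList();

            // An empty string here means that some SKU ends at this position.
            bool endsHere = list.Contains(string.Empty);

            // Group the remaining strings by first character and recurse on the remainders.
            string[] branches = list
                .Where(s => s.Length > 0)
                .GroupBy(s => s[0])
                .Select(g => Regex.Escape(g.Key.ToString()) + Build(g.Select(s => s.Substring(1))))
                .ToArray();

            if (branches.Length == 0)
            {
                return string.Empty;
            }

            if (branches.Length > 1)
            {
                // Several alternatives: group them, and mark the group optional
                // when a shorter SKU already ends here.
                string group = "(?:" + string.Join("|", branches) + ")";
                return endsHere ? group + "?" : group;
            }

            // Single continuation: it needs a group only if it has to be optional.
            return endsHere ? "(?:" + branches[0] + ")?" : branches[0];
        }
    }
    ```

    For example, `SkuRegex.Build(new[] { "AB", "ABC", "ABD", "X" })` produces `(?:AB(?:C|D)?|X)`. As with `Calculate`, the result is unanchored, so wrap it in `^...$` if whole-string matches are required.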
    
  • 2021-01-04 11:19

    Take the entire list of your SKUs and build a single ternary tree regex.
    When you add or delete SKUs, regenerate the regex. Perhaps your database
    could regenerate it on a weekly basis.

    This utility builds a regex from 10,000 strings in less than half a second,
    and size is not a concern: it could be 300,000 strings.

    For example, here is a regex of a 175,000-word dictionary.
