I have a bunch of SKUs (stock keeping units) that represent a series of strings, and I'd like to create a single regex that matches all of them.
So, for example, if I have SKUs
This is what I finally worked out:
var skus = new[] { "BATPAG003", "BATTWLP03", "BATTWLP04", "BATTWSP04", "SPIFATB01" };
Func<IEnumerable<IGrouping<string, string>>, IEnumerable<string>> regexify = null;
Func<IEnumerable<string>, IEnumerable<string>> generate =
    xs =>
        from n in Enumerable.Range(2, 20)
        let g = xs.GroupBy(x => new String(x.Take(n).ToArray()), x => new String(x.Skip(n).ToArray()))
        where g.Count() != xs.Count()
        from r in regexify(g)
        select r;
regexify = gxs =>
{
    if (!gxs.Any())
    {
        return new [] { "" };
    }
    else
    {
        var rs = regexify(gxs.Skip(1)).ToArray();
        return
            from f in gxs.Take(1)
            from z in new [] { String.Join("|", f) }.Concat(f.Count() > 1 ? generate(f) : Enumerable.Empty<string>())
            from r in rs
            select f.Key + (f.Count() == 1 ? z : $"({z})") + (r != "" ? "|" + r : "");
    }
};
Then using this query:
generate(skus).OrderBy(x => x.Length).ThenBy(x => x);
...I got this result:
BAT(PAG003|TW(LP0(3|4)|SP04))|SPIFATB01
BAT(PAG003|TWLP0(3|4)|TWSP04)|SPIFATB01
BA(TPAG003|TTW(LP0(3|4)|SP04))|SPIFATB01
BAT(PAG003|TW(LP(03|04)|SP04))|SPIFATB01
BAT(PAG003|TW(LP03|LP04|SP04))|SPIFATB01
BAT(PAG003|TWLP(03|04)|TWSP04)|SPIFATB01
BATPAG003|BATTW(LP0(3|4)|SP04)|SPIFATB01
BA(TPAG003|TT(WLP0(3|4)|WSP04))|SPIFATB01
BA(TPAG003|TTW(LP(03|04)|SP04))|SPIFATB01
BA(TPAG003|TTW(LP03|LP04|SP04))|SPIFATB01
BA(TPAG003|TTWLP0(3|4)|TTWSP04)|SPIFATB01
BAT(PAG003|TWL(P0(3|4))|TWSP04)|SPIFATB01
BAT(PAG003|TWL(P03|P04)|TWSP04)|SPIFATB01
BATPAG003|BATT(WLP0(3|4)|WSP04)|SPIFATB01
BATPAG003|BATTW(LP(03|04)|SP04)|SPIFATB01
BATPAG003|BATTW(LP03|LP04|SP04)|SPIFATB01
BA(TPAG003|TT(WLP(03|04)|WSP04))|SPIFATB01
BA(TPAG003|TTWLP(03|04)|TTWSP04)|SPIFATB01
BAT(PAG003|TWLP03|TWLP04|TWSP04)|SPIFATB01
BATPAG003|BATT(WLP(03|04)|WSP04)|SPIFATB01
BA(TPAG003|TT(WL(P0(3|4))|WSP04))|SPIFATB01
BA(TPAG003|TT(WL(P03|P04)|WSP04))|SPIFATB01
BA(TPAG003|TT(WLP03|WLP04|WSP04))|SPIFATB01
BA(TPAG003|TTWL(P0(3|4))|TTWSP04)|SPIFATB01
BA(TPAG003|TTWL(P03|P04)|TTWSP04)|SPIFATB01
BATPAG003|BATT(WL(P0(3|4))|WSP04)|SPIFATB01
BATPAG003|BATT(WL(P03|P04)|WSP04)|SPIFATB01
BATPAG003|BATT(WLP03|WLP04|WSP04)|SPIFATB01
BATPAG003|BATTWLP0(3|4)|BATTWSP04|SPIFATB01
BATPAG003|BATTWLP(03|04)|BATTWSP04|SPIFATB01
BA(TPAG003|TTWLP03|TTWLP04|TTWSP04)|SPIFATB01
BATPAG003|BATTWL(P0(3|4))|BATTWSP04|SPIFATB01
BATPAG003|BATTWL(P03|P04)|BATTWSP04|SPIFATB01
The only problem with my approach was computation time. Some of my source lists have nearly 100 SKUs. Some runs took longer than I cared to wait, so I had to break the list into smaller chunks and then manually concatenate the results.
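Since the chunks end up stitched together by hand, it may be worth checking a candidate pattern against the original list before using it. Here is a minimal sketch of such a check. `SkuPatternCheck`, `Anchor` and `MatchesAll` are names made up for this example, not part of the code above. The one detail that matters is the `^(?:...)$` wrapper: the generated patterns have a top-level `|`, so without the group the anchors would bind only to the first and last alternatives.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SkuPatternCheck
{
    // Hypothetical helper: anchors a generated pattern so it must match a whole SKU.
    public static Regex Anchor(string pattern)
    {
        // The non-capturing group makes ^ and $ apply to every alternative.
        return new Regex("^(?:" + pattern + ")$");
    }

    // Hypothetical helper: true when the anchored pattern matches every SKU in the list.
    public static bool MatchesAll(string pattern, string[] skus)
    {
        var regex = Anchor(pattern);
        return skus.All(s => regex.IsMatch(s));
    }
}
```

For the five sample SKUs, `SkuPatternCheck.MatchesAll("BAT(PAG003|TW(LP0(3|4)|SP04))|SPIFATB01", skus)` should return `true`, while the anchored pattern should reject a near miss such as `BATTWLP05`.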
This works if every SKU ID has the same length.
// ...
string regexStr = Calculate(skus);
// ...
public static string Calculate(IEnumerable<string> rest) {
    if (rest.First().Length > 0) {
        string[] groups = rest.GroupBy(r => r[0])
                              .Select(g => g.Key + Calculate(g.Select(e => e.Substring(1))))
                              .ToArray();
        return groups.Length > 1 ? "(" + string.Join("|", groups) + ")" : groups[0];
    } else {
        return string.Empty;
    }
}
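If the SKUs are not all the same length, one possible way around the restriction is to build an explicit prefix tree and mark the nodes where a SKU ends. The sketch below is my own variation, not the code above: `SkuTrieRegex`, `Node`, `Build` and `Emit` are hypothetical names. It wraps the remaining branches in an optional group when a shorter SKU ends at that node, uses non-capturing groups, and passes each character through `Regex.Escape` in case a SKU contains regex metacharacters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SkuTrieRegex
{
    // Hypothetical trie node; IsEnd marks that a complete SKU ends here.
    private sealed class Node
    {
        public readonly SortedDictionary<char, Node> Children =
            new SortedDictionary<char, Node>();
        public bool IsEnd;
    }

    // Builds an unanchored pattern that matches exactly the given strings.
    public static string Build(IEnumerable<string> skus)
    {
        var root = new Node();
        foreach (var sku in skus)
        {
            var node = root;
            foreach (var c in sku)
            {
                if (!node.Children.TryGetValue(c, out var next))
                {
                    next = new Node();
                    node.Children[c] = next;
                }
                node = next;
            }
            node.IsEnd = true;
        }
        return Emit(root);
    }

    private static string Emit(Node node)
    {
        // A leaf contributes nothing further.
        if (node.Children.Count == 0) return string.Empty;

        var branches = node.Children
            .Select(kv => Regex.Escape(kv.Key.ToString()) + Emit(kv.Value))
            .ToArray();

        // A SKU ends here but longer ones continue: the rest is optional.
        if (node.IsEnd)
            return "(?:" + string.Join("|", branches) + ")?";

        // Group only when there is a real choice between branches.
        return branches.Length > 1
            ? "(?:" + string.Join("|", branches) + ")"
            : branches[0];
    }
}
```

For example, `SkuTrieRegex.Build(new[] { "AB1", "AB12", "C" })` gives `(?:AB1(?:2)?|C)`. The result is unanchored, so wrap it as `^(?:...)$` if whole-string matches are needed.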
Take the entire list of all of your SKUs and make a single ternary tree regex.
When you add or delete SKUs, regenerate the regex. Perhaps your database
regenerates it on a weekly basis.
This utility makes a regex of 10,000 strings in less than half a second,
and size is not a concern; it could be 300,000 strings.
For example, here is a regex for a 175,000-word dictionary.