Grammatical inference of regular expressions for a given finite list of representative strings?
I'm analyzing a large public dataset with lots of verbose human-readable strings that were clearly generated by some regular (in the formal language theory sense) grammar. It's not too hard to look at sets of these strings one by one to see the patterns; unfortunately, there are about 24,000 of these unique strings broken up into 33 categories and 1,714 subcategories, so it's somewhat painful to do this manually. Basically, I'm looking for an existing algorithm (preferably with an existing reference implementation) to take an arbitrary list of strings and try to infer some minimal