Grammatical inference of regular expressions for a given finite list of representative strings?

Submitted by 删除回忆录丶 on 2019-11-27 13:11:38
Stephen Lin

Yes, it turns out this does exist. What is required is known academically as a DFA learning algorithm, examples of which include:

  • Angluin's L*
  • L* (adding counter-examples to columns)
  • Kearns / Vazirani
  • Rivest / Schapire
  • NL*
  • Regular positive negative inference (RPNI)
  • DeLeTe2
  • Biermann & Feldman's algorithm
  • Biermann & Feldman's algorithm (using SAT-solving)

Source for the above is libalf, an open-source automata learning algorithm framework in C++; descriptions of at least some of these algorithms can be found in this textbook, among others. There are also implementations of grammatical inference algorithms (including DFA learning) in gitoolbox for MATLAB.
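To make one entry in the list more concrete, here is a minimal sketch of the state-merging idea behind RPNI, written in plain Python. It is not libalf's or gitoolbox's implementation, and the function names (`build_pta`, `quotient`, `accepts`, `rpni`) are made up for this illustration. It assumes you have both positive and negative sample strings: it builds a prefix tree acceptor from the positives, then tries to merge each state into an earlier one and keeps a merge only if no negative sample becomes accepted.

```python
def build_pta(positives):
    """Prefix tree acceptor: one state per prefix of a positive sample.
    States are numbered in length-then-lexicographic order; state 0 is the start."""
    prefixes = sorted({s[:i] for s in positives for i in range(len(s) + 1)},
                      key=lambda p: (len(p), p))
    ids = {p: n for n, p in enumerate(prefixes)}
    delta = {(ids[p[:-1]], p[-1]): ids[p] for p in prefixes if p}
    accepting = {ids[s] for s in positives}
    return len(prefixes), delta, accepting


def find(parent, x):
    """Representative (smallest member) of the merged class containing x."""
    while parent[x] != x:
        x = parent[x]
    return x


def quotient(parent, delta):
    """Keep merging target classes until the merged automaton is deterministic,
    then return its transition table. Mutates `parent`."""
    while True:
        q, clash = {}, None
        for (s, a), t in delta.items():
            key, t = (find(parent, s), a), find(parent, t)
            if key in q and q[key] != t:
                clash = (q[key], t)
                break
            q[key] = t
        if clash is None:
            return q
        lo, hi = sorted(clash)
        parent[hi] = lo  # the smaller state stays the representative


def accepts(q, accepting, start, word):
    """Run `word` through transition table `q`; a missing transition rejects."""
    state = start
    for ch in word:
        if (state, ch) not in q:
            return False
        state = q[(state, ch)]
    return state in accepting


def rpni(positives, negatives):
    """Simplified RPNI-style learner: returns (transitions, accepting states)."""
    n, delta, accepting = build_pta(positives)
    parent = list(range(n))
    for i in range(1, n):
        if find(parent, i) != i:
            continue  # already merged into an earlier state
        for j in range(i):
            if find(parent, j) != j:
                continue
            trial = parent[:]
            trial[i] = j
            q = quotient(trial, delta)
            acc = {find(trial, s) for s in accepting}
            if not any(accepts(q, acc, 0, w) for w in negatives):
                parent = trial  # merge is consistent with the negatives
                break
    q = quotient(parent, delta)
    return q, {find(parent, s) for s in accepting}


# Samples chosen to suggest the language (ab)*
dfa, acc = rpni(["", "ab", "abab"], ["a", "b", "aba", "ba"])
print(accepts(dfa, acc, 0, "ababab"))  # True
print(accepts(dfa, acc, 0, "aba"))     # False
```

With these samples the sketch ends up with a two-state DFA that generalizes to longer repetitions of "ab". Turning the learned DFA into a regular expression is a separate step (for example, state elimination) that this sketch does not attempt, and the result depends heavily on how representative the negative samples are.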

Since this question has come up before and has not been satisfactorily answered in the past, I am in the process of evaluating these algorithms and will update with more information about how useful they are, unless someone with more expertise in the area does so first (which is preferable).

NOTE: I am accepting my own answer for now but will gladly accept a better one if someone can provide one.

FURTHER NOTE: I've decided to go with custom code, since a generic algorithm turns out to be overkill for the data I'm working with. I'm leaving this answer here in case someone else needs it, and will update it if I ever do evaluate these algorithms.

The only thing I can suggest is to play around with NLTK (Natural Language Toolkit for Python) a bit and see if it can at least recognize recurring patterns.

Another thing you may look into is MALLET (a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, etc.).

Perl has something called LinkParser, but it seems to require you to provide a representation of the actual grammar (on the other hand, it comes with a large set of different models, so maybe it could be shoehorned into helping you sort your samples).

GATE may allow you to create examples from a subset of records in your corpus and possibly reverse-engineer the grammar from those.

Finally, have a look at the CRAN repository for text-specific packages.
