lexical-analysis | 易学教程

Python regular expressions - how to capture multiple groups from a wildcard expression?


Question: I have a Python regular expression that contains a group which can occur zero or more times, but when I retrieve the groups afterwards, only the last one is present. Example: re.search("(\w)*", "abcdefg").groups() returns the tuple ('g',). I need it to return ('a','b','c','d','e','f','g',). Is that possible? How can I do it?

Answer 1: In addition to Douglas Leeder's solution, here is the explanation: in regular expressions the group count is fixed. Placing a quantifier behind a group
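The behaviour described in the question can be reproduced in a few lines. The sketch below also shows one common workaround, re.findall with the single-character pattern; it is not necessarily the solution the truncated answer refers to. The variable names are illustrative only.

```python
import re

text = "abcdefg"

# A quantified group is still one group: it keeps only its last repetition.
last_only = re.search(r"(\w)*", text).groups()

# Workaround (an assumption, not taken from the answer above):
# match the single-character pattern repeatedly and collect every match.
all_chars = tuple(re.findall(r"\w", text))

print(last_only)   # ('g',)
print(all_chars)   # ('a', 'b', 'c', 'd', 'e', 'f', 'g')
```

The group count is determined by the parentheses in the pattern, not by how often they match, which is why repeating the match is needed to see every character.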

checking unfinished comments in flex


Question: I am new to flex. I have just written sample code to detect multi-line comments using a flex program, and now I want to improve it. I want to detect unfinished and ill-formed comments in the code. For example, a comment beginning with /* without an ending */ is an unfinished comment. By an ill-formed comment I mean a comment that is not properly formed, say, one where EOF appears inside the comment. What do I have to add to my code to check for these things? My sample code is as follows: %x COMMENT
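The check being asked for comes down to noticing that end of input arrives while the scanner is still inside a comment. In flex the usual place for that is a <COMMENT><<EOF>> rule in the exclusive start condition; the Python sketch below only illustrates the same logic outside flex. The function name find_comments and the error message are hypothetical, and the sketch ignores string literals that happen to contain /*.

```python
def find_comments(source):
    """Return (comments, errors) for C-style /* ... */ comments."""
    comments, errors = [], []
    pos = 0
    while True:
        start = source.find("/*", pos)
        if start == -1:
            break  # no further comment openers
        end = source.find("*/", start + 2)
        if end == -1:
            # End of input was reached while still inside the comment.
            errors.append(f"unterminated comment starting at offset {start}")
            break
        comments.append(source[start:end + 2])
        pos = end + 2
    return comments, errors


comments, errors = find_comments("a /* ok */ b /* open")
print(comments)
print(errors)
```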

A weird situation about matching in flex


Question: I am writing a scanner in flex, and I have the following definitions:

    %%
    "int"                      printf("JUST_INT");
    "int"[ \t\n]+"matrix"      printf("MATRIX_INT");
    [A-Za-z][A-Za-z0-9]*       printf("IDENTIFIER");
    %%

When the input to the scanner is int matrixM = 3; the output is MATRIX_INT IDENTIFIER. The scanner reads the input as int matrix M=3. But this is not a matrix; the name of the identifier is matrixM. The output should be JUST_INT IDENTIFIER. Why is this happening? Is it because of my definitions?

Answer 1: It
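flex always picks the rule that matches the longest prefix of the remaining input, and falls back to rule order only on ties. The Python sketch below simulates that longest-match policy for the three rules from the question, to show why "int matrix" wins over "int". It is a simplified model, not flex itself: characters that no rule matches are silently skipped here, whereas flex would echo them.

```python
import re

# The three rules from the question, in the same order.
RULES = [
    ("JUST_INT", re.compile(r"int")),
    ("MATRIX_INT", re.compile(r"int[ \t\n]+matrix")),
    ("IDENTIFIER", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
]


def scan(text):
    """Tokenize text using longest match, earliest rule on ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the earlier rule wins when lengths are equal.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            pos += 1  # no rule matches this character: skip it
            continue
        tokens.append(best_name)
        pos += best_len
    return tokens


print(scan("int matrixM = 3;"))  # ['MATRIX_INT', 'IDENTIFIER']
```

At offset 0 the second rule matches ten characters ("int matrix") while the other two match only three, so MATRIX_INT is chosen and the trailing M is scanned as a separate identifier.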

How to exclude more than one character in rule?


Question: I'm trying to write a string-matching rule in ANTLRWorks, and I need to match either escaped quotes or any non-quote character. I can match escaped quotes, but I'm having trouble with the other part: ~'\'' | ~'\"' ends up matching everything, and ~'\'\"' seems to be ignored by the grammar generator (at least in the visual display). What sequence of characters will get me what I want?

Answer 1: Try something like this: StringLiteral : '"' (EscapeSequence | StringChar)* '"' ; EscapeSequence : '\\' ('
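The shape of the StringLiteral rule in the answer (a quote, then any number of escape sequences or ordinary characters, then a closing quote) can be tried out as a regular expression. The Python sketch below makes two assumptions that the truncated answer does not confirm: an escape sequence is a backslash followed by any single character, and StringChar is any character other than a double quote or a backslash.

```python
import re

# '"' (EscapeSequence | StringChar)* '"'
#   EscapeSequence -> \\.        (assumed: backslash plus any one character)
#   StringChar     -> [^"\\]     (assumed: anything but a quote or backslash)
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')


def string_literals(source):
    """Return every double-quoted string literal found in source."""
    return STRING_LITERAL.findall(source)


print(string_literals(r'x = "a\"b" + "c"'))
```

Excluding both the quote and the backslash in one negated set is the part the question struggles with: two separate negations joined by "or" accept every character, because each character differs from at least one of the two.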

Where to find an exhaustive list of stop words?


Question: Where could I find an exhaustive list of stop words? The one I have is quite short and does not seem to suit scientific texts. I am creating lexical chains to extract key topics from scientific papers. The problem is that words like "based", "regarding", etc. should also be treated as stop words, as they do not carry much meaning.

Answer 1: You can also easily add to existing stop word lists. E.g. use the one in the NLTK toolkit: from nltk.corpus import stopwords and then add whatever you
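Extending an existing list, as the answer suggests, is a set union. To stay runnable without NLTK's corpus download, the sketch below uses a tiny hand-written base set as a stand-in; in practice it would be replaced by something like set(stopwords.words("english")) from NLTK. The names BASE_STOP_WORDS, DOMAIN_STOP_WORDS and remove_stop_words are illustrative.

```python
# Stand-in for a real list such as NLTK's English stop words (assumption).
BASE_STOP_WORDS = {"a", "the", "of", "and", "to", "in", "is"}

# Domain-specific additions mentioned in the question.
DOMAIN_STOP_WORDS = {"based", "regarding"}

STOP_WORDS = BASE_STOP_WORDS | DOMAIN_STOP_WORDS


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]


print(remove_stop_words("The method is based on lexical chains".split()))
```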

How can I find only 'interesting' words from a corpus?


Question: I am parsing sentences. I want to know the relevant content of each sentence, defined loosely as "semi-unique words" in relation to the rest of the corpus. Something similar to Amazon's "statistically improbable phrases", which seem to (often) convey the character of a book through oddball strings of words. My first pass was to start making a common-words list. This knocks out the easy ones like "a", "the", "from", etc. Unsurprisingly, this list gets quite long. One idea is to
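One way to approximate "semi-unique words" without maintaining a common-words list is to weight each word by how few sentences contain it. The sketch below uses a basic TF-IDF score for that; this is a technique swapped in for illustration, not something the question or its answers propose, and the function name interesting_words and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def interesting_words(sentences, top_n=2):
    """Return the top_n highest-scoring words of each sentence (TF-IDF)."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: the number of sentences each word appears in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Words found in every sentence get log(1) = 0 and sink to the bottom.
        scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:top_n])
    return result


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the quantum flux capacitor hummed",
]
print(interesting_words(corpus))
```

Words that occur in every sentence score zero regardless of how often they repeat, so frequent function words drop out without being listed by hand.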

literals extraction policy for a lexical Analyzer


Question: I have built a lexical analyzer for a C-like language which, for example, given this input produces the following result.

Input:

    int i = 0 ; int j = i + 3;

Output:

    int KEYWORD
    i IDENTIFIER
    = OPERATOR
    ; PUNCTUATION
    int KEYWORD
    j IDENTIFIER
    = OPERATOR
    i IDENTIFIER
    + OPERATOR
    3 INTEGER_CONSTANT
    ; PUNCTUATION

In the above example you may have noticed that the given input was syntactically correct; however, when I give it something like the input below, it fails.

Input:

    int i = "1.2.2222.+\<++++

I have made a class
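One policy for malformed literals is to give the lexer explicit error categories, so that bad input yields a token instead of a failure. The Python sketch below illustrates that idea with a regex-based tokenizer; the token names UNTERMINATED_STRING and ERROR, the keyword list, and the operator set are all assumptions and are not taken from the asker's class.

```python
import re

# Order matters: earlier alternatives are tried first at each position.
TOKEN_SPEC = [
    ("STRING_CONSTANT", r'"(?:\\.|[^"\\\n])*"'),
    ("UNTERMINATED_STRING", r'"(?:\\.|[^"\\\n])*'),  # opening quote, no close
    ("INTEGER_CONSTANT", r"\d+"),
    ("KEYWORD", r"\b(?:int|float|char|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR", r"[=+\-*/<>]"),
    ("PUNCTUATION", r"[;,(){}]"),
    ("SKIP", r"\s+"),
    ("ERROR", r"."),  # any other single character
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (lexeme, category) pairs, dropping whitespace."""
    return [
        (m.group(), m.lastgroup)
        for m in MASTER.finditer(source)
        if m.lastgroup != "SKIP"
    ]


for lexeme, category in tokenize('int i = "1.2.2222.+\\<++++'):
    print(lexeme, category)
```

With this layout the malformed input from the question is reported as a single UNTERMINATED_STRING token after the = operator, and the caller decides whether to stop or continue scanning.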

Is it possible to let several lexers share same ident definitions?


Question: I have several lexers: lexer_1.mll, lexer_2.mll, ... Some definitions of ident (let ident = regexp) are common and repeated in these files, for instance the definitions of INTEGER, FLOAT, ... Does anyone know if it is possible to define them once and for all somewhere, and let the .mll files refer to them?

Answer 1: I'm afraid there's no "pure OCaml" solution, as ident seems to be systematically inlined by ocamllex. You can still put your regexp definition in a file, and use cpp (or any other C

Javascript Compiler / Lexer


Question: I'm looking for a free, open-source JavaScript/ECMAScript compiler or lexical analyzer, in order to compile JavaScript code to an executable that can run on Linux. I understand that I will probably need to add some objects/APIs for the JavaScript code to be able to perform operations within the OS. I was hoping that some of this work has already been done somewhere. Any pointers, anyone? Thanks!

Answer 1: You can try Rhino, a free JavaScript implementation written in Java. If you need an

Make a table containing tokens visible for both .mly and .mll


Question: I would like to define a keyword_table which maps some strings to some tokens, and I would like to make this table visible to both parser.mly and lexer.mll. It seems that the table has to be defined in parser.mly:

    %{
    open Utility (* where hash_table is defined to make a table from a list *)
    let keyword_table = hash_table [
      "Call", CALL; "Case", CASE; "Close", CLOSE; "Const", CONST;
      "Declare", DECLARE; "DefBool", DEFBOOL; "DefByte", DEFBYTE ]
    %}

However, I could NOT use it in lexer.mll ,