What is the regular expression for a multi-line string?

梦毁少年i 2021-01-25 08:05

I am learning to write a compiler, and the language has rules for strings such as a single-line string:

char ch[] = "abcd";

and a multi-line string:

printf("


        
1 answer
  •  清酒与你
    2021-01-25 08:55

    A common string pattern is

    \"([^"\\\n]|\\(.|\n))*\"
    

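    This matches strings that contain escaped double quotes (\") and escaped backslashes (\\). The alternative \\(.|\n) allows any character after a backslash; the |\n is needed because . does not match a newline in (f)lex. Although some backslash sequences are longer than one character (\x40), none of them include non-alphanumerics after the first character, so the remaining characters are matched by the first alternative, [^"\\\n].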
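To make the behaviour of that pattern concrete, here is a minimal hand-written sketch in C that accepts the same language. The function name `match_string_literal` and the assumption that the input is a NUL-terminated buffer are mine, not part of the original answer; a generated scanner would do this work for you.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the pattern  \"([^"\\\n]|\\(.|\n))*\"
 * Returns the length of the literal that starts at s, counting both
 * quotes, or 0 if s does not begin with a complete string literal.
 * Assumes s is NUL-terminated; NUL is treated as end of input. */
size_t match_string_literal(const char *s)
{
    size_t i = 0;

    if (s[i] != '"')
        return 0;                  /* must start with an opening quote */
    i++;

    for (;;) {
        char c = s[i];

        if (c == '"')
            return i + 1;          /* closing quote: literal is complete */
        if (c == '\0' || c == '\n')
            return 0;              /* unescaped newline or end of input */
        if (c == '\\') {
            if (s[i + 1] == '\0')
                return 0;          /* backslash with nothing after it */
            i += 2;                /* backslash plus any one character,
                                      including a newline */
        } else {
            i++;                   /* ordinary character */
        }
    }
}
```

With this sketch, a literal containing a backslash followed by a newline is accepted as a single token, while a literal containing a bare newline is rejected.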

    It is possible that your input includes Windows line endings (CR-LF), in which case the backslash will not be directly followed by a newline; it will be followed by a carriage return. If you want to accept that input rather than throwing an error (which might be more appropriate), you need to do so explicitly:

    \"([^"\\\n]|\\(.|\r?\n))*\"
    

    But recognising a string and understanding what the string represents are two different things. Normally a compiler will need to turn the representation of a string into a byte sequence and that requires, for example, turning \n into the byte 10 and removing backslashed newlines altogether.
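As a rough illustration of that second step, the following C sketch converts the text between the quotes into its byte sequence. The name `decode_string_body` is hypothetical, and only a small subset of escapes is handled (`\n`, `\t`, `\\`, `\"`, and a backslashed newline with an optional preceding carriage return); a real compiler would also need octal, hex and the remaining escapes.

```c
#include <stddef.h>

/* Hypothetical helper: decodes the text between the quotes (body, len)
 * into out and returns the number of bytes written, or (size_t)-1 on an
 * escape this sketch does not handle. The output is never longer than
 * the input, so out needs room for len bytes. */
size_t decode_string_body(const char *body, size_t len, char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        char c = body[i];

        if (c != '\\') {
            out[n++] = c;                /* ordinary character: copy it */
            continue;
        }
        if (++i >= len)
            return (size_t)-1;           /* lone trailing backslash */

        switch (body[i]) {
        case 'n':  out[n++] = '\n'; break;   /* \n becomes byte 10 */
        case 't':  out[n++] = '\t'; break;
        case '\\': out[n++] = '\\'; break;
        case '"':  out[n++] = '"';  break;
        case '\n': break;                    /* backslashed newline: drop it */
        case '\r':                           /* backslash, CR, LF: drop all three */
            if (i + 1 < len && body[i + 1] == '\n') {
                i++;
                break;
            }
            return (size_t)-1;
        default:
            return (size_t)-1;               /* escape not covered here */
        }
    }
    return n;
}
```

The caller would pass the matched token minus its two quotes, so the recogniser and the decoder stay separate, as the paragraph above suggests.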

    That task can easily be done in a (f)lex scanner using start conditions. (Or, of course, you can rescan the string using a different lexical scanner.)

    Additionally, you need to think about error-handling. Once you ban strings with unescaped newlines (as C does), you open the door to the possibility of an unterminated string, where a newline is encountered before the closing quote. The same could happen at the end of the file if a string is not correctly closed.

    If you have a single-character fallback rule, it will recognise the opening quote of an unterminated string. This is not desirable, because the scanner will then treat the contents of the string as program text, leading to a cascade of errors. If you are not attempting error recovery, it doesn't matter; but if you are, it is usually better to at least recognise the unterminated string as such up to the newline, using a different pattern.
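One way to picture that recovery strategy is the C sketch below, which classifies what follows an opening quote as either a complete string or an unterminated one and reports how many characters to consume. The enum, the function name `scan_string`, and the NUL-terminated input are assumptions made for illustration; in a (f)lex scanner the unterminated case would presumably be a second rule resembling the string pattern without its closing quote.

```c
#include <stddef.h>

/* Hypothetical token kinds for this sketch. */
enum string_token { TOK_NONE, TOK_STRING, TOK_UNTERMINATED };

/* Scans from s. On return, *len is the number of characters that belong
 * to the token: the whole literal for TOK_STRING, or everything up to
 * (not including) the offending newline or end of input for
 * TOK_UNTERMINATED. Assumes s is NUL-terminated. */
enum string_token scan_string(const char *s, size_t *len)
{
    size_t i;

    *len = 0;
    if (s[0] != '"')
        return TOK_NONE;               /* not a string at all */

    i = 1;
    while (s[i] != '\0' && s[i] != '\n') {
        if (s[i] == '"') {
            *len = i + 1;              /* include the closing quote */
            return TOK_STRING;
        }
        if (s[i] == '\\' && s[i + 1] != '\0')
            i += 2;                    /* escaped character, newline included */
        else
            i++;
    }

    *len = i;                          /* stop before the newline or NUL */
    return TOK_UNTERMINATED;
}
```

Because the unterminated token swallows the rest of the line, the scanner can report one error and resume on the next line instead of scanning the string contents as program text.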
