How to handle multi-line comments in a live syntax highlighter?

Question

I'm writing my own text editor with syntax highlighting in Java, and at the moment it simply parses and highlights the current line every time the user enters a single character. While presumably not the most efficient way, it's good enough and doesn't cause any noticeable performance issues. In pseudo-Java, this would be the core concept of my code:

public void textUpdated(String wholeText, int updateOffset, int updateLength) {
    int lineStart = getFirstLineStart(wholeText, updateOffset);
    int lineEnd = getLastLineEnd(wholeText, updateOffset + updateLength);

    List<Token> foundTokens = tokenizeText(wholeText, lineStart, lineEnd);

    for(Token token : foundTokens) {
        highlightText(token.offset, token.length, token.tokenType);
    }
}

The real problem lies with multi-line comments. To check if an entered character is inside a multi-line comment, the program would need to parse back to the most recent occurrence of a "/*", while also being aware of whether this occurrence is inside a literal or another comment. This would not be an issue if the amount of text is small, but if the text consists of 20,000 lines of code, it would possibly have to scan and (re)highlight 20,000 lines of code on each key press, which would be very inefficient.

So my ultimate question is: how do I handle multi-line tokens/comments in a syntax highlighter while keeping it efficient?

Answer 1:

I tried to do this (for fun) about 10 years ago (or more). Because the code is so old, I don't remember all the details of the code and its logic conditions. All the code here is basically a brute force solution. It in no way attempts to keep the state of each line as suggested by rici.

I'll try to explain the high level concept of the code. Hope some of it makes sense to you.

"at the moment it simply parses and highlights the current line every time the user enters a single character."

This is the basic premise of my code as well. However, it does handle pasting multiple lines of code as well.

"how do I handle multi-line tokens/comments in a syntax highlighter while keeping it efficient?"

In my solution, when you enter "/*" to start a multi-line comment, I comment all the following lines of code until I find the end of the comment, the start of another multi-line comment, or the end of the Document. When you then enter the matching "*/" to end the multi-line comment, I re-highlight the following lines until the next multi-line comment or the end of the Document.

So the amount of highlighting done depends on how much code you have between multi-line comments.

That is a quick overview of how it works. I doubt it is 100% accurate since I've only played with it a little bit. It should be noted this code was written when I was just learning Java, so in no way would I suggest it is the best approach, just the best I knew at the time.

Here is the code for your amusement :)

Just run the code and click on the button to get started.

import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.*;
import javax.swing.event.*;
import javax.swing.text.*;

class SyntaxDocument extends DefaultStyledDocument
{
    private DefaultStyledDocument doc;
    private Element rootElement;

    private boolean multiLineComment;
    private MutableAttributeSet normal;
    private MutableAttributeSet keyword;
    private MutableAttributeSet comment;
    private MutableAttributeSet quote;

    private Set<String> keywords;

    private int lastLineProcessed = -1;

    public SyntaxDocument()
    {
        doc = this;
        rootElement = doc.getDefaultRootElement();
        putProperty( DefaultEditorKit.EndOfLineStringProperty, "\n" );

        normal = new SimpleAttributeSet();
        StyleConstants.setForeground(normal, Color.black);

        comment = new SimpleAttributeSet();
        StyleConstants.setForeground(comment, Color.gray);
        StyleConstants.setItalic(comment, true);

        keyword = new SimpleAttributeSet();
        StyleConstants.setForeground(keyword, Color.blue);

        quote = new SimpleAttributeSet();
        StyleConstants.setForeground(quote, Color.red);

        keywords = new HashSet<String>();
        keywords.add( "abstract" );
        keywords.add( "boolean" );
        keywords.add( "break" );
        keywords.add( "byte" );
        keywords.add( "byvalue" );
        keywords.add( "case" );
        keywords.add( "cast" );
        keywords.add( "catch" );
        keywords.add( "char" );
        keywords.add( "class" );
        keywords.add( "const" );
        keywords.add( "continue" );
        keywords.add( "default" );
        keywords.add( "do" );
        keywords.add( "double" );
        keywords.add( "else" );
        keywords.add( "extends" );
        keywords.add( "false" );
        keywords.add( "final" );
        keywords.add( "finally" );
        keywords.add( "float" );
        keywords.add( "for" );
        keywords.add( "future" );
        keywords.add( "generic" );
        keywords.add( "goto" );
        keywords.add( "if" );
        keywords.add( "implements" );
        keywords.add( "import" );
        keywords.add( "inner" );
        keywords.add( "instanceof" );
        keywords.add( "int" );
        keywords.add( "interface" );
        keywords.add( "long" );
        keywords.add( "native" );
        keywords.add( "new" );
        keywords.add( "null" );
        keywords.add( "operator" );
        keywords.add( "outer" );
        keywords.add( "package" );
        keywords.add( "private" );
        keywords.add( "protected" );
        keywords.add( "public" );
        keywords.add( "rest" );
        keywords.add( "return" );
        keywords.add( "short" );
        keywords.add( "static" );
        keywords.add( "super" );
        keywords.add( "switch" );
        keywords.add( "synchronized" );
        keywords.add( "this" );
        keywords.add( "throw" );
        keywords.add( "throws" );
        keywords.add( "transient" );
        keywords.add( "true" );
        keywords.add( "try" );
        keywords.add( "var" );
        keywords.add( "void" );
        keywords.add( "volatile" );
        keywords.add( "while" );
    }

    /*
     *  Override to apply syntax highlighting after the document has been updated
     */
    public void insertString(int offset, String str, AttributeSet a) throws BadLocationException
    {
        if (str.equals("{"))
            str = addMatchingBrace(offset);

        super.insertString(offset, str, a);
        processChangedLines(offset, str.length());
    }

    /*
     *  Override to apply syntax highlighting after the document has been updated
     */
    public void remove(int offset, int length) throws BadLocationException
    {
        super.remove(offset, length);
        processChangedLines(offset, 0);
    }

    /*
     *  Determine how many lines have been changed,
     *  then apply highlighting to each line
     */
    public void processChangedLines(int offset, int length)
        throws BadLocationException
    {
        String content = doc.getText(0, doc.getLength());

        //  The lines affected by the latest document update

        int startLine = rootElement.getElementIndex(offset);
        int endLine = rootElement.getElementIndex(offset + length);

        if (startLine > endLine)
            startLine = endLine;

        //  Make sure all comment lines prior to the start line are commented
        //  and determine if the start line is still in a multi line comment

        if (startLine != lastLineProcessed
        &&  startLine != lastLineProcessed + 1)
        {
            setMultiLineComment( commentLinesBefore( content, startLine ) );
        }

        //  Do the actual highlighting

        for (int i = startLine; i <= endLine; i++)
        {
            applyHighlighting(content, i);
        }

        //  Resolve highlighting to the next end multi line delimiter

        if (isMultiLineComment())
            commentLinesAfter(content, endLine);
        else
            highlightLinesAfter(content, endLine);

    }

    /*
     *  Highlight lines when a multi line comment is still 'open'
     *  (ie. matching end delimiter has not yet been encountered)
     */
    private boolean commentLinesBefore(String content, int line)
    {
        int offset = rootElement.getElement( line ).getStartOffset();

        //  Start of comment not found, nothing to do

        int startDelimiter = lastIndexOf( content, getStartDelimiter(), offset - 2 );

        if (startDelimiter < 0)
            return false;

        //  Matching start/end of comment found, nothing to do

        int endDelimiter = indexOf( content, getEndDelimiter(), startDelimiter );

        if (endDelimiter < offset && endDelimiter != -1)
            return false;

        //  End of comment not found, highlight the lines

        doc.setCharacterAttributes(startDelimiter, offset - startDelimiter + 1, comment, false);
        return true;
    }

    /*
     *  Highlight comment lines to matching end delimiter
     */
    private void commentLinesAfter(String content, int line)
    {
        int offset = rootElement.getElement( line ).getStartOffset();

        //  End of comment and Start of comment not found
        //  highlight until the end of the Document

        int endDelimiter = indexOf( content, getEndDelimiter(), offset );

        if (endDelimiter < 0)
        {
            endDelimiter = indexOf( content, getStartDelimiter(), offset + 2);

            if (endDelimiter < 0)
            {
                doc.setCharacterAttributes(offset, content.length() - offset + 1, comment, false);
                return;
            }
        }

        //  Matching start/end of comment found, comment the lines

        int startDelimiter = lastIndexOf( content, getStartDelimiter(), endDelimiter );

        if (startDelimiter < 0 || startDelimiter >= offset)
        {
            doc.setCharacterAttributes(offset, endDelimiter - offset + 1, comment, false);
        }
    }

    /*
     *  Highlight lines to start or end delimiter
     */
    private void highlightLinesAfter(String content, int line)
        throws BadLocationException
    {
        int offset = rootElement.getElement( line ).getEndOffset();

        //  Start/End delimiter not found, nothing to do

        int startDelimiter = indexOf( content, getStartDelimiter(), offset );
        int endDelimiter = indexOf( content, getEndDelimiter(), offset );

        if (startDelimiter < 0)
            startDelimiter = content.length();

        if (endDelimiter < 0)
            endDelimiter = content.length();

        int delimiter = Math.min(startDelimiter, endDelimiter);

        if (delimiter < offset)
            return;

        //  Start/End delimiter found, reapply highlighting

        int endLine = rootElement.getElementIndex( delimiter );

        for (int i = line + 1; i <= endLine; i++)
        {
            Element branch = rootElement.getElement( i );
            Element leaf = doc.getCharacterElement( branch.getStartOffset() );
            AttributeSet as = leaf.getAttributes();

            if ( as.isEqual(comment) )
            {
                applyHighlighting(content, i);
            }
        }
    }

    /*
     *  Parse the line to determine the appropriate highlighting
     */
    private void applyHighlighting(String content, int line)
        throws BadLocationException
    {
        lastLineProcessed = line;

        int startOffset = rootElement.getElement( line ).getStartOffset();
        int endOffset = rootElement.getElement( line ).getEndOffset() - 1;

        int lineLength = endOffset - startOffset;
        int contentLength = content.length();

        if (endOffset >= contentLength)
            endOffset = contentLength - 1;

        //  check for multi line comments
        //  (always set the comment attribute for the entire line)

        if (endingMultiLineComment(content, startOffset, endOffset)
        ||  isMultiLineComment()
        ||  startingMultiLineComment(content, startOffset, endOffset) )
        {
            doc.setCharacterAttributes(startOffset, endOffset - startOffset + 1, comment, false);
            lastLineProcessed = -1;
            return;
        }

        //  set normal attributes for the line

        doc.setCharacterAttributes(startOffset, lineLength, normal, true);

        //  check for single line comment

        int index = content.indexOf(getSingleLineDelimiter(), startOffset);

        if ( (index > -1) && (index < endOffset) )
        {
            doc.setCharacterAttributes(index, endOffset - index + 1, comment, false);
            endOffset = index - 1;
        }

        //  check for tokens

        checkForTokens(content, startOffset, endOffset);
    }

    /*
     *  Does this line contain the start delimiter
     */
    private boolean startingMultiLineComment(String content, int startOffset, int endOffset)
        throws BadLocationException
    {
        int index = indexOf( content, getStartDelimiter(), startOffset );

        if ( (index < 0) || (index > endOffset) )
            return false;
        else
        {
            setMultiLineComment( true );
            return true;
        }
    }

    /*
     *  Does this line contain the end delimiter
     */
    private boolean endingMultiLineComment(String content, int startOffset, int endOffset)
        throws BadLocationException
    {
        int index = indexOf( content, getEndDelimiter(), startOffset );

        if ( (index < 0) || (index > endOffset) )
            return false;
        else
        {
            setMultiLineComment( false );
            return true;
        }
    }

    /*
     *  We have found a start delimiter
     *  and are still searching for the end delimiter
     */
    private boolean isMultiLineComment()
    {
        return multiLineComment;
    }

    private void setMultiLineComment(boolean value)
    {
        multiLineComment = value;
    }

    /*
     *  Parse the line for tokens to highlight
     */
    private void checkForTokens(String content, int startOffset, int endOffset)
    {
        while (startOffset <= endOffset)
        {
            //  skip the delimiters to find the start of a new token

            while ( isDelimiter( content.substring(startOffset, startOffset + 1) ) )
            {
                if (startOffset < endOffset)
                    startOffset++;
                else
                    return;
            }

            //  Extract and process the entire token

            if ( isQuoteDelimiter( content.substring(startOffset, startOffset + 1) ) )
                startOffset = getQuoteToken(content, startOffset, endOffset);
            else
                startOffset = getOtherToken(content, startOffset, endOffset);
        }
    }

    /*
     *  Highlight the quoted literal starting at startOffset and
     *  return the offset just past it
     */
    private int getQuoteToken(String content, int startOffset, int endOffset)
    {
        String quoteDelimiter = content.substring(startOffset, startOffset + 1);
        String escapeString = getEscapeString(quoteDelimiter);

        int index;
        int endOfQuote = startOffset;

        //  skip over the escape quotes in this quote

        index = content.indexOf(escapeString, endOfQuote + 1);

        while ( (index > -1) && (index < endOffset) )
        {
            endOfQuote = index + 1;
            index = content.indexOf(escapeString, endOfQuote);
        }

        // now find the matching delimiter

        index = content.indexOf(quoteDelimiter, endOfQuote + 1);

        if ( (index < 0) || (index > endOffset) )
            endOfQuote = endOffset;
        else
            endOfQuote = index;

        doc.setCharacterAttributes(startOffset, endOfQuote - startOffset + 1, quote, false);

        return endOfQuote + 1;
    }

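    /*
     *  Read the token starting at startOffset, highlight it if it is
     *  a keyword, and return the offset at which scanning resumes
     */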
    private int getOtherToken(String content, int startOffset, int endOffset)
    {
        int endOfToken = startOffset + 1;

        while ( endOfToken <= endOffset )
        {
            if ( isDelimiter( content.substring(endOfToken, endOfToken + 1) ) )
                break;

            endOfToken++;
        }

        String token = content.substring(startOffset, endOfToken);

        if ( isKeyword( token ) )
        {
            doc.setCharacterAttributes(startOffset, endOfToken - startOffset, keyword, false);
        }

        return endOfToken + 1;
    }

    /*
     *  Assume the needle will be found at the start/end of the line
     */
    private int indexOf(String content, String needle, int offset)
    {
        int index;

        while ( (index = content.indexOf(needle, offset)) != -1 )
        {
            String text = getLine( content, index ).trim();

            if (text.startsWith(needle) || text.endsWith(needle))
                break;
            else
                offset = index + 1;
        }

        return index;
    }

    /*
     *  Assume the needle will be found at the start/end of the line
     */
    private int lastIndexOf(String content, String needle, int offset)
    {
        int index;

        while ( (index = content.lastIndexOf(needle, offset)) != -1 )
        {
            String text = getLine( content, index ).trim();

            if (text.startsWith(needle) || text.endsWith(needle))
                break;
            else
                offset = index - 1;
        }

        return index;
    }

    private String getLine(String content, int offset)
    {
        int line = rootElement.getElementIndex( offset );
        Element lineElement = rootElement.getElement( line );
        int start = lineElement.getStartOffset();
        int end = lineElement.getEndOffset();
        return content.substring(start, end - 1);
    }

    /*
     *  Override for other languages
     */
    protected boolean isDelimiter(String character)
    {
        String operands = ";:{}()[]+-/%<=>!&|^~*";

        if (Character.isWhitespace( character.charAt(0) )
        ||  operands.indexOf(character) != -1 )
            return true;
        else
            return false;
    }

    /*
     *  Override for other languages
     */
    protected boolean isQuoteDelimiter(String character)
    {
        String quoteDelimiters = "\"'";

        if (quoteDelimiters.indexOf(character) < 0)
            return false;
        else
            return true;
    }

    /*
     *  Override for other languages
     */
    protected boolean isKeyword(String token)
    {
        return keywords.contains( token );
    }

    /*
     *  Override for other languages
     */
    protected String getStartDelimiter()
    {
        return "/*";
    }

    /*
     *  Override for other languages
     */
    protected String getEndDelimiter()
    {
        return "*/";
    }

    /*
     *  Override for other languages
     */
    protected String getSingleLineDelimiter()
    {
        return "//";
    }

    /*
     *  Override for other languages
     */
    protected String getEscapeString(String quoteDelimiter)
    {
        return "\\" + quoteDelimiter;
    }

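    /*
     *  Build the text inserted when "{" is typed: the opening brace,
     *  an indented empty line, and a closing brace at the current
     *  line's indentation
     */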
    protected String addMatchingBrace(int offset) throws BadLocationException
    {
        StringBuffer whiteSpace = new StringBuffer();
        int line = rootElement.getElementIndex( offset );
        int i = rootElement.getElement(line).getStartOffset();

        while (true)
        {
            String temp = doc.getText(i, 1);

            if (temp.equals(" ") || temp.equals("\t"))
            {
                whiteSpace.append(temp);
                i++;
            }
            else
                break;
        }

        return "{\n" + whiteSpace.toString() + "\t\n" + whiteSpace.toString() + "}";
    }
/*
    public void setCharacterAttributes(int offset, int length, AttributeSet s, boolean replace)
    {
        super.setCharacterAttributes(offset, length, s, replace);
    }
*/


    public static void main(String a[])
    {

        EditorKit editorKit = new StyledEditorKit()
        {
            public Document createDefaultDocument()
            {
                return new SyntaxDocument();
            }
        };

//      final JEditorPane edit = new JEditorPane()
        final JTextPane edit = new JTextPane();
//      LinePainter painter = new LinePainter(edit, Color.cyan);
//      LinePainter2 painter = new LinePainter2(edit, Color.cyan);
//      edit.setEditorKitForContentType("text/java", editorKit);
//      edit.setContentType("text/java");
        edit.setEditorKit(editorKit);

        JButton button = new JButton("Load SyntaxDocument.java");
        button.addActionListener( new ActionListener()
        {
            public void actionPerformed(ActionEvent e)
            {
                try
                {
                    long startTime = System.currentTimeMillis();
                    FileReader fr = new FileReader( "SyntaxDocument.java" );
//                  FileReader fr = new FileReader( "C:\\Java\\j2sdk1.4.2\\src\\javax\\swing\\JComponent.java" );

                    BufferedReader br = new BufferedReader(fr);
                    edit.read( br, null );

                    System.out.println("Load: " + (System.currentTimeMillis() - startTime));
                    System.out.println("Document contains: " + edit.getDocument().getLength() + " characters");
                    edit.requestFocus();
                }
                catch(Exception e2)
                {
                    //  Report the failure instead of silently swallowing it
                    e2.printStackTrace();
                }
            }
        });

        JFrame frame = new JFrame("Syntax Highlighting");
        frame.getContentPane().add( new JScrollPane(edit) );
        frame.getContentPane().add(button, BorderLayout.SOUTH);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setSize(800,300);
        frame.setVisible(true);
    }
}

Note: this code does not check if the comment delimiters are inside a literal, so that would need to be improved upon.
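One possible way to close that gap is to walk each line character by character and remember whether the scan is currently inside a string or character literal. The following is only a minimal sketch under that assumption; the class name CommentStartFinder and the method findBlockCommentStart are hypothetical and are not part of the SyntaxDocument code above.

```java
/** Hypothetical helper (not part of SyntaxDocument): locates "/*" outside of literals. */
final class CommentStartFinder {

    /**
     * Returns the index of the first "/*" on the line that is not inside a
     * string/char literal or behind a "//" comment, or -1 if there is none.
     */
    static int findBlockCommentStart(String line) {
        char quote = 0;                         // 0 means "not inside a literal"
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;                        // skip the escaped character
                } else if (c == quote) {
                    quote = 0;                  // the literal is closed
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                      // a literal is opened
            } else if (c == '/' && i + 1 < line.length()) {
                char next = line.charAt(i + 1);
                if (next == '/') {
                    return -1;                  // rest of the line is a single-line comment
                }
                if (next == '*') {
                    return i;                   // genuine block comment start
                }
            }
        }
        return -1;
    }
}
```

A method like this could replace the plain indexOf search for the start delimiter; the end delimiter would need the same treatment in the other direction.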

I don't really expect you to use this code, but I thought it might give you an idea of the performance you might get when using the brute force approach.

Answer 2:

One common approach is to save the lexer state at the start of each line. (Typically, the lexer state will be a small integer or enum; for Java-like languages, it would probably be limited to three values: normal, inside multiline comment, and inside multiline string constant.)

A change to a line could change the lexer state at the start of the next line, but it can't change the state at the beginning of the current line, so the retokenisation of the line can be done from the start of the line, using the current line's lexer state as a starting condition. Keeping per-line lexer states makes it easy to handle the case where the cursor is moved to another line, possibly quite some distance away.
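As a rough illustration of the per-line state idea, here is a minimal sketch. It assumes only two states (the multiline string state is left out for brevity), and the names LineStateCache, scanLine, rebuild and lineEdited are made up for this example. Note that this version propagates a state change to the following lines immediately, which is the simplest thing to show but not necessarily what you want in an interactive editor.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: remembers the lexer state at the start of every line. */
final class LineStateCache {
    enum State { NORMAL, BLOCK_COMMENT }

    // startStates.get(i) is the lexer state at the beginning of line i.
    private final List<State> startStates = new ArrayList<>();

    /** Scans one line from the given start state and returns the state at its end. */
    static State scanLine(State state, String line) {
        int i = 0;
        while (i < line.length()) {
            if (state == State.BLOCK_COMMENT) {
                int end = line.indexOf("*/", i);
                if (end < 0) return State.BLOCK_COMMENT;  // comment continues on the next line
                state = State.NORMAL;
                i = end + 2;
            } else if (line.charAt(i) == '"' || line.charAt(i) == '\'') {
                i = skipLiteral(line, i);                 // delimiters inside literals do not count
            } else if (line.startsWith("//", i)) {
                return State.NORMAL;                      // single-line comment ends with the line
            } else if (line.startsWith("/*", i)) {
                state = State.BLOCK_COMMENT;
                i += 2;
            } else {
                i++;
            }
        }
        return state;
    }

    /** Returns the index just past the literal that opens at `open`. */
    private static int skipLiteral(String line, int open) {
        char quote = line.charAt(open);
        for (int i = open + 1; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\') i++;                           // skip the escaped character
            else if (c == quote) return i + 1;
        }
        return line.length();                             // unterminated literal ends with the line
    }

    /** Recomputes every start state, for example once after loading a file. */
    void rebuild(List<String> lines) {
        startStates.clear();
        State state = State.NORMAL;
        for (String line : lines) {
            startStates.add(state);
            state = scanLine(state, line);
        }
    }

    /**
     * Call after the text of line `edited` changed (no lines added or removed).
     * Rescans from that line and stops as soon as a following line already has
     * the right start state. Returns how many lines were rescanned.
     */
    int lineEdited(List<String> lines, int edited) {
        int rescanned = 0;
        State state = startStates.get(edited);            // unaffected by the edit
        for (int i = edited; i < lines.size(); i++) {
            if (i > edited) {
                if (startStates.get(i) == state) break;   // later lines are still valid
                startStates.set(i, state);
            }
            state = scanLine(state, lines.get(i));
            rescanned++;
        }
        return rescanned;
    }

    State startState(int line) { return startStates.get(line); }
}
```

In the common case an edit does not change the end state of the line, so lineEdited touches exactly one line no matter how long the file is.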

If the edit changes the lexer state at the end of the line (which is to say the start of the next line) you could rescan the rest of the file. However, doing so immediately is really annoying for the user because it means that every time they type a quote, the entire screen gets repainted, because it has become part of a multiline string (for example). Since most of the time the user will close the string (or comment), it is usually better to delay the rescan. For example, you might wait until the user moves the cursor or completes the lexical element or some other such signal. Another common approach is to insert a "ghost" close symbol after the cursor, which will keep the lexer in sync. The ghost will be deleted if the user types it explicitly, or deletes it explicitly.
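The delayed rescan can be modelled very simply: remember the first line whose stored state may be stale, and only act on it once the cursor leaves the line being edited. The sketch below assumes that "cursor moved to another line" is the chosen signal; DeferredRescan and its method names are hypothetical, and the actual rescan is passed in as a callback.

```java
import java.util.function.IntConsumer;

/** Hypothetical sketch: postpones the rescan until the cursor changes line. */
final class DeferredRescan {
    private final IntConsumer rescanFrom;   // callback that rescans from the given line
    private int dirtyFrom = -1;             // first possibly stale line, or -1 if none
    private int cursorLine = 0;

    DeferredRescan(IntConsumer rescanFrom) {
        this.rescanFrom = rescanFrom;
    }

    /** Record that the end state of `line` changed, without rescanning yet. */
    void endStateChanged(int line) {
        int next = line + 1;
        dirtyFrom = (dirtyFrom < 0) ? next : Math.min(dirtyFrom, next);
    }

    /** Call on every caret update; triggers the pending rescan when the line changes. */
    void cursorMovedTo(int line) {
        if (line != cursorLine && dirtyFrom >= 0) {
            rescanFrom.accept(dirtyFrom);
            dirtyFrom = -1;
        }
        cursorLine = line;
    }
}
```

If the user closes the comment or string before leaving the line, the editor can simply clear the pending mark again instead of rescanning; that part is omitted here.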

You seem to be keeping the entire program as a single string. IMHO, it's better to keep it as a list of lines, to avoid having to copy the entire string when a character is inserted or deleted. Otherwise, editing very long files becomes really annoying.
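To make the list-of-lines suggestion concrete, here is a small sketch of such a buffer. LineBuffer and its methods are invented for illustration; a real editor would more likely use a gap buffer or rope per line, but the point is the same: a keystroke only touches one short line instead of copying the whole text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: stores the text as one StringBuilder per line. */
final class LineBuffer {
    private final List<StringBuilder> lines = new ArrayList<>();

    LineBuffer(String text) {
        // The -1 limit keeps trailing empty lines.
        for (String line : text.split("\n", -1)) {
            lines.add(new StringBuilder(line));
        }
    }

    /** Inserts a character; a newline splits the line in two. */
    void insert(int line, int column, char c) {
        StringBuilder current = lines.get(line);
        if (c == '\n') {
            String tail = current.substring(column);
            current.setLength(column);
            lines.add(line + 1, new StringBuilder(tail));
        } else {
            current.insert(column, c);
        }
    }

    /** Deletes the character at the position; at the end of a line it joins the next line. */
    void delete(int line, int column) {
        StringBuilder current = lines.get(line);
        if (column < current.length()) {
            current.deleteCharAt(column);
        } else if (line + 1 < lines.size()) {
            current.append(lines.remove(line + 1));
        }
    }

    int lineCount() { return lines.size(); }

    String line(int index) { return lines.get(index).toString(); }
}
```

The line indices used here line up naturally with a per-line lexer state, since both are keyed by line number.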

Finally, you should never tokenise text which is not visible. Avoiding that will limit the damage of large retokenisations.

Source: https://stackoverflow.com/questions/27332210/how-to-handle-multi-line-comments-in-a-live-syntax-highlighter

Tags

java

parsing

syntax-highlighting

tokenize

lexer