Do interpreters and compilers compare (and ultimately match) two strings for a potential match in a character-by-character and left-to-right fashion? Or is there an underlying
How regular expressions are matched is an implementation detail: there is more than one way to implement them.
In fact, some languages implement them inefficiently.
If you want to understand more, I can refer you to this article: https://swtch.com/~rsc/regexp/regexp1.html
Regular expressions don't specify implementation details. It's up to each implementation to decide how best to compile and execute them. For example, this regex.c implementation looks like it processes more than one character at a time.
There are two big families of regex engines, called NFA and DFA (I'm using the terminology from Jeffrey Friedl's book):
An NFA implementation will roughly work the following way:
Keep one pointer into the input string and one pointer into the pattern.
Then use the pattern as a recipe for how to advance in the input string. If the pattern says "a", for instance, and the current input offset points to an "a" character, then that character will be consumed and both pointers will advance to the next position. If the character doesn't match, there will be a backtrack (the input pointer will go back to a previous valid position and the pattern pointer will be set to a different possible alternative at that input position).
The point is that the recognition is driven by the pattern.
(the above explanation is very rough, as it doesn't include optimizations etc - and modern patterns cannot be implemented with a formal automaton anyway)
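To make the pattern-driven idea concrete, here is a minimal Python sketch of a backtracking matcher. It is only an illustration, not how any real engine is written: the name backtrack_match is made up, it supports just literal characters, "." and "x*", and it requires the pattern to match the whole input.

```python
def backtrack_match(pattern, text, p=0, i=0):
    """Pattern-driven match: p is the pattern pointer, i is the input pointer."""
    # Pattern exhausted: succeed only if the input is exhausted too.
    if p == len(pattern):
        return i == len(text)

    # Does the current pattern element accept the current input character?
    first = i < len(text) and pattern[p] in (text[i], ".")

    # "x*": first try to consume one more character (greedy). If that path
    # fails, backtrack and try skipping the "x*" element instead.
    if p + 1 < len(pattern) and pattern[p + 1] == "*":
        if first and backtrack_match(pattern, text, p, i + 1):
            return True
        return backtrack_match(pattern, text, p + 2, i)

    # Plain element: consume it and advance both pointers.
    return first and backtrack_match(pattern, text, p + 1, i + 1)


print(backtrack_match("a*ab", "aaab"))
print(backtrack_match("ab", "abc"))
```

With "a*ab" against "aaab", the "a*" first swallows all three "a" characters, the rest of the pattern fails, and the matcher goes back one character at a time until "ab" fits. That retrying is the backtrack described above.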
A DFA implementation works the other way around:
There is still one input pointer, but there are multiple pattern pointers. The input pointer will advance character by character, and the pattern pointers will keep track of every state in the pattern that is still valid for the input seen so far.
The recognition is driven by the input.
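A rough Python sketch of the input-driven approach, under the same assumptions as a toy engine (only literals, "." and "x*", whole-input match). The names closure and input_driven_match are made up for this example. Strictly speaking it tracks the set of live pattern positions on the fly instead of precomputing a DFA table, but the behaviour is the one described above: one pass over the input, many pattern pointers.

```python
def closure(pattern, positions):
    """Add every position reachable without consuming input (skipping an 'x*')."""
    result = set()
    stack = list(positions)
    while stack:
        p = stack.pop()
        if p in result:
            continue
        result.add(p)
        # An "x*" element may match zero characters, so the position just
        # after it is reachable for free.
        if p + 1 < len(pattern) and pattern[p + 1] == "*":
            stack.append(p + 2)
    return result


def input_driven_match(pattern, text):
    # All pattern positions that are valid before reading any input.
    states = closure(pattern, {0})
    for ch in text:  # the input pointer only ever moves forward
        next_states = set()
        for p in states:
            if p < len(pattern) and pattern[p] in (ch, "."):
                starred = p + 1 < len(pattern) and pattern[p + 1] == "*"
                # A starred element can match again; a plain one is done.
                next_states.add(p if starred else p + 1)
        states = closure(pattern, next_states)
        if not states:
            return False  # no pattern pointer survived this character
    # Accept if some pointer has reached the end of the pattern.
    return len(pattern) in states


print(input_driven_match("a*ab", "aaab"))
```

Because no input character is ever revisited, the work per character is bounded by the pattern length, which is why this family avoids the worst-case blowup of backtracking.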
Both these methods have very different properties:
Some regex engines (such as PCRE) can implement both recognition methods. I recommend you read the PCRE docs about the two matching algorithms, which explain the differences in more technical terms.
As to the actual implementation, it highly depends on the regex engine itself. PCRE has several of them:
So you can see there are several possible approaches for NFA alone. Other engines have different implementations that allow for a different feature set. For instance, .NET's regexes can be matched left-to-right or right-to-left, and thus can easily provide variable-length lookbehind. PCRE's lookbehind, by contrast, is implemented by temporarily shifting the input pointer to the left by the lookbehind's expected input length and performing a left-to-right match from this offset.
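The "shift left, then match left-to-right" idea for fixed-length lookbehind can be sketched in a few lines of Python. This is not PCRE's code, only an illustration of the technique: lookbehind_at and its width parameter are hypothetical, and Python's re module stands in for the inner matcher.

```python
import re


def lookbehind_at(text, pos, sub_pattern, width):
    """Check whether sub_pattern matches the `width` characters ending at pos."""
    # Temporarily move the input pointer left by the expected length.
    start = pos - width
    if start < 0:
        return False  # not enough input to the left
    # Ordinary left-to-right match from the shifted offset, not allowed
    # to read past the original position.
    m = re.compile(sub_pattern).match(text, start, pos)
    return m is not None and m.end() == pos


text = "price: $42"
print(lookbehind_at(text, 8, r"\$", 1))  # is position 8 preceded by "$"?
```

The need to know width in advance is the reason an engine built this way wants the lookbehind to have a known length, whereas an engine that can run the match right-to-left does not need it.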