Looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other and used?

Submitted by ぃ、小莉子 on 2019-11-26 17:29:29

Question


I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a program that will go through C/.h source files to extract data declarations and definitions.

I have been looking for examples and can find some info, but I am really struggling to grasp the underlying concepts like grammar rules, parse trees and abstract syntax trees, and how they interrelate. Eventually these concepts need to be stored in an actual program, but 1) what do they look like, and 2) are there common implementations?

I have been looking at Wikipedia on these topics and at programs like Lex and Yacc, but having never taken a compiler class (EE major) I am finding it difficult to fully understand what is going on.


Answer 1:


A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, newlines).

A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.

A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree (AST) representing the program (usually) described by the original text.
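To make the lexer's "extra context" concrete, here is a minimal regex-based lexer sketched in Python. The token names and rules are invented for this example; a real C lexer would also treat keywords like `int` specially and handle comments, string literals, and many more operators:

```python
import re

# Toy lexer sketch: each rule pairs a token type with a regex.
# (Token names and the rule set are illustrative, not from a real compiler.)
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),            # integer literal
    ("ID",     r"[A-Za-z_]\w*"),   # identifier (a real C lexer would split out keywords)
    ("EQ",     r"=="),             # equality operator (must come before ASSIGN)
    ("ASSIGN", r"="),              # assignment operator
    ("SEMI",   r";"),              # statement terminator
    ("SKIP",   r"[ \t\n]+"),       # whitespace: recognised but discarded
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Yield (type, value) pairs -- the 'extra context' a lexer attaches to tokens."""
    for m in MASTER_RE.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(lex("int x = 1;")))
# [('ID', 'int'), ('ID', 'x'), ('ASSIGN', '='), ('NUMBER', '1'), ('SEMI', ';')]
```

Note how ordering matters: `EQ` is tried before `ASSIGN` so that `==` is not lexed as two `=` tokens.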

Last I checked, the best book on the subject was "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".




Answer 2:


Example:

int x = 1;

A lexer or tokeniser will split that up into tokens 'int', 'x', '=', '1', ';'.

A parser will take those tokens and use them to understand the input in some way:

  • we have a statement
  • it's a definition of an integer
  • the integer is called 'x'
  • 'x' should be initialised with the value 1
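The steps above can be sketched in code. The following hypothetical Python fragment consumes the five tokens and builds a small dictionary "AST node" for the declaration; the node structure and field names are invented for illustration, not taken from any real parser:

```python
# Toy parser sketch for one statement shape: "<type> <name> = <value> ;".
# A real parser would be driven by a grammar and handle many statement forms.
def parse_declaration(tokens):
    """Turn the token list for a simple declaration into an AST-like node."""
    type_tok, name_tok, assign, value_tok, semi = list(tokens)  # expect exactly 5 tokens
    assert assign == "=" and semi == ";", "expected '<type> <name> = <value>;'"
    return {
        "kind": "declaration",   # we have a statement: a definition
        "type": type_tok,        # ... of an integer (here, 'int')
        "name": name_tok,        # the integer is called 'x'
        "init": int(value_tok),  # 'x' should be initialised with the value 1
    }

ast = parse_declaration(["int", "x", "=", "1", ";"])
print(ast)
# {'kind': 'declaration', 'type': 'int', 'name': 'x', 'init': 1}
```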



Answer 3:


I would say that a lexer and a tokenizer are basically the same thing, and that they smash the text up into its component parts (the 'tokens'). The parser then interprets the tokens using a grammar.
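To make the grammar idea concrete: a single rule such as "declaration: TYPE ID = NUMBER ;" can be checked directly against a sequence of token types. This is a toy Python sketch, not real Lex/Yacc notation, and the rule and token names are made up for the example:

```python
# One toy grammar rule, expressed as the sequence of token types it accepts.
# (Real grammars have many rules and allow recursion/alternatives.)
GRAMMAR = {"declaration": ["TYPE", "ID", "ASSIGN", "NUMBER", "SEMI"]}

def matches(rule, token_types):
    """True if the token-type sequence fits the named grammar rule exactly."""
    return GRAMMAR[rule] == list(token_types)

print(matches("declaration", ["TYPE", "ID", "ASSIGN", "NUMBER", "SEMI"]))  # True
print(matches("declaration", ["ID", "ASSIGN", "NUMBER", "SEMI"]))          # False
```

The key point is that the parser works over token *types* supplied by the lexer, not over raw characters.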

I wouldn't get too hung up on precise terminological usage though - people often use 'parsing' to describe any action of interpreting a lump of text.




Answer 4:


(adding to the given answers)

  • The tokenizer/lexer will typically also discard comments, passing only meaningful tokens onward.
  • The lexer also classifies each token (identifier, keyword, operator, literal); working out what a name actually refers to (its scope) is usually left to later semantic analysis.
  • The parser then builds the code/program structure (the parse tree or AST).


Source: https://stackoverflow.com/questions/380455/looking-for-a-clear-definition-of-what-a-tokenizer-parser-and-lexers-are
