Lexer/Tokenizer
What does this do?
The lexer or tokenizer (hereafter referred to as “the lexer”) reads the given content, usually a file, character by character without backtracking. It detects spaces, separators, and other key characters, and uses them to split the content into tokens. The tokens are stored in an array (list), each with its token type, value, and position(s). The parser takes these tokens next; for more, see the Parser page or the `Syntax page.`_
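The sketch below illustrates this single-pass, character-by-character approach in Python. The token names (WORD, FUNC, STRING_DEF, SPACE, ENDLINE) and the (line, column) position convention are taken from the example in the next section, but the sketch is only an illustration of the technique, not the project's actual lexer.

```python
def lex(source):
    """Scan `source` left to right, without backtracking, into a token list."""
    # Illustrative single-character token types; the real lexer's rules may differ.
    single = {" ": "SPACE", ">": "FUNC", '"': "STRING_DEF", ";": "ENDLINE"}
    tokens = []                      # each token: (type, value, (line, column))
    line, col = 1, 1
    i = 0
    while i < len(source):
        ch = source[i]
        start = (line, col)
        if ch in single:
            tokens.append((single[ch], ch, start))
            i += 1
            col += 1
        elif ch == "\n":
            # Newlines advance the line counter and reset the column.
            i += 1
            line += 1
            col = 1
        else:
            # Consume a run of non-separator characters as a single WORD.
            j = i
            while j < len(source) and source[j] not in ' >";\n':
                j += 1
            tokens.append(("WORD", source[i:j], start))
            col += j - i
            i = j
    return tokens
```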
Input & Output
The input is the actual code: plain text in an ASCII or Unicode encoding. The output is a list of tokens, which is usually displayed as a table in human-friendly formats. Each token is itself a small list, or rather a tuple, containing the token name, the token value, and the position. The following is an example of an input and output for the lexer.
Input:
print>"Somewhere over the rainbow!";
Output:
Token ID | Token Value | Position |
---|---|---|
WORD | print | 1, 1 |
FUNC | > | 1, 6 |
STRING_DEF | " | 1, 7 |
WORD | Somewhere | 1, 8 |
SPACE | | 1, 16 |
WORD | over | 1, 17 |
SPACE | | 1, 21 |
WORD | the | 1, 22 |
SPACE | | 1, 25 |
WORD | rainbow! | 1, 26 |
END_STRING_DEF | " | 1, 34 |
ENDLINE | ; | 1, 35 |
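As a rough illustration, the sketch shown earlier could be run on the same input to print a similar table. The exact token names (for example, END_STRING_DEF for the closing quote) and the exact position values depend on the real lexer's rules, so the output will not match the table above character for character.

```python
for kind, value, (line, col) in lex('print>"Somewhere over the rainbow!";'):
    print(f"{kind:<16}| {value:<12}| {line}, {col}")
```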