Lexer/Tokenizer

What does this do?

The lexer or tokenizer (hereafter referred to as "the lexer") reads the given content, usually a file, character by character in a single pass, without backtracking. It watches for spaces, separators, and other key characters to split the content into tokens. The tokens are collected in an array (list), each one carrying the token type, the token value, and the position(s). The parser consumes these tokens next. For more, see the Parser page or the `Syntax page`_.
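As a rough illustration of that single forward pass, here is a skeleton in Python. The token names and classification rules here are placeholder assumptions, not the lexer's actual rules; a version tailored to the example appears after the output table below::

    # Skeleton of a single forward pass: look at the current character,
    # decide what kind of token starts there, consume it, and record the
    # (line, column) where it began. No character is re-read, so there is
    # no backtracking.
    def tokenize(text):
        tokens = []                      # entries: (type, value, (line, column))
        line, col, i = 1, 1, 0
        while i < len(text):
            start = (line, col)
            if text[i] == "\n":          # track positions across lines
                line, col = line + 1, 1
                i += 1
            elif text[i] == " ":
                tokens.append(("SPACE", " ", start))
                col += 1
                i += 1
            else:                        # consume a run of other characters
                j = i
                while j < len(text) and text[j] not in " \n":
                    j += 1
                tokens.append(("WORD", text[i:j], start))
                col += j - i
                i = j
        return tokens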

Input & Output

The input is the source code itself: plain text in an ASCII or Unicode encoding. The output is a list of tokens, usually rendered as a table in human-friendly formats. Each token is itself a small list, or rather a tuple, containing the token name, the token value, and the position. The following is an example of an input and its output from the lexer.
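In Python terms, using the token names from the example below, the output is a list of (token name, token value, (line, column)) tuples::

    # First few tokens for the example input below; each token is a tuple
    # of (token name, token value, (line, column)).
    tokens = [
        ("WORD", "print", (1, 1)),
        ("FUNC", ">", (1, 6)),
        ("STRING_DEF", '"', (1, 7)),
        ("WORD", "Somewhere", (1, 8)),
        # ... and so on through ("ENDLINE", ";", (1, 36))
    ]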

Input:

print>"Somewhere over the rainbow!";

Output:

Token ID        Token Value  Position (line, column)
WORD            print        1, 1
FUNC            >            1, 6
STRING_DEF      "            1, 7
WORD            Somewhere    1, 8
SPACE           (space)      1, 17
WORD            over         1, 18
SPACE           (space)      1, 22
WORD            the          1, 23
SPACE           (space)      1, 26
WORD            rainbow!     1, 27
END_STRING_DEF  "            1, 35
ENDLINE         ;            1, 36
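To tie the pieces together, here is a minimal sketch of a lexer that reproduces the table above. The token names come from the table; the handling of `>`, the double quotes, and `;` is inferred from this single example, so treat it as an illustration rather than the definitive rules (newline handling is also left out for brevity)::

    # Minimal sketch: one forward pass over the example statement,
    # emitting (token id, value, (line, column)) tuples as in the table.
    def lex(source):
        tokens = []
        line, col, i = 1, 1, 0
        in_string = False
        while i < len(source):
            ch = source[i]
            pos = (line, col)
            if ch == '"':
                # the first quote opens a string, the next one closes it
                kind = "END_STRING_DEF" if in_string else "STRING_DEF"
                tokens.append((kind, ch, pos))
                in_string = not in_string
            elif ch == " ":
                tokens.append(("SPACE", ch, pos))
            elif ch == ">":
                tokens.append(("FUNC", ch, pos))
            elif ch == ";":
                tokens.append(("ENDLINE", ch, pos))
            else:
                j = i                    # consume a run of word characters
                while j < len(source) and source[j] not in ' >";':
                    j += 1
                tokens.append(("WORD", source[i:j], pos))
                col += j - i
                i = j
                continue
            col += 1
            i += 1
        return tokens

    for token in lex('print>"Somewhere over the rainbow!";'):
        print(token)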