Introduction
What is a lexer?
To create a lexer, one must first understand what a lexer even is. In computer science, a lexer is the component that performs lexical analysis: converting a sequence of characters into objects called tokens. Tokens give the analyzed characters additional meaning.
In this three-part series of tutorials we’ll create a basic mathematical expression evaluator with a fully functioning lexer, parser, and interpreter.
What are tokens?
Let’s say I have the following expression:
3 + 7.2
If I were to pass this through a lexer, the following tokens would be generated:
{ 'NUMBER', 3 }
{ 'PLUS' }
{ 'NUMBER', 7.2 }
So each token has a string identifying its type and an optional second element representing a value. Operators such as +, -, *, and / don’t have values, but numbers such as 1.4, 92, and 36 do. Tokens are useful because they give meaning to these different characters. A computer has no way of knowing that + means addition, but through a lexer and tokens we can provide that meaning.
Building the Lexer
Lexer class
Let’s get started on creating the actual lexer now that we understand what it does and how it works. Create a module dedicated to your lexer. We’ll use Lua metatables to simulate OOP and create a Lexer “class.”
local Lexer = {}

function Lexer.new(text: string)
    local self = setmetatable({}, {__index = Lexer})
    -- Strip the string of any newline characters & create a position variable
    self.text = string.gsub(text, '\n', '')
    self.position = 0
    -- We'll define this method in a second
    self:next()
    return self
end
The position variable in our lexer represents the current character position as we tokenize the passed text. The :next() method will be used to advance the lexer to the next character in the string.
function Lexer:next()
    -- Prevent the lexer from going over the initial text's length
    if (self.position == string.len(self.text)) then
        self.character = nil
        return
    end
    -- Increase the position variable & set the current character to that position
    self.position += 1
    self.character = string.sub(self.text, self.position, self.position)
end
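To illustrate, here’s how the lexer steps through a short input. This is just a quick sanity check you could run inside the module itself:
local lexer = Lexer.new('3+7')
print(lexer.character) --> 3
lexer:next()
print(lexer.character) --> +
lexer:next()
print(lexer.character) --> 7
lexer:next()
print(lexer.character) --> nil (end of input)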
Great, we now have a way to traverse our string of text. Let’s start tokenizing it: we’ll loop through the string until we reach the end, generating a token for each character until we have a list of tokens.
Preparing to tokenize
Let’s create the :tokenize() method for our lexer.
function Lexer:tokenize()
    local tokens = {}
    -- Loop until the end of the string is reached
    while (self.character ~= nil) do
        -- We'll define this method in a second
        local result = self:resolve(self.character)
        -- If we get a valid result, insert it into our final token table
        if (result) then
            table.insert(tokens, result)
        end
    end
    -- Add the table to our lexer object
    self.tokens = tokens
    return self
end
With our loop created, we now need to start resolving each character. Note that :resolve() must always advance the lexer in every branch, otherwise the while loop above would never terminate. Before we get to it, though, we should define some utility functions.
Utility functions
Our first function will check whether a character is whitespace. We want to ignore whitespace and only focus on meaningful characters when tokenizing.
local WHITESPACES = {'', ' ', '\n', '\t'}

local function isWhitespace(char: string)
    return table.find(WHITESPACES, char) and true or false
end
Next we need a function to check if a character is a digit (0 - 9).
local DIGITS = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}

local function isDigit(char: string)
    return table.find(DIGITS, tostring(char)) and true or false
end
Next we’ll make a function to check if a character is part of a number stream, meaning it’s either a digit or a decimal point.
local function isNumberStream(char: string)
    return (char ~= nil and (char == '.' or isDigit(char)))
end
Then we’ll make a function to determine if a table of characters has a decimal point at the front or the end. If so, this function will prepend or append a 0. For example, .01 turns into 0.01 and 4. turns into 4.0.
local function resolveIndependentDecimal(array: {string})
    -- Leading decimal point: prepend a 0 (e.g. '.01' -> '0.01')
    if (array[1] == '.') then
        table.insert(array, 1, '0')
    end
    -- Trailing decimal point: append a 0 (e.g. '4.' -> '4.0')
    if (array[#array] == '.') then
        table.insert(array, '0')
    end
    return array
end
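A quick usage example of what this does (these calls would live in the same module, since the function is local):
print(table.concat(resolveIndependentDecimal({'.', '0', '1'}))) --> 0.01
print(table.concat(resolveIndependentDecimal({'4', '.'}))) --> 4.0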
Creating tokens
With our utility functions done, we’re nearly ready to convert characters into tokens. But we haven’t actually created any tokens yet. For the sake of simplicity, our tokens in this tutorial will be a table containing a string and an optional value, as shown in the introduction. We’ll still define a function to create this table, though, just so we know where we’re actually using tokens versus plain tables.
local function Token(name: string, value: number?)
    return { name, value }
end
Tokenizing
Finally we can convert characters into tokens!
function Lexer:resolve(char: string)
    if (isWhitespace(char)) then -- Ignore whitespace
        self:next()
    elseif (char == '.' or isDigit(char)) then -- Numbers
        return self:resolveNumber() -- We'll define this next
    elseif (char == '+') then -- Plus (for addition)
        self:next()
        return Token('PLUS')
    elseif (char == '-') then -- Minus (for subtraction)
        self:next()
        return Token('MINUS')
    elseif (char == '*') then -- Asterisk (for multiplication)
        self:next()
        return Token('ASTERISK')
    elseif (char == '/') then -- Slash (for division)
        self:next()
        return Token('SLASH')
    elseif (char == '(') then -- Left parenthesis
        self:next()
        return Token('LPAREN')
    elseif (char == ')') then -- Right parenthesis
        self:next()
        return Token('RPAREN')
    else
        error('Unknown character: ' .. char)
    end
end
You’ll notice that when a character is a digit or decimal point we call a nonexistent method, :resolveNumber(). Let’s write out this function now. It will be used to create a number from multiple characters.
function Lexer:resolveNumber()
    local array = { self.character } -- Represents a number being built
    -- Keep track of how many decimal points have been found
    -- (the first character may itself be one, e.g. '.8')
    local decimals = (self.character == '.') and 1 or 0
    -- Move on to the next character
    self:next()
    -- While there is a valid digit or decimal point
    while (isNumberStream(self.character)) do
        -- Increase the decimal count
        if (self.character == '.') then
            decimals += 1
            -- Numbers don't have more than 1 decimal point
            if (decimals > 1) then
                break
            end
        end
        -- Add the character to our array
        table.insert(array, self.character)
        -- Move on to the next character
        self:next()
    end
    -- If there's a decimal, check for a leading or trailing decimal point and fix it
    if (decimals > 0) then
        array = resolveIndependentDecimal(array)
    end
    -- Finally, return our completed number as a token
    return Token('NUMBER', tonumber(table.concat(array)))
end
Believe it or not, that’s our completed lexer! Let’s move on to testing it.
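One last thing: if you’re following along in a ModuleScript, remember to return the Lexer table at the bottom of the module so other scripts can require it:
return Lexer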
Testing
Here I have a script requiring our lexer module to test it out.
local Lexer = require(script.Lexer)

-- Create our lexer with the expression .8 + 2.7
local lexer = Lexer.new('.8 + 2.7')
-- Tokenize the string
lexer:tokenize()
-- Let's see what we get!
for i, v in pairs(lexer.tokens) do
    print(i, '|', v[1], '(', v[2], ')') -- Remember our tokens are just arrays
end
If you did everything correctly, your output should look something like this (print separates its arguments with whitespace):
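1 | NUMBER ( 0.8 )
2 | PLUS ( nil )
3 | NUMBER ( 2.7 )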
Whitespace was ignored, and the numbers were successfully built into tokens, as was the plus sign. If you try to tokenize an invalid character, you’ll receive an unknown character error.
Conclusion
This is a very basic lexer, and there’s much more functionality you could add. If you’re interested in learning about lexical analysis beyond math evaluation, there are plenty of resources online that cover the topic in depth.
In the next part of this series we’ll create a parser to convert our tokens into postfix notation for easy evaluation. That will be explained further then, but for now I hope you found this tutorial useful, or at least interesting.