A custom Lexer/String tokenizer!

Yo! I’ve created a super fast tokenizer that takes in custom token rules and supports type and value transforms, similar to moo.js (Moo).

Code Examples
I should also mention that I used a lot of code practices from moo and from Boatbomber’s lexer: Boatbomber’s Lexer

Start by requiring the module; you can find it (Here).

local l = require(path.to.module) 

Next, we will create an array of token rules for our lexer to compile. It is very important that this is an array rather than a dictionary: the order of the rules determines their priority, and an unordered table would scramble your token matching.

local rules = {
	{token = "WS", match = "[ \t]+"};
	{token = "Comment", match = "//.*\n?"};
	{token = "String", match = "(['\"])[^\n]*%1", type = function(inp)
		if inp == "\"\"" or inp == "''" then return "Empty-String" end
	end};
	{token = "INC-String", match = "(['\"])[^\n]*"}; -- incomplete string (no closing quote)

	{token = "Iden", match = "[a-zA-Z_][a-zA-Z%d_]*", type = function(inp)
		local keywords = {"while", "var", "for", "if", "local"}
		local builtins = {"print", "string"}
		if table.find(keywords, inp) then
			return "Keywords"
		elseif table.find(builtins, inp) then
			return "Global"
		end
	end};
	{token = "Number", match = "%d+%.?%d*", type = function(inp)
		if string.find(inp, "%.") then
			return "Float"
		end
		return "Number"
	end};
	{token = "Operators", match = "[:;<>/~%*%(%)\\%-=,{}%.#^%+%%]"};
	{token = "newline", match = "\n"};
	{token = "exception", match = ".+", error = true} -- catch-all, keep this rule last
}

Each rule has a token (the name it will be labeled with) and a match, along with optional properties such as type, value, and error.

I’ll cover the other properties in the API section.
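As a taste of the value property (covered in the API section below), a rule can also carry a value transform. Here’s a hypothetical Number rule that converts the matched text into an actual number:

{token = "Number", match = "%d+%.?%d*", value = function(inp)
	-- Hypothetical value transform: token.value becomes the number 30
	-- instead of the string "30", while token.raw keeps the original text
	return tonumber(inp)
end};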

Next up, we compile these tokens, which will return an object with functions such as peek, next and reset.

local lexer = l.compile(rules)

Now we can use reset to set our string (this step is required):

lexer.reset("This is a test 'string'")

Finally, we can run this code to print out our tokens:

--Uncomment this to test speeds
--local start = os.clock()

for token in lexer.next do
    print(token)
end
--print(string.format("Lexer took %.2f ms to run", (os.clock()-start)*1000))

Which should produce the following output:

{
	type = "Iden",
	value = "This",
	raw = "This",
	lineBreaks = 0,
	col = 0,
	offset = 1,
	line = 1
} ... --And so on, for each token

After a few tests, I’ve seen the module take at most 8 milliseconds on a 60-line code sample, though the average was around 4-5 ms.

API

Module → Methods:
module.compile(tokens) → Returns a lexer

Lexer →

function lexerClass.next()

Advances to the next token in the string buffer, eating it

Returns:
token Token Object
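
For example, since next returns nil once the buffer is exhausted (that’s what makes the for-in loop from the walkthrough terminate), you can also drive it manually:

local token = lexer.next()
while token do
	print(token.type, token.value)
	token = lexer.next()
end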

function lexerClass.peek(n: number?)

Peeks at the n-th token away from current token index in the string buffer, without eating it

Parameters:

  • [n:number]
    Number of places away to peek at; if nil, peeks at the current token

Returns:
token Token Object
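
For instance, a sketch of lookahead, using the rules and string from the walkthrough:

lexer.reset("This is a test 'string'")
lexer.next() -- eat the "This" token
lexer.next() -- eat the whitespace token after it

local current = lexer.peek()  -- no argument: peeks at the current token
local ahead = lexer.peek(2)   -- two tokens ahead; nothing is eaten
local behind = lexer.peek(-1) -- negative n peeks at previous tokens
print(current.type, ahead.type, behind.type)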

function lexerClass.reset(str)

Resets the lexer’s buffer and internal indexes

Parameters:

  • str
    Replacement buffer; uses the current buffer if nil
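
Since a nil argument reuses the current buffer, reset can also be used to re-lex the same string from the start:

lexer.reset("var x = 30") -- set a new buffer
for token in lexer.next do end -- consume everything

lexer.reset() -- nil: rewinds the internal indexes, keeps the same buffer
print(lexer.next().value) --> "var" again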

Token → Properties:
Type → The type of the token, specified when the rules are compiled (Variant; dependent on type transform)
Value → The value of the token, after any value transform (Variant; dependent on value transform); see the example below
Raw → The matched text of the token, unaffected by value transforms (String)
Line → Current line (Num)
Col → Current column (Num)
Offset → Current index (think of it as the column unaffected by line breaks) (Num)
lineBreaks → Number of line breaks in the token’s match (Num)
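
For example, value and raw only differ when a value transform is involved. Assuming the hypothetical tonumber rule sketched earlier replaced the Number rule before compiling:

lexer.reset("30")
local token = lexer.next()
print(token.value + 1) --> 31; value was converted to a number
print(token.raw)       --> "30"; raw keeps the original text
print(token.line, token.col, token.offset) --> 1, 0, 1 for the first token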

Rules → Properties:
Token → The type that the token will be labeled as
Match → The Lua pattern/literal that the lexer uses to find the token
Value → A function that is run on the token’s match, useful for converting values, for example “30” → 30. Input to the function: the token’s matched value
Type → A function that is run on the token’s match, useful for defining keywords (as in the Iden rule above). Input to the function: the token’s matched value
Error → Boolean; the rule with this set to true is returned when none of the other rules match. Put this rule last in the array (see the sketch after this list)
Extra Notes: Passing a negative number to the peek function will return previous tokens.
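
Putting the error property together with the position properties, a consumer can surface bad input like this (“@” matches none of the other rules from the walkthrough, so it falls through to the exception rule):

lexer.reset("var x = @")

for token in lexer.next do
	if token.type == "exception" then
		warn(string.format("Unexpected %q at line %d, column %d",
			token.raw, token.line, token.col))
	end
end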

Benchmarks: Coming soon, although a few tests show under a millisecond for short strings and under 7-8 milliseconds for longer ones.


A few tweaks were made: I noticed a few bugs in the peek function, which should all be fixed up now. The module is updated; let me know if you find any more bugs.
