Yo! I’ve created a super fast tokenizer that can take in custom tokens and supports type and value transforms, similar to moo.js (Moo).
Code Examples
I should also mention that I borrowed a lot of code practices from moo's and Boatbomber's lexers. Boatbomber's Lexer
Start by requiring the module; you can find it (Here).
local l = require(path.to.module)
Next, we will create an array of rules for our lexer to compile. It is very important that this is an array so that priority is preserved; otherwise, your token order will get all messed up.
local rules = {
	{token = "WS", match = "[ \t]+"};
	{token = "Comment", match = "//.*\n?"};
	{token = "String", match = "(['\"])[^\n]*%1", type = function(inp)
		if inp == "\"\"" or inp == "''" then return "Empty-String" end
	end};
	{token = "INC-String", match = "(['\"])[^\n]*"};
	{token = "Iden", match = "[a-zA-Z_][a-zA-Z%d_]*", type = function(inp)
		local keywords = {"while", "var", "for", "if", "local"}
		local builtins = {"print", "string"}
		if table.find(keywords, inp) then return "Keywords" elseif table.find(builtins, inp) then return "Global" end
	end};
	{token = "Number", match = "%d+%.?%d*", type = function(inp)
		if string.find(inp, "%.") then
			return "Float"
		end
		return "Number"
	end};
	{token = "Operators", match = "[:;<>/~%*%(%)\\%-=,{}%.#^%+%%]"};
	{token = "newline", match = "\n"};
	{token = "exception", match = ".+", error = true} --fallback rule, kept last
}
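The rules above only use type transforms, but the module also supports value transforms (covered under Rules → Properties below). As a rough sketch, assuming the rule key is lowercase value like the other keys, the Number rule could convert its match into an actual Lua number:
local rulesWithValues = {
	--Sketch: a value transform on the Number rule, so token.value holds a
	--real number while token.raw keeps the matched text ("30" -> 30)
	{token = "Number", match = "%d+%.?%d*", value = function(inp)
		return tonumber(inp)
	end};
}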
Each rule has a match and a token, along with optional properties such as type, value, and error.
I'll cover these properties in more detail in the API section.
Next up, we compile these rules, which will return an object with functions such as peek, next, and reset.
local lexer = l.compile(rules)
Now we can use reset to set our string (this part is also important):
lexer.reset("This is a test 'string'")
Finally, we can run this code to print out our tokens:
--Uncomment this to test speeds
--local start = os.clock()
for token in lexer.next do
	print(token)
end
--print(string.format("Lexer took %.2f ms to run", (os.clock()-start)*1000))
Which should produce the following output:
{
	type = "Iden",
	value = "This",
	raw = "This",
	lineBreaks = 0,
	col = 0,
	offset = 1,
	line = 1
} ... --And so on, for each token
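Since reset replaces the buffer and rewinds the internal indexes, you can compile the rules once and reuse the same lexer for several strings. A quick sketch (the sources table is just for illustration):
--Reuse one compiled lexer across multiple inputs
local sources = {"local x = 1", "print('hello')"}
for _, src in ipairs(sources) do
	lexer.reset(src) --swap in the new buffer and rewind
	for token in lexer.next do
		print(token.type, token.value)
	end
end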
After a few tests, I've seen the module take at most 8 milliseconds on a 60-line code sample, though the average was around 4-5 ms.
API
Module → Methods:
module.compile(rules) → Returns a lexer
Lexer →
function lexerClass.next()
Advances to the next token in the string buffer, eating it.
Returns:
token (Token Object)

function lexerClass.peek(n: number?)
Peeks at the n-th token away from the current token index in the string buffer, without eating it.
Parameters:
n: number? (Number of places away to peek at; if nil, peeks at the current token)
Returns:
token (Token Object)

function lexerClass.reset(str)
Resets the lexer buffer and internal indexes.
Parameters:
str (Replacement buffer; uses the current buffer if nil)
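A quick sketch of how these three methods fit together, using the lexer compiled in the examples above:
lexer.reset("local x = 30")

local first = lexer.next()  --eats the first token
local ahead = lexer.peek(1) --looks one token ahead without eating it
print(first.type, first.value, ahead.type)

lexer.reset()               --no argument: keep the current buffer, rewind the indexes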
Token → Properties:
Type → The type of the token, specified when the rules are compiled (Variant; dependent on type transform)
Value → The value of the token, affected by value transforms (Variant; dependent on value transform)
Raw → The matched text of the token, not affected by value transforms (String)
Line → Current Line (Num)
Col → Current Column (Num)
Offset → Current Index (think of it as the column not affected by line breaks) (Num)
lineBreaks → Number of line breaks in the token's match (Num)
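These properties make position-aware error reporting easy. For example, combined with the exception rule from the rules array above, a sketch like this surfaces where the lexer gave up (the input string is just an example):
lexer.reset("local x = @")
for token in lexer.next do
	if token.type == "exception" then
		error(string.format("Unexpected text %q at line %d, column %d",
			token.raw, token.line, token.col))
	end
end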
Rules → Properties:
Token → The type that the token will be labeled as
Match → The Lua pattern or literal that the lexer uses to match the token
Value → A function that is run on the token's match, useful for converting values, for example “30” → 30. Input to the function: Token Value
Type → A function that is run on the token's match, useful for defining keywords. Input to the function: Token Value (the matched text, as in the Iden and String rules above)
Error → Boolean; the rule with this set to true is returned when no other rule matches. Put this rule last in the array.
Extra Notes: Passing a negative number to the peek function will return previous tokens.
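For example, after eating a couple of tokens, something like this should hand back an earlier token:
lexer.reset("This is a test")
lexer.next()
lexer.next()
local previous = lexer.peek(-1) --a token that was already eaten
print(previous.type, previous.value)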
Benchmarks: Coming soon, although a few tests show less than a millisecond for short strings, and under 7-8 milliseconds for longer ones.