As a note, I’m still testing the table.create function. I suspect table.pack might be faster, but I haven’t benchmarked it yet.
I’ve always complained that Lua’s ‘regex’ was bad.
Then someone made a lexer with it. Nice job…
But it seems like it doesn’t have the facilities for code generation.
Oops! There was a bug with incomplete single-line strings that could cause it to error.
I fixed it, and edited my above reply with my current (fixed) version as of 12/12/19.
So, how do you use the lexer?
What does it return, a function?
Use os.clock; it’s faster and more precise.
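For example, a quick micro-benchmark sketch (the loop body and iteration count are placeholders):
local start = os.clock()
for _ = 1, 1000 do
-- code being measured goes here
end
print(string.format("%.0f μs", (os.clock() - start) * 1E6))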
This new version is not fully backwards compatible.
(Which is why it’s a new reply and not just another edit to the one above)
Previous versions had a space token type. Since its source content was always invisible text, I felt it was pretty wasteful to return tons of effectively blank tokens.
This new version no longer has a space token; instead, all token types can have leading/trailing whitespace in them. This cuts down the total token count by a massive amount, averaging around 30%-50% fewer tokens.
Simple example of what this means:
local source = "for i = 1, n do end"
for token,src in lexer.scan(source) do
print(token, "'"..src.."'")
end
Old Output: (14 Token returns)
keyword 'for'
space ' '
iden 'i'
space ' '
operator '='
space ' '
number '1'
operator ','
space ' '
iden 'n'
space ' '
keyword 'do'
space ' '
keyword 'end'
New Output: (8 Token returns)
keyword 'for '
iden 'i '
operator '= '
number '1'
operator ', '
iden 'n '
keyword 'do '
keyword 'end'
Source Code: ***Last updated 8/16/2020***
--[=[
Lexical scanner for creating a sequence of tokens from Lua source code.
This is a heavily modified and Roblox-optimized version of
the original Penlight Lexer module:
https://github.com/stevedonovan/Penlight
Authors:
stevedonovan <https://github.com/stevedonovan> ----------------- Original Penlight lexer author
ryanjmulder <https://github.com/ryanjmulder> ----------------- Penlight lexer contributor
mpeterv <https://github.com/mpeterv> ----------------- Penlight lexer contributor
Tieske <https://github.com/Tieske> ----------------- Penlight lexer contributor
boatbomber <https://github.com/boatbomber> ----------------- Roblox port, added builtin token, added patterns for incomplete syntax, bug fixes, behavior changes, token optimization
Sleitnick <https://github.com/Sleitnick> ----------------- Roblox optimizations
howmanysmall <https://github.com/howmanysmall> ----------------- Lua + Roblox optimizations
Usage:
local source = "for i = 1, n do end"
-- The 'scan' function returns a token iterator:
for token,src in lexer.scan(source) do
print(token, "'"..src.."'")
end
-> keyword 'for '
-> iden 'i '
-> operator '= '
-> number '1'
-> operator ', '
-> iden 'n '
-> keyword 'do '
-> keyword 'end'
List of possible tokens:
- iden
- keyword
- builtin
- string
- number
- comment
- operator
--]=]
local lexer = {}
local ipairs = ipairs
local Prefix,Suffix,Cleaner = "^[ \t\n\0\a\b\v\f\r]*", "[ \t\n\0\a\b\v\f\r]*", "[ \t\n\0\a\b\v\f\r]+"
local NUMBER_A = "0x[%da-fA-F]+"
local NUMBER_B = "%d+%.?%d*[eE][%+%-]?%d+"
local NUMBER_C = "%d+[%._]?[%d_eE]*"
local OPERATORS = "[:;<>/~%*%(%)%-={},%.#%^%+%%]+"
local BRACKETS = "[%[%]]+" -- needs to be separate pattern from other operators or it'll mess up multiline strings
local IDEN = "[%a_][%w_]*"
local STRING_EMPTY = "(['\"])%1" --Empty String
local STRING_PLAIN = [=[(['"])[%w%p \t\v\b\f\r\a]-([^%\]%1)]=] --TODO: Handle escaping escapes
local STRING_INCOMP_A = "(['\"]).-\n" --Incompleted String with next line
local STRING_INCOMP_B = "(['\"])[^\n]*" --Incompleted String without next line
local STRING_MULTI = "%[(=*)%[.-%]%1%]" --Multiline-String
local STRING_MULTI_INCOMP = "%[=*%[.-.*" --Incompleted Multiline-String
local COMMENT_MULTI = "%-%-%[(=*)%[.-%]%1%]" --Completed Multiline-Comment
local COMMENT_MULTI_INCOMP = "%-%-%[=*%[.-.*" --Incompleted Multiline-Comment
local COMMENT_PLAIN = "%-%-.-\n" --Completed Singleline-Comment
local COMMENT_INCOMP = "%-%-.*" --Incompleted Singleline-Comment
local TABLE_EMPTY = {}
local lua_keyword = {
["and"] = true, ["break"] = true, ["do"] = true, ["else"] = true, ["elseif"] = true,
["end"] = true, ["false"] = true, ["for"] = true, ["function"] = true, ["if"] = true,
["in"] = true, ["local"] = true, ["nil"] = true, ["not"] = true, ["while"] = true,
["or"] = true, ["repeat"] = true, ["return"] = true, ["then"] = true, ["true"] = true,
["self"] = true, ["until"] = true,
["continue"] = true,
["plugin"] = true, --Highlights as a keyword instead of a builtin cuz Roblox is weird
}
local lua_builtin = {
-- Lua Functions
["assert"] = true;["collectgarbage"] = true;["error"] = true;["getfenv"] = true;
["getmetatable"] = true;["ipairs"] = true;["loadstring"] = true;["newproxy"] = true;
["next"] = true;["pairs"] = true;["pcall"] = true;["print"] = true;["rawequal"] = true;
["rawget"] = true;["rawset"] = true;["select"] = true;["setfenv"] = true;["setmetatable"] = true;
["tonumber"] = true;["tostring"] = true;["type"] = true;["unpack"] = true;["xpcall"] = true;
-- Lua Variables
["_G"] = true;["_VERSION"] = true;
-- Lua Tables
["bit32"] = true;["coroutine"] = true;["debug"] = true;
["math"] = true;["os"] = true;["string"] = true;
["table"] = true;["utf8"] = true;
-- Roblox Functions
["delay"] = true;["elapsedTime"] = true;["gcinfo"] = true;["require"] = true;
["settings"] = true;["spawn"] = true;["tick"] = true;["time"] = true;["typeof"] = true;
["UserSettings"] = true;["wait"] = true;["warn"] = true;["ypcall"] = true;
-- Roblox Variables
["Enum"] = true;["game"] = true;["shared"] = true;["script"] = true;
["workspace"] = true;
-- Roblox Tables
["Axes"] = true;["BrickColor"] = true;["CellId"] = true;["CFrame"] = true;["Color3"] = true;
["ColorSequence"] = true;["ColorSequenceKeypoint"] = true;["DateTime"] = true;
["DockWidgetPluginGuiInfo"] = true;["Faces"] = true;["Instance"] = true;["NumberRange"] = true;
["NumberSequence"] = true;["NumberSequenceKeypoint"] = true;["PathWaypoint"] = true;
["PhysicalProperties"] = true;["PluginDrag"] = true;["Random"] = true;["Ray"] = true;["Rect"] = true;
["Region3"] = true;["Region3int16"] = true;["TweenInfo"] = true;["UDim"] = true;["UDim2"] = true;
["Vector2"] = true;["Vector2int16"] = true;["Vector3"] = true;["Vector3int16"] = true;
}
local function idump(tok)
--print("tok unknown:",tok)
return coroutine.yield("iden", tok)
end
local function odump(tok)
return coroutine.yield("operator", tok)
end
local function ndump(tok)
return coroutine.yield("number", tok)
end
local function sdump(tok)
return coroutine.yield("string", tok)
end
local function cdump(tok)
return coroutine.yield("comment", tok)
end
local function lua_vdump(tok)
-- Since we merge spaces into the tok, we need to remove them
-- in order to check the actual word it contains
local cleanTok = string.gsub(tok,Cleaner,"")
if lua_keyword[cleanTok] then
return coroutine.yield("keyword", tok)
elseif lua_builtin[cleanTok] then
return coroutine.yield("builtin", tok)
else
return coroutine.yield("iden", tok)
end
end
local lua_matches = {
-- Identifiers
{Prefix.. IDEN ..Suffix, lua_vdump},
-- Numbers
{Prefix.. NUMBER_A ..Suffix, ndump},
{Prefix.. NUMBER_B ..Suffix, ndump},
{Prefix.. NUMBER_C ..Suffix, ndump},
-- Strings
{Prefix.. STRING_EMPTY ..Suffix, sdump},
{Prefix.. STRING_PLAIN ..Suffix, sdump},
{Prefix.. STRING_INCOMP_A ..Suffix, sdump},
{Prefix.. STRING_INCOMP_B ..Suffix, sdump},
{Prefix.. STRING_MULTI ..Suffix, sdump},
{Prefix.. STRING_MULTI_INCOMP ..Suffix, sdump},
-- Comments
{Prefix.. COMMENT_MULTI ..Suffix, cdump},
{Prefix.. COMMENT_MULTI_INCOMP ..Suffix, cdump},
{Prefix.. COMMENT_PLAIN ..Suffix, cdump},
{Prefix.. COMMENT_INCOMP ..Suffix, cdump},
-- Operators
{Prefix.. OPERATORS ..Suffix, odump},
{Prefix.. BRACKETS ..Suffix, odump},
-- Unknown
{"^.", idump}
}
--- Create a plain token iterator from a string.
-- @tparam string s a string.
function lexer.scan(s)
local startTime = os.clock()
lexer.finished = false
local function lex(first_arg)
local line_nr = 0
local sz = #s
local idx = 1
-- res is the value used to resume the coroutine.
local function handle_requests(res)
while res do
local tp = type(res)
-- Insert a token list:
if tp == "table" then
res = coroutine.yield("", "")
for _, t in ipairs(res) do
res = coroutine.yield(t[1], t[2])
end
elseif tp == "string" then -- Or search up to some special pattern:
local i1, i2 = string.find(s, res, idx)
if i1 then
idx = i2 + 1
res = coroutine.yield("", string.sub(s, i1, i2))
else
res = coroutine.yield("", "")
idx = sz + 1
end
else
res = coroutine.yield(line_nr, idx)
end
end
end
handle_requests(first_arg)
line_nr = 1
while true do
if idx > sz then
while true do
handle_requests(coroutine.yield())
end
end
for _, m in ipairs(lua_matches) do
local findres = table.create(2)
local i1, i2 = string.find(s, m[1], idx)
findres[1], findres[2] = i1, i2
if i1 then
local tok = string.sub(s, i1, i2)
idx = i2 + 1
lexer.finished = idx > sz
-- if lexer.finished then
-- print(string.format("Lex took %.2f milliseconds", (os.clock()-startTime)*1000 ))
-- end
local res = m[2](tok, findres)
if string.find(tok, "\n") then
-- Update line number:
local _, newlines = string.gsub(tok, "\n", TABLE_EMPTY)
line_nr = line_nr + newlines
end
handle_requests(res)
break
end
end
end
end
return coroutine.wrap(lex)
end
return lexer
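Side note: the iterator returned by lexer.scan also honors the resume-value protocol handled in handle_requests above; resuming it with a value that is neither a string nor a table yields the current line number and index. A minimal sketch, assuming the module above is required as lexer:
local scan = lexer.scan("for i = 1, n do\nend")
print(scan()) -- prints the first token: keyword "for "
print(scan(true)) -- prints the current line number and index: 1 5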
Edit: Check out the true latest on the GitHub repository instead! New features, perf improvements, and a syntax highlighter with RichText!
Sorry for bumping this thread, but I got interested in this and decided to make a lexer myself. When I “tested” it, it was apparently about 5x faster than both @boatbomber’s and the OP’s. In the results below, lexer1 is my lexer, lexer2 is @boatbomber’s, and lexer3 is the OP’s.
Test code:
local s = "for i = 1, n do end";
local a = os.clock();
for _ = 1, 1000 do
for _, token, src in lexer1.lex(s) do
end;
end;
print(string.format("%.0f μs ", (os.clock() - a) * 1E6))
local a = os.clock();
for _ = 1, 1000 do
for token, src in lexer2.scan(s) do
end;
end;
print(string.format("%.0f μs ", (os.clock() - a) * 1E6))
local a = os.clock();
for _ = 1, 1000 do
for token, src in lexer3.scan(s) do
end;
end;
print(string.format("%.0f μs ", (os.clock() - a) * 1E6))
Outputs
2444 μs
13428 μs
16222 μs
Code:
--[=[
A Lua lexical scanner
BSD 2-Clause Licence
Copyright ©, 2020 - Blockzez (devforum.roblox.com/u/Blockzez and github.com/Blockzez)
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
]=]
local lexer = { };
--
local identifiers = {
-- Reserved keywords
['and'] = 'keyword', ['break'] = 'keyword', ['continue'] = 'keyword', ['do'] = 'keyword', ['else'] = 'keyword', ['elseif'] = 'keyword', ['end'] = 'keyword',
['false'] = 'keyword', ['for'] = 'keyword', ['function'] = 'keyword', ['goto'] = 'keyword', ['if'] = 'keyword', ['in'] = 'keyword', ['local'] = 'keyword',
['nil'] = 'keyword', ['not'] = 'keyword', ['or'] = 'keyword', ['repeat'] = 'keyword', ['return'] = 'keyword', ['then'] = 'keyword', ['true'] = 'keyword',
['until'] = 'keyword', ['while'] = 'keyword',
-- Lua globals
assert = 'builtin', collectgarbage = 'builtin', error = 'builtin', getfenv = 'builtin', getmetatable = 'builtin', ipairs = 'builtin', loadstring = 'builtin', next = 'builtin', newproxy = 'builtin',
pairs = 'builtin', pcall = 'builtin', print = 'builtin', rawequal = 'builtin', rawget = 'builtin', rawset = 'builtin', select = 'builtin', setfenv = 'builtin', setmetatable = 'builtin',
tonumber = 'builtin', tostring = 'builtin', type = 'builtin', unpack = 'builtin', xpcall = 'builtin',
--
_G = 'builtin', _VERSION = 'builtin',
-- Lua libraries
bit32 = 'builtin', coroutine = 'builtin', debug = 'builtin', math = 'builtin', os = 'builtin', string = 'builtin', table = 'builtin', utf8 = 'builtin',
-- Roblox globals
delay = 'builtin', elapsedTime = 'builtin', require = 'builtin', settings = 'builtin', spawn = 'builtin', stats = 'builtin', tick = 'builtin', UserSettings = 'builtin', wait = 'builtin', warn = 'builtin',
--
Enum = 'builtin', game = 'builtin', shared = 'builtin', script = 'builtin', plugin = 'builtin', workspace = 'builtin',
-- Deprecated
printidentity = 'builtin_deprecated', version = 'builtin_deprecated',
-- Roblox types
Axes = 'builtin', BrickColor = 'builtin', CFrame = 'builtin', Color3 = 'builtin', ColorSequence = 'builtin', ColorSequenceKeypoint = 'builtin', DateTime = 'builtin', DockWidgetPluginGuiInfo = 'builtin',
Faces = 'builtin', Instance = 'builtin', NumberRange = 'builtin', NumberSequence = 'builtin', NumberSequenceKeypoint = 'builtin', PathWaypoint = 'builtin', PhysicalProperties = 'builtin', Random = 'builtin',
Ray = 'builtin', RaycastParams = 'builtin', Rect = 'builtin', Region3 = 'builtin', Region3int16 = 'builtin', TweenInfo = 'builtin', UDim = 'builtin', UDim2 = 'builtin', Vector2 = 'builtin',
Vector2int16 = 'builtin', Vector3 = 'builtin', Vector3int16 = 'builtin',
};
-- Operator/punctuation characters, keyed by byte value (e.g. 0x2B is '+')
local operator = {
[0x2B] = true, [0x2D] = true, [0x2A] = true, [0x2F] = true, [0x25] = true,
[0x3C] = true, [0x3E] = true, [0x3D] = true, [0x7E] = true,
[0x26] = true, [0x7C] = true, [0x5E] = true,
[0x23] = true, [0x2E] = true,
[0x28] = true, [0x29] = true, [0x2C] = true,
[0x5B] = true, [0x5D] = true, [0x3A] = true,
[0x7B] = true, [0x7D] = true, [0x2E] = true,
};
-- Numeric base prefixes, keyed by byte value: 'B'/'b' = 2, 'O'/'o' = 8, 'X'/'x' = 16
local bases = {
[0x42] = 2, [0x62] = 2,
[0x4F] = 8, [0x6F] = 8,
[0x58] = 16, [0x78] = 16
};
-- Converts a string into an array of its byte values, keeping the original
-- string in the 's' field so token text can be sliced out with string.sub.
-- string.byte can only return a limited number of values per call (it is
-- bounded by the VM stack), hence the 7997-byte chunking for longer sources.
local function to_char_array(self)
local len = #self;
if len <= 7997 then
return { s = self, string.byte(self, 1, len) };
end;
local clen = math.ceil(len / 7997);
local ret = table.create(len);
for i = 1, clen do
local c = table.pack(string.byte(self, i * 7997 - 7996, i * 7997 - (i == clen and 7997 - ((len - 1) % 7997 + 1) or 0)));
table.move(c, 1, c.n, i * 7997 - 7996, ret);
end;
ret.s = self;
return ret;
end;
-- Stateless iterator: takes the char array and a start index, and returns the
-- index just past the token (plus any trailing whitespace), the token type,
-- and the token's source text.
local function next_lex(codes, i0)
local c = codes[i0];
if not c then
return;
end;
local ttype = 'other';
local i1 = i0;
if (c >= 0x30 and c <= 0x39) or (c == 0x2E and (codes[i0 + 1] and codes[i0 + 1] >= 0x30 and codes[i0 + 1] <= 0x39)) then
-- Numbers
local isfloat, has_expt = c == 0x2E, false;
if c == 0x30 and bases[codes[i1 + 1]] then
i1 += 2;
end;
while true do
i1 += 1;
c = codes[i1];
if c == 0x2E then
if isfloat then
break;
end;
isfloat = true;
elseif c == 0x45 or c == 0x65 then
if isfloat or has_expt then
break
end;
has_expt = true;
elseif (not c) or (c < 0x30 or c > 0x39 and c ~= 0x5F) then
break;
end;
end;
ttype = 'number';
elseif c == 0x22 or c == 0x27 then
-- Strings
repeat
i1 += 1;
if codes[i1] == 0x5C then
i1 += 1;
local c2 = codes[i1];
if c2 == 0x75 then
i1 += 5;
elseif c2 == 0x78 then
i1 += 3;
else
i1 += 1;
end;
end;
until codes[i1] == c or not codes[i1];
i1 += 1;
ttype = 'string';
elseif operator[c] then
-- Operators/Comments/Strings
if c == 0x2D and codes[i0 + 1] == 0x2D then
i1 += 2;
local eq_sign = -1;
if codes[i1] == 0x5B then
repeat
i1 += 1;
eq_sign += 1;
until codes[i1] ~= 0x3D or not codes[i1];
end;
if eq_sign > -1 then
repeat
i1 = table.find(codes, 0x5D, i1 + 1) or #codes + 1;
local c = eq_sign;
while c > 0 and codes[i1 + 1] == 0x3D do
c -= 1;
i1 += 1;
end;
until c == 0 or not codes[i1];
i1 += eq_sign + 2;
else
repeat
i1 += 1;
until codes[i1] == 0x0A or not codes[i1];
end;
ttype = "comment";
elseif c == 0x5B and (codes[i0 + 1] == 0x5B or codes[i0 + 1] == 0x3D) then
local eq_sign = -1;
repeat
i1 += 1;
eq_sign += 1;
until codes[i1] ~= 0x3D or not codes[i1];
repeat
i1 = table.find(codes, 0x5D, i1 + 1) or #codes + 1;
local c = eq_sign;
while c > 0 and codes[i1 + 1] == 0x3D do
c -= 1;
i1 += 1;
end;
until c == 0 or not codes[i1];
i1 += eq_sign + 2;
ttype = "string";
else
ttype = "operator";
repeat
i1 += 1;
until not operator[codes[i1]];
end;
elseif (c >= 0x41 and c <= 0x5A) or (c >= 0x61 and c <= 0x7A) or c == 0x5F then
-- Identifiers
repeat
i1 += 1;
c = codes[i1];
until (not c) or (c < 0x30 or c > 0x39) and (c < 0x41 or c > 0x5A) and (c < 0x61 or c > 0x7A) and c ~= 0x5F;
ttype = identifiers[string.sub(codes.s, i0, i1 - 1)] or "identifier";
else
-- Others
repeat
i1 += 1;
c = codes[i1];
until (not c) or (c < 0x30 or c > 0x39) and (c < 0x41 or c > 0x5A) and (c < 0x61 or c > 0x7A) and c ~= 0x5F and c ~= 0x22 and c ~= 0x27 and not operator[c];
ttype = "other"
end;
-- Whitespaces
while codes[i1] and ((codes[i1] >= 0x09 and codes[i1] <= 0x0D) or codes[i1] == 0x20) do
i1 += 1;
end;
return i1, ttype, string.sub(codes.s, i0, i1 - 1);
end;
function lexer.lex(str)
return next_lex, to_char_array(str), 1;
end;
return lexer;
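For reference, lexer.lex returns a stateless iterator triple (next_lex, the char array, and a start index), so a generic for receives the next index first, then the token type and the source text. A minimal usage sketch, assuming the module is required as lexer:
for _, ttype, src in lexer.lex("for i = 1, n do end") do
print(ttype, "'"..src.."'")
end
--> keyword 'for '
--> identifier 'i '
--> operator '= '
--> number '1'
--> operator ', '
--> identifier 'n '
--> keyword 'do '
--> keyword 'end'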
I love yours! Definitely faster to do things down at that level instead of plain string manipulation.
Probably harder to maintain and debug though, but that’s the price you pay for such speed.
Thank you for sharing!
I made a pretty massive feature addition!
Non-sequential reading via a reusable Navigator object!
For my Code Outline plugin, I needed to be able to look at tokens both ahead and behind in order to gather data on variable names & values.
However, the lexer did not support reading in arbitrary order; it only had a sequential iterator (lexer.scan) to read from.
My initial solution was very inefficient: I created an array, then sequentially lexed the source and stored all the resulting tokens in the array. After that, I iterated over the array and could index it wherever and whenever I wanted.
As I’m sure you realized, that meant looping over every token at least twice: one loop to generate and store the tokens, one loop over the array, and possibly more indexing whenever a token was relevant to the work.
Obviously, I wasn’t going to sit around and accept that.
I wrote lexer.navigator() as a built-in wrapper for lexer.scan() that allows me to read in any order I want, and only generates tokens when called for. There’s some minor performance overhead (especially since it still has to generate tokens sequentially internally), but it’s better than multiple loops!
API & Usage:
Calling lexer.navigator() returns a Navigator.
Navigator:Destroy()
Cleans up the Navigator object.
Navigator:SetSource(Source)
Clears any old data and prepares the navigator to lex the given Source.
Navigator.Next
Iterator function that behaves like lexer.scan.
Navigator.Peek(PeekAmount)
Function to see a token that is PeekAmount away from where the Next iterator is currently. Passing a negative value will look backward, passing a positive value will look ahead.
local source = "for i = 1, n do end"
-- The 'navigator' function returns a navigator object:
-- Navigators allow you to use nav.Peek() for non-sequential reads
local nav = lexer.navigator()
nav:SetSource(source) -- You can reuse navigators by setting a new Source
for token,src in nav.Next do
print(token, "'"..src.."'")
local peektoken,peeksrc = nav.Peek(2) -- You can peek backwards by passing a negative input
if peektoken then
print(" Peeked ahead by 2:",peektoken,"'"..peeksrc.."'")
end
end
--> keyword 'for '
--> Peeked ahead by 2: operator '= '
--> iden 'i '
--> Peeked ahead by 2: number '1'
--> operator '= '
--> Peeked ahead by 2: operator ', '
--> number '1'
--> Peeked ahead by 2: iden 'n '
--> operator ', '
--> Peeked ahead by 2: keyword 'do '
--> iden 'n '
--> Peeked ahead by 2: keyword 'end'
--> keyword 'do '
--> keyword 'end'
Notes:
When you peek ahead, it sequentially generates tokens up to the desired token and then returns it. The .Next iterator then reuses the token values generated by that peek, until it passes the point the peek went to.
When you peek behind, it just grabs the token data out of its token cache so there’s nearly no performance hit.
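For example, a look-behind while iterating might look like this (a sketch; the token and source checks are just illustrative):
for token, src in nav.Next do
if token == "operator" and string.match(src, "^=") then
local prevtoken, prevsrc = nav.Peek(-1) -- served from the cache, so nearly free
print("assignment target:", prevsrc)
end
end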
The navigator should theoretically work with any version of the lexer, since it’s wrapping the given lexer.scan()
function. Therefore, if you’re still relying on an older version’s behavior, you can just add the lexer.navigator()
function to the end of your module and it should behave as expected.
Navigator Source
function lexer.navigator()
local nav = {
Source = "";
TokenCache = table.create(50);
_RealIndex = 0;
_UserIndex = 0;
_ScanThread = nil;
}
function nav:Destroy()
self.Source = nil
self._RealIndex = nil;
self._UserIndex = nil;
self.TokenCache = nil;
self._ScanThread = nil;
end
function nav:SetSource(SourceString)
self.Source = SourceString
self._RealIndex = 0;
self._UserIndex = 0;
table.clear(self.TokenCache)
self._ScanThread = coroutine.create(function()
for Token,Src in lexer.scan(self.Source) do
self._RealIndex += 1
self.TokenCache[self._RealIndex] = {Token; Src;}
coroutine.yield(Token,Src)
end
end)
end
function nav.Next()
nav._UserIndex += 1
if nav._RealIndex >= nav._UserIndex then
-- Already scanned, return cached
return table.unpack(nav.TokenCache[nav._UserIndex])
else
if coroutine.status(nav._ScanThread) == 'dead' then
-- Scan thread dead
return
else
local success, token, src = coroutine.resume(nav._ScanThread)
if success and token then
-- Scanned new data
return token,src
else
-- Lex completed
return
end
end
end
end
function nav.Peek(PeekAmount)
local GoalIndex = nav._UserIndex + PeekAmount
if nav._RealIndex >= GoalIndex then
-- Already scanned, return cached
if GoalIndex > 0 then
return table.unpack(nav.TokenCache[GoalIndex])
else
-- Invalid peek
return
end
else
if coroutine.status(nav._ScanThread) == 'dead' then
-- Scan thread dead
return
else
local IterationsAway = GoalIndex - nav._RealIndex
local success, token, src = nil,nil,nil
for i=1, IterationsAway do
success, token, src = coroutine.resume(nav._ScanThread)
if not (success and token) then
-- Lex completed
break
end
end
return token,src
end
end
end
return nav
end
Full Latest Lexer Source
--[=[
Lexical scanner for creating a sequence of tokens from Lua source code.
This is a heavily modified and Roblox-optimized version of
the original Penlight Lexer module:
https://github.com/stevedonovan/Penlight
Authors:
stevedonovan <https://github.com/stevedonovan> ----------- Original Penlight lexer author
ryanjmulder <https://github.com/ryanjmulder> ------------- Penlight lexer contributor
mpeterv <https://github.com/mpeterv> --------------------- Penlight lexer contributor
Tieske <https://github.com/Tieske> ----------------------- Penlight lexer contributor
boatbomber <https://github.com/boatbomber> --------------- Roblox port, added builtin token, added patterns for incomplete syntax, bug fixes, behavior changes, token optimization
Sleitnick <https://github.com/Sleitnick> ----------------- Roblox optimizations
howmanysmall <https://github.com/howmanysmall> ----------- Lua + Roblox optimizations
boatbomber <https://github.com/boatbomber> --------------- Added lexer.navigator() for non-sequential reads
List of possible tokens:
- iden
- keyword
- builtin
- string
- number
- comment
- operator
Usage:
local source = "for i = 1, n do end"
-- The 'scan' function returns a token iterator:
for token,src in lexer.scan(source) do
print(token, "'"..src.."'")
end
--> keyword 'for '
--> iden 'i '
--> operator '= '
--> number '1'
--> operator ', '
--> iden 'n '
--> keyword 'do '
--> keyword 'end'
-- The 'navigator' function returns a navigator object:
-- Navigators allow you to use nav.Peek() for non-sequential reads
local nav = lexer.navigator()
nav:SetSource(source) -- You can reuse navigators by setting a new Source
for token,src in nav.Next do
print(token, "'"..src.."'")
local peektoken,peeksrc = nav.Peek(2) -- You can peek backwards by passing a negative input
if peektoken then
print(" Peeked ahead by 2:",peektoken,"'"..peeksrc.."'")
end
end
--> keyword 'for '
--> Peeked ahead by 2: operator '= '
--> iden 'i '
--> Peeked ahead by 2: number '1'
--> operator '= '
--> Peeked ahead by 2: operator ', '
--> number '1'
--> Peeked ahead by 2: iden 'n '
--> operator ', '
--> Peeked ahead by 2: keyword 'do '
--> iden 'n '
--> Peeked ahead by 2: keyword 'end'
--> keyword 'do '
--> keyword 'end'
--]=]
local lexer = {}
local Prefix,Suffix,Cleaner = "^[ \t\n\0\a\b\v\f\r]*", "[ \t\n\0\a\b\v\f\r]*", "[ \t\n\0\a\b\v\f\r]+"
local NUMBER_A = "0x[%da-fA-F]+"
local NUMBER_B = "%d+%.?%d*[eE][%+%-]?%d+"
local NUMBER_C = "%d+[%._]?[%d_eE]*"
local OPERATORS = "[:;<>/~%*%(%)%-=,{}%.#%^%+%%]+"
local BRACKETS = "[%[%]]+" -- needs to be separate pattern from other operators or it'll mess up multiline strings
local IDEN = "[%a_][%w_]*"
local STRING_EMPTY = "(['\"])%1" --Empty String
local STRING_PLAIN = [=[(['"])[%w%p \t\v\b\f\r\a]-([^%\]%1)]=] --TODO: Handle escaping escapes
local STRING_INCOMP_A = "(['\"]).-\n" --Incompleted String with next line
local STRING_INCOMP_B = "(['\"])[^\n]*" --Incompleted String without next line
local STRING_MULTI = "%[(=*)%[.-%]%1%]" --Multiline-String
local STRING_MULTI_INCOMP = "%[=*%[.-.*" --Incompleted Multiline-String
local COMMENT_MULTI = "%-%-%[(=*)%[.-%]%1%]" --Completed Multiline-Comment
local COMMENT_MULTI_INCOMP = "%-%-%[=*%[.-.*" --Incompleted Multiline-Comment
local COMMENT_PLAIN = "%-%-.-\n" --Completed Singleline-Comment
local COMMENT_INCOMP = "%-%-.*" --Incompleted Singleline-Comment
local TABLE_EMPTY = {}
local lua_keyword = {
["and"] = true, ["break"] = true, ["do"] = true, ["else"] = true, ["elseif"] = true,
["end"] = true, ["false"] = true, ["for"] = true, ["function"] = true, ["if"] = true,
["in"] = true, ["local"] = true, ["nil"] = true, ["not"] = true, ["while"] = true,
["or"] = true, ["repeat"] = true, ["return"] = true, ["then"] = true, ["true"] = true,
["self"] = true, ["until"] = true,
["continue"] = true,
["plugin"] = true, --Highlights as a keyword instead of a builtin cuz Roblox is weird
}
local lua_builtin = {
-- Lua Functions
["assert"] = true;["collectgarbage"] = true;["error"] = true;["getfenv"] = true;
["getmetatable"] = true;["ipairs"] = true;["loadstring"] = true;["newproxy"] = true;
["next"] = true;["pairs"] = true;["pcall"] = true;["print"] = true;["rawequal"] = true;
["rawget"] = true;["rawset"] = true;["select"] = true;["setfenv"] = true;["setmetatable"] = true;
["tonumber"] = true;["tostring"] = true;["type"] = true;["unpack"] = true;["xpcall"] = true;
-- Lua Variables
["_G"] = true;["_VERSION"] = true;
-- Lua Tables
["bit32"] = true;["coroutine"] = true;["debug"] = true;
["math"] = true;["os"] = true;["string"] = true;
["table"] = true;["utf8"] = true;
-- Roblox Functions
["delay"] = true;["elapsedTime"] = true;["gcinfo"] = true;["require"] = true;
["settings"] = true;["spawn"] = true;["tick"] = true;["time"] = true;["typeof"] = true;
["UserSettings"] = true;["wait"] = true;["warn"] = true;["ypcall"] = true;
-- Roblox Variables
["Enum"] = true;["game"] = true;["shared"] = true;["script"] = true;
["workspace"] = true;
-- Roblox Tables
["Axes"] = true;["BrickColor"] = true;["CellId"] = true;["CFrame"] = true;["Color3"] = true;
["ColorSequence"] = true;["ColorSequenceKeypoint"] = true;["DateTime"] = true;
["DockWidgetPluginGuiInfo"] = true;["Faces"] = true;["Instance"] = true;["NumberRange"] = true;
["NumberSequence"] = true;["NumberSequenceKeypoint"] = true;["PathWaypoint"] = true;
["PhysicalProperties"] = true;["PluginDrag"] = true;["Random"] = true;["Ray"] = true;["Rect"] = true;
["Region3"] = true;["Region3int16"] = true;["TweenInfo"] = true;["UDim"] = true;["UDim2"] = true;
["Vector2"] = true;["Vector2int16"] = true;["Vector3"] = true;["Vector3int16"] = true;
}
local function idump(tok)
--print("tok unknown:",tok)
return coroutine.yield("iden", tok)
end
local function odump(tok)
return coroutine.yield("operator", tok)
end
local function ndump(tok)
return coroutine.yield("number", tok)
end
local function sdump(tok)
return coroutine.yield("string", tok)
end
local function cdump(tok)
return coroutine.yield("comment", tok)
end
local function lua_vdump(tok)
-- Since we merge spaces into the tok, we need to remove them
-- in order to check the actual word it contains
local cleanTok = string.gsub(tok,Cleaner,"")
if lua_keyword[cleanTok] then
return coroutine.yield("keyword", tok)
elseif lua_builtin[cleanTok] then
return coroutine.yield("builtin", tok)
else
return coroutine.yield("iden", tok)
end
end
local lua_matches = {
-- Identifiers
{Prefix.. IDEN ..Suffix, lua_vdump},
-- Numbers
{Prefix.. NUMBER_A ..Suffix, ndump},
{Prefix.. NUMBER_B ..Suffix, ndump},
{Prefix.. NUMBER_C ..Suffix, ndump},
-- Strings
{Prefix.. STRING_EMPTY ..Suffix, sdump},
{Prefix.. STRING_PLAIN ..Suffix, sdump},
{Prefix.. STRING_INCOMP_A ..Suffix, sdump},
{Prefix.. STRING_INCOMP_B ..Suffix, sdump},
{Prefix.. STRING_MULTI ..Suffix, sdump},
{Prefix.. STRING_MULTI_INCOMP ..Suffix, sdump},
-- Comments
{Prefix.. COMMENT_MULTI ..Suffix, cdump},
{Prefix.. COMMENT_MULTI_INCOMP ..Suffix, cdump},
{Prefix.. COMMENT_PLAIN ..Suffix, cdump},
{Prefix.. COMMENT_INCOMP ..Suffix, cdump},
-- Operators
{Prefix.. OPERATORS ..Suffix, odump},
{Prefix.. BRACKETS ..Suffix, odump},
-- Unknown
{"^.", idump}
}
--- Create a plain token iterator from a string.
-- @tparam string s a string.
function lexer.scan(s)
local startTime = os.clock()
lexer.finished = false
local function lex(first_arg)
local line_nr = 0
local sz = #s
local idx = 1
-- res is the value used to resume the coroutine.
local function handle_requests(res)
while res do
local tp = type(res)
-- Insert a token list:
if tp == "table" then
res = coroutine.yield("", "")
for _, t in ipairs(res) do
res = coroutine.yield(t[1], t[2])
end
elseif tp == "string" then -- Or search up to some special pattern:
local i1, i2 = string.find(s, res, idx)
if i1 then
idx = i2 + 1
res = coroutine.yield("", string.sub(s, i1, i2))
else
res = coroutine.yield("", "")
idx = sz + 1
end
else
res = coroutine.yield(line_nr, idx)
end
end
end
handle_requests(first_arg)
line_nr = 1
while true do
if idx > sz then
while true do
handle_requests(coroutine.yield())
end
end
for _, m in ipairs(lua_matches) do
local findres = {}
local i1, i2 = string.find(s, m[1], idx)
findres[1], findres[2] = i1, i2
if i1 then
local tok = string.sub(s, i1, i2)
idx = i2 + 1
lexer.finished = idx > sz
--if lexer.finished then
-- print(string.format("Lex took %.2f ms", (os.clock()-startTime)*1000 ))
--end
local res = m[2](tok, findres)
if string.find(tok, "\n") then
-- Update line number:
local _, newlines = string.gsub(tok, "\n", TABLE_EMPTY)
line_nr = line_nr + newlines
end
handle_requests(res)
break
end
end
end
end
return coroutine.wrap(lex)
end
function lexer.navigator()
local nav = {
Source = "";
TokenCache = table.create(50);
_RealIndex = 0;
_UserIndex = 0;
_ScanThread = nil;
}
function nav:Destroy()
self.Source = nil
self._RealIndex = nil;
self._UserIndex = nil;
self.TokenCache = nil;
self._ScanThread = nil;
end
function nav:SetSource(SourceString)
self.Source = SourceString
self._RealIndex = 0;
self._UserIndex = 0;
table.clear(self.TokenCache)
self._ScanThread = coroutine.create(function()
for Token,Src in lexer.scan(self.Source) do
self._RealIndex += 1
self.TokenCache[self._RealIndex] = {Token; Src;}
coroutine.yield(Token,Src)
end
end)
end
function nav.Next()
nav._UserIndex += 1
if nav._RealIndex >= nav._UserIndex then
-- Already scanned, return cached
return table.unpack(nav.TokenCache[nav._UserIndex])
else
if coroutine.status(nav._ScanThread) == 'dead' then
-- Scan thread dead
return
else
local success, token, src = coroutine.resume(nav._ScanThread)
if success and token then
-- Scanned new data
return token,src
else
-- Lex completed
return
end
end
end
end
function nav.Peek(PeekAmount)
local GoalIndex = nav._UserIndex + PeekAmount
if nav._RealIndex >= GoalIndex then
-- Already scanned, return cached
if GoalIndex > 0 then
return table.unpack(nav.TokenCache[GoalIndex])
else
-- Invalid peek
return
end
else
if coroutine.status(nav._ScanThread) == 'dead' then
-- Scan thread dead
return
else
local IterationsAway = GoalIndex - nav._RealIndex
local success, token, src = nil,nil,nil
for i=1, IterationsAway do
success, token, src = coroutine.resume(nav._ScanThread)
if not (success and token) then
-- Lex completed
break
end
end
return token,src
end
end
end
return nav
end
return lexer
Edit: Check out the true latest on the GitHub repository instead! New features, perf improvements, and a syntax highlighter with RichText!
I can’t wait to try this mad fortune
You really make our lives easier. Thanks to the awesome open source devs
I was looking for a lexer that wasn’t part of a Lua compiler. I was thinking of writing my own and had even started doing so, but thank god I found this. Now I just need to write a parser for it. Thank you!
This is very useful! This might seem like a stupid question, but why use _ before some properties of the nav object (_RealIndex, _UserIndex, and _ScanThread) but not others (Source, TokenCache)?
Also, in the API & Usage section there’s a typo.
Fixed, thanks for catching that!
It’s general practice to use a _ prefix to denote “private” values. That is to say, when you use the nav object, you shouldn’t ever directly interact with nav._ScanThread, because you risk breaking the nav if you alter it. Those values are used internally by the object itself, but aren’t meant to be used at the higher level.
Source and TokenCache are “public” because I’ve found it’s useful to check what source is currently active and to directly index a token at a specific place (ex: checking the first token).
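For instance, checking the first token of the active source is just a cache index (sketch):
local nav = lexer.navigator()
nav:SetSource("local x = 1")
nav.Next() -- generate at least one token
print(nav.TokenCache[1][1]) --> keyword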
Hey guys! I’ve continued working on this!
I’ve refactored it (no more coroutines or return function list) and added contextual tokenizing (os.clock() will highlight clock as a builtin, and myThing.spawn() will no longer highlight spawn as the builtin function).
I’ve also released a highlighter module to go with this lexer! It uses RichText labels to highlight the code based on the tokens the lexer spits out.
Rather than continue to post into this obscure thread, I’ve opened a GitHub repository for it:
A lexer breaks code down into segments, often referred to as tokens, according to token types (e.g. keywords, operators, strings, brackets) and other criteria. This is used to parse, analyze, highlight, and evaluate code syntax.
I like it. This lexer version is much cleaner and more understandable; I really didn’t like the coroutines you were previously using. Hopefully with that change you’ll be able to add the newer, more complex syntax highlighting for methods, function names, properties, and the others that Luau has in Studio. Good job!
Also, since it’s a GitHub repo, will other developers be able to contribute to this project as well?
Contribution is welcome! I will likely be adding more features as well. You can also open GitHub issues with suggestions/requests/problems!
Hey, I was wondering how you show this in a TextBox so that the user can copy and paste it? Or is that not something this can be used for? If not, I’m still not sure how to even show it in a TextLabel.
You can have a TextBox, but the highlighted text has to be in a separate TextLabel. Then just update the TextLabel’s highlighting every time the player inputs some text into the TextBox. You could have the TextLabel overlap the TextBox or just put it next to it.
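A rough sketch of that overlay setup (the instance names are hypothetical, and highlight stands in for whatever function turns source text into a RichText string from the lexer’s tokens):
local Input = script.Parent.Input -- TextBox the user types into
local Display = script.Parent.Display -- TextLabel with RichText = true, layered over Input
Input.TextTransparency = 1 -- hide the raw text so only the highlighted copy shows
Input:GetPropertyChangedSignal("Text"):Connect(function()
Display.Text = highlight(Input.Text) -- hypothetical highlighter function
end)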
OK, but I still have no clue how to use this; would you happen to know?