An (almost) clone of the Luau lexer

There’s a problem with RegisterAutocompleteCallback and RegisterScriptAnalysisCallback: these are very powerful functions, but they’re underutilised because of one core missing feature:

No access to the AST/Lexer

But what is an AST or a Lexer? A Lexer breaks a script down into usable tokens that are easier to work with programmatically, and an AST (abstract syntax tree) is built from that Lexer data (called lexemes) to represent the script’s structure. (I have no idea how to write an AST.)
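For instance, a lexer might turn the line local count = 5 + 2 into a flat token stream like this (a rough sketch; the token names here are illustrative, loosely following Luau’s lexeme naming):

-- What a lexer does, conceptually: one line of source becomes a
-- flat stream of tagged tokens (names are illustrative)
local tokens = {
  { Type = "ReservedLocal", Text = "local" },
  { Type = "Name",          Text = "count" },
  { Type = "Assign",        Text = "=" },
  { Type = "Number",        Text = "5" },
  { Type = "Add",           Text = "+" },
  { Type = "Number",        Text = "2" },
}
-- An AST step would then turn this stream into a tree, e.g. a
-- local-assignment node whose value is a "+" expression over 5 and 2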

So why am I complaining about the missing API instead of doing something about it? Well, I made my own Lexer that’s forked (for the most part) from the Luau source code.

Usage

You can download the Lexer module here (6.5 KB)

Then you can require it like you would any other module. This module probably doesn’t have much use outside of plugins working with the script editor, so I’ll use that as the usage example.

Here’s a script that uses analysis callbacks to highlight every string (QuotedString and RawString):

local ScriptEditorService = game:GetService("ScriptEditorService")
local Lexer = require(script.Parent.Lexer)

ScriptEditorService:RegisterScriptAnalysisCallback("cat", 1, function(req)
  local scr = req.script
  -- Tokenise the current contents of the open editor
  local tokens = Lexer(ScriptEditorService:GetEditorSource(scr))

  local diags = {}
  for _, token: Lexer.Token in tokens do
    if token.Type == "QuotedString" or token.Type == "RawString" then
      local position = token.Position
      table.insert(diags, {
        range = {
          start = {
            line = position.start.line,
            -- +1: token character offsets appear to be 0-based,
            -- while diagnostic columns are 1-based
            character = position.start.character + 1
          },
          ["end"] = {
            line = position["end"].line,
            character = position["end"].character + 1
          }
        },
        message = "A string!",
        severity = Enum.Severity.Information
      })
    end
  end

  return {diagnostics = diags}
end)

This script breaks the source down into tokens using the Lexer module, then raises an information diagnostic on every string token it finds.

[image: strings highlighted in the script editor]

Future Plans

Sometime down the road I do want to attempt a full AST using this as a backend, but as I said, I have no idea where to start on that, so for now I guess the Lexer is good enough.

Attributions

This entire module is based on roblox/luau


Please document it; it’s horrible to have to mess around with it just to figure things out.


would’ve been nice if you documented it!

Man I love when devs post pure hieroglyphs on here and expect everyone else to know what they’re talking about.


The resource is clearly not designed for you then. Your lack of knowledge does not make this a hieroglyph :joy:


How is he supposed to get more knowledge if there is no way to learn this via documentation? That’s like calling a child stupid because he can’t solve algebra when there are no books about it.

People on the devforum should stop being jerks and remember that behind the screen there are real people; we were all beginners at one point.


Mate I think a whole 2 people on the platform understand this. I’m probably on the more advanced side of this website and it’s just gibberish to me.

The lack of documentation worsens it


That’s true. I really want to implement this module into some of my plugins right now, but because of its lack of documentation I can’t do that.


So I want to add my input separately from what has already been said (which I think is just practically invalid and mindless criticism) and state that ideally this should have been put on GitHub.

Primarily a GitHub release would’ve been preferred here for two reasons:

  1. Allows for community contributions whenever the official C++ version of the Lexer gets updated so that the Luau version can be as recent as possible
  2. If anybody in the future (or you yourself) wanted to put in the effort to effectively bring the C++ AST over to Luau, then a new branch could be made specifically for that and later merged once completed.

That’s all I have to say to you. And although simply converting the code itself here is not all that difficult to do, saving everybody else the effort and putting it out as a resource is great.

For everybody else here who has complained about the lack of documentation: there is not really much to document. A Lexer as a concept is basically self-documenting simply by how it behaves; there’s not much you can add to better describe what it is doing.

Not only that, but the module itself only returns one function, with its return types clearly typed/written. So if you can’t figure out what to do with this module (or how to use it), I would argue that you probably need to work a little more with the concept, since it’s relatively simple and (in this instance) the module has an even simpler return type.

If you’re still struggling, have a look at the exported types within the module. The only exposed function returns {Token}, which is simply an array of Tokens (and if you can’t read that type, this module really isn’t for you). The rest is extremely self-explanatory.
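For reference, the token shape implied by the usage example in the OP looks roughly like this (a reconstruction from that code, not the module’s literal definition):

-- Rough sketch of the exported Token type, inferred from the OP's
-- example; check the module's exported types for the real definition
type TokenPosition = {
  line: number,
  character: number,
}

export type Token = {
  Type: string, -- e.g. "QuotedString", "RawString", "Number", ...
  Position: {
    start: TokenPosition,
    ["end"]: TokenPosition,
  },
  -- ...plus whatever other fields the module attaches
}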

(Arguably) the only place where documentation is really needed is to better describe all the Lexeme types. For example, “Dot2”, “Dot3”, and “SkinnyArrow” are a few that I think a majority (those who haven’t interacted with the Luau Lexer before) won’t necessarily recognise until they’ve run some strings through the Lexer.
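For what it’s worth, those three map to small bits of punctuation. Assuming the module keeps the C++ lexeme names (as this reply suggests), you can see them by running snippets through it:

local Lexer = require(script.Parent.Lexer)

-- ".." should lex as Dot2 (concatenation), "..." as Dot3 (varargs),
-- and "->" as SkinnyArrow (the return-type arrow in type annotations),
-- assuming the module keeps the C++ lexeme names
for _, source in { "a .. b", "print(...)", "type F = (number) -> number" } do
  for _, token in Lexer(source) do
    print(token.Type)
  end
end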

However, I won’t disregard the comments that already exist within the C++ Lexer. Barely any of them actually describe the behaviour or what is being done (most are just repeats explaining why a certain keyword is used), but the ones that do should’ve also been transferred over for parity.

I also think this would’ve been a great opportunity to use full words for easier reading/understanding (for those new to the concept) like “InterpolatedStringBegin” instead of “InterpStringBegin”.

The Lexeme types should’ve also been broken up into a table clearly indicating what category they fall into and their variant. This would remove the need for 22 separate keyword entries (all prefixed with “Keyword”) in favour of one Keyword entry with 22 variants ({Type: “Keyword”; Variant: (“If” | “Then” | “Else” | …)}). This is specifically helpful when targeting certain categories of tokens rather than one specific “type”.
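Something like this, as a sketch (hypothetical; the module currently emits flat Type strings, not this shape):

-- Hypothetical categorised token type, per the suggestion above;
-- this is NOT what the module currently returns
type KeywordVariant = "If" | "Then" | "Else" | "For" | "While" -- ...and so on
type Token =
  { Type: "Keyword", Variant: KeywordVariant }
  | { Type: string } -- other categories left flat for brevity

-- Targeting the whole keyword category then becomes one comparison:
local function isKeyword(token: Token): boolean
  return token.Type == "Keyword"
end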


There’s no advanced side of “this website”. If a resource exists, if a resource has been made, it’s because people will use it.

It may not be fit with your use cases, but it’s a time saver for people that use or manipulate bulk amounts of Luau code.

This is cool! I just made a simple lexer a few months ago!

I’m also wondering how to create an Abstract-Syntax-Tree in Luau. Hopefully someone creates a tutorial on how to do that or I’ll have to learn how to create one and make a tutorial on it.

Edit: Forgot about this tutorial that uses the “shunting yard” algorithm to make a parser
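For anyone curious, the core of shunting yard is small. Here’s a minimal Luau sketch that converts infix tokens to postfix (simplified assumptions on my part: binary, left-associative operators only, no parentheses; not code from the linked tutorial):

-- Minimal shunting-yard sketch: turns a list of number/operator
-- tokens from infix into postfix (RPN) order
local PRECEDENCE = { ["+"] = 1, ["-"] = 1, ["*"] = 2, ["/"] = 2 }

local function toPostfix(tokens: { string }): { string }
  local output, operators = {}, {}
  for _, token in tokens do
    if PRECEDENCE[token] then
      -- Pop operators of greater or equal precedence before pushing
      while #operators > 0
        and PRECEDENCE[operators[#operators]] >= PRECEDENCE[token] do
        table.insert(output, table.remove(operators))
      end
      table.insert(operators, token)
    else
      table.insert(output, token) -- numbers go straight to the output
    end
  end
  while #operators > 0 do
    table.insert(output, table.remove(operators))
  end
  return output
end

print(table.concat(toPostfix({ "5", "+", "2", "*", "3" }), " ")) --> 5 2 3 * +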

This is very helpful. I’m trying to make a beautifier plugin, so I don’t need to convert the Luau lexer myself. I did find some problems in this though, such as the way numbers are read and one missing operator.

Hi, super useful! Thanks for making it.

However, I noticed that strings, numbers, and operators are invisible sometimes?

[image: lexer output with missing token text]

Inputted:

if 5 >= 2 or "this" ~= 20 then
end

This is not my code; it’s something with this module, I believe.

Edit:
For anybody else experiencing this, I did manage to solve my issues with numbers like this:

--l616 (around this line in the Lexer module)
if isDigit(ch) then
	-- Workaround: emit a single-character Number token instead of
	-- calling the module's readNumber, which misreads some numbers
	--return readNumber(start, stream.Offset)
	return token(start, start + 1, "Number", stream:read())