Best [rbx] lua 5.1 abstract syntax tree?

I’m not very experienced in this topic, so I would like to dive right into working with an AST and learn the standards others use when structuring their ASTs.

Do you guys know of any great ones? I want something that has great documentation (or please direct me to where the AST standards are described), because if it comes down to deciphering code, I would rather reimplement it instead.

Thank you

1 Like

Do you mean grammar? AST is just that: arbitrary. There is no standard Lua AST.

Roblox Lua doesn’t add any new syntax (unlike others like Garry’s Mod’s Lua, which adds things like continue and !), so a normal Lua 5.1 grammar file will work for Roblox.

By searching “Lua grammar”, you should be able to find various ways the Lua grammar has been written.

The Lua 5.1 Reference Manual includes grammar productions for its syntax as it introduces each construct: Lua 5.1 Reference Manual

Example:
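The manual’s “Complete Syntax of Lua” section (§8) writes the grammar in an extended BNF; an abridged excerpt:

```
chunk ::= {stat [`;´]} [laststat [`;´]]

block ::= chunk

stat ::=  varlist `=´ explist |
          functioncall |
          do block end |
          while exp do block end |
          repeat block until exp |
          if exp then block {elseif exp then block} [else block] end | ...
```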

2 Likes

A couple months back I wrote a complete Lua 5.1 parser. With all possible bias, this is the most readable and documented Lua parser that I know of.

There’s a parser in lua-minify, written by @stravant, which luasyntax is partially based off of.

There’s luacheck, written in Lua, which contains a lexer and parser.

The Lua source code itself is also an excellent resource.

7 Likes

Well, after spending far too long trying to get ANTLR to work (or even successfully install, lol, I don’t know why this is so hard), I have given up :confused:


I will try @Anaminus’s parsers tomorrow, but am I completely missing something? Earlier today when I looked at stravant’s parsers, I felt like both the lua-minify you linked and (maybe an older version?) GitHub - stravant/LuaMinify: Lua source code minifier. have no documentation.
Do I just need to learn to read better, or what is it?

No, both versions are almost entirely undocumented as far as parsing goes, and are quite challenging to read. When I looked to them for reference, walking through the code manually was really taxing. That’s a lot of the reason why I spent time documenting mine.

As mentioned above, grammars are very helpful for learning about parsers. They provide insight into how a parser can be structured in a general way. It’s so general that a grammar can actually be parsed to generate code that parses the language described by the grammar (yacc is one such example). You might turn here if you prefer learning parsers top-down.

For learning them bottom-up, I’m not really aware of an implementation of any kind that’s known for being a good reference for learning about parsers in general. Most parsers are written for parsing rather than teaching, so there’s really not much else to do than diving in and figuring it out. A debugger that lets you walk through the code step by step is very helpful for this. I think I used Studio’s debugger to walk through lua-minify at some points.

And of course, implementing your own parser is highly recommended. Parsing JSON is a great place to start, as the format is small, simple, and useful in practice. For a full Turing-complete language, Lisp also has a very easy syntax. Actually, Lisp and its many dialects are used often for teaching, so there’s probably a good reference Lisp parser out there somewhere.
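To make that concrete, here is a minimal sketch of a recursive-descent JSON parser in Lua 5.1. It is purely illustrative: it skips string escapes, has no real error handling, and JSON `null` becomes `nil` (so it vanishes from tables).

```lua
-- Minimal recursive-descent JSON parser sketch (Lua 5.1).
-- Illustrative only: no escape sequences in strings, no error recovery.
local function parse_json(s)
  local pos = 1
  local function skip() pos = s:find("[^ \t\r\n]", pos) or #s + 1 end
  local parse_value -- forward declaration for mutual recursion

  local function parse_string()
    local close = s:find('"', pos + 1, true) -- naive: no \" escapes
    local str = s:sub(pos + 1, close - 1)
    pos = close + 1
    return str
  end

  local function parse_number()
    local num = s:match("^-?%d+%.?%d*[eE]?[+-]?%d*", pos)
    pos = pos + #num
    return tonumber(num)
  end

  local function parse_array()
    pos = pos + 1 -- skip '['
    local arr = {}
    skip()
    if s:sub(pos, pos) == "]" then pos = pos + 1; return arr end
    while true do
      arr[#arr + 1] = parse_value()
      skip()
      local c = s:sub(pos, pos); pos = pos + 1 -- ',' or ']'
      if c == "]" then return arr end
      skip()
    end
  end

  local function parse_object()
    pos = pos + 1 -- skip '{'
    local obj = {}
    skip()
    if s:sub(pos, pos) == "}" then pos = pos + 1; return obj end
    while true do
      local key = parse_string()
      skip(); pos = pos + 1 -- skip ':'
      skip()
      obj[key] = parse_value()
      skip()
      local c = s:sub(pos, pos); pos = pos + 1 -- ',' or '}'
      if c == "}" then return obj end
      skip()
    end
  end

  parse_value = function()
    skip()
    local c = s:sub(pos, pos)
    if c == '"' then return parse_string()
    elseif c == "{" then return parse_object()
    elseif c == "[" then return parse_array()
    elseif c == "t" then pos = pos + 4; return true   -- "true"
    elseif c == "f" then pos = pos + 5; return false  -- "false"
    elseif c == "n" then pos = pos + 4; return nil    -- "null"
    else return parse_number() end
  end

  return parse_value()
end
```

Even at this size it exercises the core moves of any hand-written parser: peeking at the next character, dispatching to a rule, and recursing for nested structures.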

3 Likes

There is and I’ve made my own from it before.

1 Like

The Lua 5.1 grammar as shown in the manual may be suitable for building a parse tree, but it is too complex to make a desirable abstract syntax tree. Not to mention that ASTs are usually generated on a function-by-function basis rather than file by file. Here is an explanation of the GNU Compiler Collection’s AST: [here]

I’m currently working on building a GLR parser which I plan to use to add type checking. GLR parsers are known for how easy they are to use, because grammars do not have to be mutilated to be parsed by a GLR parser. In addition, my parser requires no installation, and can be run in any 5.1 or newer Lua environment. Do note though that RBXLua’s require takes in an instance instead of a string path, so to run the parser in RBXLua you will have to edit the require paths.

I haven’t put much thought into ASTs for Lua yet, but I can imagine a basic outline. Here is an example of what I envision an AST would look like after having been converted from the parse tree:

CHUNK:
	statements = (LOOP | BRANCH | CALL | ASSGN | BREAK | RETURN)*

The parsing done before makes sure that a return statement isn’t in the middle of a chunk, and other things like that.

LOOP:
	chunk = CHUNK

Every loop in Lua can be converted into a section of code that continuously gets rerun until a break or return is encountered. I’d add assignment and branch nodes to the loop’s chunk node to represent for loops, generic for loops, repeat loops, and while loops. This is a huge abstraction from the parse tree.
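For instance, here is a small runnable sketch (my own illustration, not from any particular parser) of a numeric for loop rewritten into that generic loop-plus-branch form:

```lua
-- A numeric for loop...
local sum1 = 0
for i = 1, 5 do
  sum1 = sum1 + i
end

-- ...rewritten into the generic LOOP form: a chunk that reruns until a
-- BRANCH triggers a break, with extra ASSGN steps for the loop variable.
local sum2 = 0
local i = 1
while true do                     -- LOOP
  if not (i <= 5) then break end  -- BRANCH + BREAK
  sum2 = sum2 + i                 -- loop body (CHUNK)
  i = i + 1                       -- induction-variable ASSGN
end

assert(sum1 == sum2) -- both compute 1 + 2 + 3 + 4 + 5
```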

BRANCH:
	condition = EXP
	chunk = CHUNK
	[next = BRANCH]

The EXP for an else branch would be set to true, allowing all parts of an if statement to be abstractly represented as a branch node.

CALL:
	prefix = MEM
	args = (EXP)*

The call node would contain a MEM node, and links to the argument expression nodes. A method call using : would simply prepend the prefix to the argument list. Don’t ask me about metatables. It seems to me that metatables make almost any static semantic analysis very, very difficult. Just don’t use them :stuck_out_tongue:
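A quick runnable illustration of that `:` desugaring (my own example; the table and method names are made up):

```lua
-- obj:m(x) is sugar for obj.m(obj, x): the prefix expression is passed
-- as the first argument, so a CALL node can represent both forms by
-- prepending the prefix to the argument list.
local obj = { name = "widget" }
function obj.describe(self, suffix)
  return self.name .. suffix
end

assert(obj:describe("!") == obj.describe(obj, "!"))
```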

ASSGN:
	locations = (MEM)*
	values = (MEM | EXP)*

Assignment nodes, besides dealing with =, also handle function declarations: function foo() end is simply syntactic sugar for foo = function() end; they have the same semantic meaning. An assignment node stores a list of variable nodes on the left and a list of expression nodes on the right.

After that, EXP and MEM nodes get a bit tricky. I’d have to think about it some more. The concept is that MEM nodes are memory locations that can be assigned to or used in an EXP, while an EXP is the result of an operation that cannot be assigned to because it doesn’t represent a memory location. It is similar to the traditional distinction between left-hand and right-hand expressions (lvalues and rvalues). EXPs would also need to include FUNC, TABLE, STR, and NUM nodes. FUNC nodes recursively call the AST builder and contain a link that can later be used in a CALL, TABLE is a regular table declaration, while STR and NUM are simply inline constant strings and numbers.
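As a concrete sketch of the outline above (node and field names follow the outline, but the exact table shapes are my own guess), the statement x = 1 + 2 might become:

```lua
-- Hypothetical tree for: x = 1 + 2 (shapes are illustrative, not a spec).
local ast = {
  kind = "CHUNK",
  statements = {
    {
      kind = "ASSGN",
      locations = { { kind = "MEM", name = "x" } },
      values = {
        { kind = "EXP", op = "+",
          left  = { kind = "NUM", value = 1 },
          right = { kind = "NUM", value = 2 } },
      },
    },
  },
}

-- A tiny constant-folding pass over such nodes -- the kind of operation
-- an optimization-oriented AST is designed to support.
local function fold(node)
  if node.kind == "EXP" and node.op == "+" then
    local l, r = fold(node.left), fold(node.right)
    if l.kind == "NUM" and r.kind == "NUM" then
      return { kind = "NUM", value = l.value + r.value }
    end
    return { kind = "EXP", op = "+", left = l, right = r }
  end
  return node
end

local folded = fold(ast.statements[1].values[1])
assert(folded.kind == "NUM" and folded.value == 3)
```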

Now, this is just one example. Generally, ASTs are very diverse because they serve very different purposes. Some are for optimization, so flow-control decision points are emphasized while other parts are abstracted away, and attributes for constants (to be used in constant folding and propagation) are added; others are for type analysis and have built-in structures or attributes for different types. Basically, the goal of an AST is to be the simplest (most abstracted) structure possible while still allowing the desired operations/analysis to be performed on it.

Here are some general guidelines that would apply to designing an AST:

  • Have a clear definition of the purpose of the tree
  • Start with the parse tree and combine nodes that are not important to the purpose
  • Take out as much parse/syntax information as possible (abstraction)
    • variable names? Just need a link in the symbol table.
    • parentheses
    • : method calls (syntactic sugar that can be abstracted away)
  • Add any semantic information important to the purpose
    • attributes like isConstant or type
    • references to the symbol table or similar structures
    • structures to be mutated in operations acting on the AST
3 Likes

but be careful because the following two are not the same:

local f = function()
	-- no access to local f in this function, because f is defined after the function
end

local function f()
	-- access to local f in the function
end

Also, judging the difference between a parse tree and an AST by this, I think you are right in that I should be looking for a Lua parse tree, not an AST.

I will be postponing when I start working with an AST (I think I will use yours) because I want to spend time learning Go, since it seems like such a beautiful programming language.

Given the descriptions provided in this thread, it’s definitely more accurate to say that I implemented a parse tree rather than an AST. I’ll have to go in and fix the API to reflect that. You learn something new every day.

1 Like

I’m currently taking CS 5830, “Compiler Engineering,” which has been extremely useful. We are lightly touching most of the chapters of Engineering a Compiler; I found it to be a very interesting read and you probably would too.

3 Likes

I’ve never even managed to read a paper before, so I’m not sure how I will fare with an entire textbook, but I can try ;3

Maybe these are meta questions, but how do you recommend I go about the reading? Should I find a video lecture series that follows along with this (or a similar) textbook? Is it a good idea to read every chapter, or how do I determine which ones? Is printing out pages of the textbook or scouring local libraries for it a good idea? (idk if libraries even have these textbooks xd)

Any other suggestions would be greatly appreciated

Well, what are you trying to achieve? I’d be happy to direct you. I understand that you’d like to learn about parse trees (Chapter 3: Parsers, section 2: Expressing Syntax), but why do you want to know?

Generating parse trees is usually a byproduct of the parser. Parsers are pushdown automata (PDAs), which means they are deterministic finite automata (DFAs) augmented with a stack. You can think of a DFA as a set of states with transition symbols between them: the DFA remembers the state it is in, reads the next symbol, and transitions to the next state as defined by the input.

LR parsers push tokens onto the stack (shift). Once the proper tokens/productions to build up a production of the grammar are present on the stack (the DFA keeps track of that), those tokens/productions are popped off the stack (reduce), the parser returns to the state recorded beneath them on the stack (this is why a stack is needed, i.e., a PDA rather than just a DFA), and the new production is pushed on. The easiest way to make a parse tree from this is to attach the popped-off productions, during a reduction, to the item pushed onto the stack. That keeps track of which productions made up the new production on the stack. When the final reduction takes place (a CHUNK, in Lua’s case), it will contain all the statements that made up that chunk, and those statements will contain all the productions/terminals that made them up.

Recursive-descent LL parsers, which are generally more ad hoc and commonly handwritten for each grammar, work the other way: the uppermost function creates the CHUNK node and fills it in as the calls to the STATEMENT function return. LL can be thought of as top-down parsing, while LR can be thought of as bottom-up parsing.
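The LL case is the easiest to see in code. Here is a toy recursive-descent sketch (over a made-up one-rule grammar, not Lua’s) in which the top function creates the CHUNK node and fills it in as STATEMENT calls return:

```lua
-- Toy grammar: chunk ::= {stat},  stat ::= Name '=' Number
-- Tokens are pre-lexed strings; names and shapes are illustrative.
local function parse_chunk(tokens)
  local pos = 1
  local function peek() return tokens[pos] end
  local function take()
    local t = tokens[pos]
    pos = pos + 1
    return t
  end

  local function parse_statement()
    local name = take()                  -- Name
    assert(take() == "=", "expected =")  -- '='
    local value = tonumber(take())       -- Number
    return { kind = "ASSIGN", name = name, value = value }
  end

  -- Top-down: the CHUNK node is created first, then filled in
  -- as each call to parse_statement returns.
  local chunk = { kind = "CHUNK", statements = {} }
  while peek() do
    chunk.statements[#chunk.statements + 1] = parse_statement()
  end
  return chunk
end
```

Calling `parse_chunk({"x", "=", "1", "y", "=", "2"})` yields a CHUNK containing two ASSIGN statements, mirroring how a real Lua parser’s chunk rule loops over its statement rule.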

Usually the parse tree is restructured into a more suitable intermediate representation (IR). Most of the time this is an AST, but ILOC is another common choice for those who don’t like graph theory. Sometimes the parse tree is skipped completely because the parser used an attribute grammar framework for semantic analysis (section 4.3), syntax-directed translation after reductions (4.4), or stores the order in which states were popped off its stack and calls semantic actions in that order (post-parse SDT, which is what my parser does, since it handles ambiguity that may not be resolved until the end of the parse). GCC uses an AST IR, but my professor says he prefers an ILOC IR. ASTs are then used in GCC’s backend to generate assembly for different architectures, to perform type checking, and to optimize the code. Most IRs can also be used to perform security analysis or even to transpile into other programming languages.

Parsing is such a broad subject that has been studied since the birth of programming languages; as such it has a lot of material floating about!

(btw, I just pushed an update to my parser: it no longer crashes on grammars that allow epsilon productions! Also, main.lua now parses using the full Lua grammar instead of the simpler example grammar.)

3 Likes

I didn’t really understand your second and third sections, so I probably need to start from the beginning.
Initially I just wanted to make an AST that I could use to generate my own custom bytecode, to complement a bytecode interpreter (I also need to make that, but it is definitely a lot easier), to annoy exploiters trying to decompile my code. But now I think it would be cool to explore more of this subject, both out of curiosity and because I think it would make for cool projects. Specific examples I have thought of and want to make in the future: creating obfuscated logic, auto-optimizing code, grammar-based find-and-replace (like automatically changing my string-only function calls from f("string") to strictly f'string'), and (kind of extreme, but) perhaps a custom lexer + script editor, or a language that transpiles to Lua (probably converted to bytecode run by a Lua-based interpreter). I can’t say for certain how many of these I will actually follow through with, but the learning I do along the way will nonetheless be useful.
I think it would be most helpful if I had a structured curriculum to follow, such as a book or lecture video series, but (as mentioned in my previous response) I am a little worried about how well I would do with just a book.
Would you recommend any free lectures (such as one from MIT’s OCW) on this topic?

Ah, I see. You want it all! :stuck_out_tongue: You will need a parser to generate the AST, most likely by generating a parse tree first. I’d say take an existing one, like @Anaminus’s LL(1) parser (if you want to program in Go) or my GLR parser (if you want to program in Lua). Parsers are machines of pure Computer Science (CS), and building up the concepts behind them takes time. Really, the best way to study this is by taking upper-division CS courses at your local university. Note that the class I am taking is CS 5830 (5000 to 6000 being master’s-degree level), and it really builds up! So I’d recommend taking an existing parser for now. If not, you will want to research/study the Chomsky hierarchy: regular grammars (related to regular expressions and tokenizers), context-free grammars (syntax, parsers), and context-sensitive grammars (English, semantics). You will also want to study Turing machines, PDAs, and DFAs. Lambda calculus and set theory are bonuses. You’ll want a working understanding of graph theory and a solid understanding of basic data structures like stacks and linked lists. If you want to try hacking a parser together you probably can, but it would be by far easiest to make an LL(1) parser like Anaminus’s anyway, so you might as well use his.

Once you have a parser building a parse tree, optimization is chapter 8. The meat is about 50-60 pages, but it doesn’t go into depth on specific optimizations; it just glances over and introduces general concepts. Information on most optimizations can be found on Wikipedia or in the GCC documentation. I haven’t studied logic obfuscation, so IDK where would be best to look for that. But really, there are obfuscators out there which you could just send your code to in a POST request. I’m also opposed to obfuscation: it slows things down, provides no guarantees, and its sole purpose seems to be to annoy the heck out of everyone. xD Grammar-based find and replace can be performed on the parse tree itself. I’d look into depth-first vs breadth-first traversal (depth-first is easily done via recursion or a loop and a stack; breadth-first via a loop and a queue).
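A sketch of the two traversals over parse-tree-like nodes (the { kind, children } shape is my own, purely illustrative):

```lua
-- Depth-first via recursion; breadth-first via a loop and a queue.
local function dfs(node, visit)
  visit(node)
  for _, child in ipairs(node.children or {}) do
    dfs(child, visit)
  end
end

local function bfs(root, visit)
  local queue, head = { root }, 1
  while queue[head] do
    local node = queue[head]
    head = head + 1
    visit(node)
    for _, child in ipairs(node.children or {}) do
      queue[#queue + 1] = child
    end
  end
end
```

A find-and-replace pass is then just a visit function that inspects each node’s kind and rewrites matching nodes in place.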

When you say a custom lexer and script editor, or a language that transpiles to Lua, I believe you are talking about writing a custom syntax that transpiles into Lua. That’s basically what I’m planning on doing with my typed Lua language, Luafide. Regular Lua will still be valid Luafide, but the grammar will have some additional syntax to define types. The types will then be checked and stripped out when compiled. Things like macros (basically compile-time functions/inlines) are pretty easy to add to any language. I decided to make a GLR parser because it handles any CFG you can write. LL or even LR parsers have limitations, and if you violate them the parser may fail to find the parse, crash, or loop infinitely. It requires knowledge of formal grammars to perform operations like left-recursion elimination. I wrote my GLR parser to skip all the hassle of manipulating grammars: you give it a grammar, and it works! So if you plan to do a lot of manipulating grammars, I can only recommend using a full CFG parser like GLR, Earley, CYK, or the more recent (2010) GLL (the most desirable of those being GLR (see here)).

Well, I’ve written a lot but feel like I’ve hardly scratched the surface. Compiling is such a HUGE topic that it’ll take a couple weeks to wrap your mind around, months to understand, and years to master. The good news is that each subtopic is really interesting (at least I think so)! Since you’re not much of a book person, I’d recommend taking a more hands-on approach, then divide and conquer. You can have a long-term goal, but break it down and celebrate the steps; even small steps in this field are huge accomplishments! I’d also recommend reusing other people’s code so you can reach usable end products quickly. Then you can replace those parts with your own later. You’ll find that for each subtask you research there will be the challenge of finding exactly what it is called in the literature (I was self-taught, so I had concepts of things I didn’t know the name for when I entered college), and of translating what they are saying into concepts that you understand.

So: use someone else’s parser. Make a converter from a parse tree to Lua or bytecode. From there, add in optimization/obfuscation operations one by one. Editing the syntax requires changing the parser, so at that point you may want to roll your own (unless you use a parser generator, like my GLR). Study some of the topics I listed above as you come across them in the literature for the task you are working on. You’ll find that most sources already expect you to know the things I listed, and so will simply mention them rather than be a baby-step tutorial, because there is just so darn much to it.

2 Likes

Hey, so I looked around and found Lecture Notes | Computer Language Engineering | Electrical Engineering and Computer Science | MIT OpenCourseWare
I’ve read through the first session’s lecture notes (slideshow) and, as of writing this, I am currently looking through lecture two. Just from glancing through it, would you say it is a good resource, or should I look further?

That seems to broadly summarize how to do things, so it may be useful. If you’re like me, you’ll need to consult multiple sources until you have a solid understanding of a concept. It is useful to see how multiple people do it, so you can determine what is personal preference and what is important to the concept.

I recently found this professor’s videos to be really good; here is his tutorial series on finite automata and formal languages (98 videos):

Data structures: (141 videos)

Compiler Design: (104 videos)

And if you can understand his accent, this gentleman has some truly excellent videos. I had trouble implementing the computation of the canonical collection of LR(0) items until I listened to one of his videos 4 or 5 times. I actually left a comment on that video; maybe you’ll see it one day!

3 Likes