The Lua 5.1 grammar as shown in the manual may be suitable for building a parse tree, but is too complex to make a desirable abstract syntax tree. Not to mention that AST are usually generated on a function by function basis rather than file by file. Here is an explanation of the GNU compiler collection’s AST. [here]
I’m currently working on building a GLR parser which I plan to use to add type checking. GLR parsers are known for how easy they are to use, because grammars do not have to be mutilated to be parsed by a GLR parser. In addition, my parser requires no installation, and can be run in any 5.1 or newer Lua environment. Do note though that RBXLua’s require
takes in an instance instead of a string path, so to run the parser in RBXLua you will have to edit the require paths.
I haven’t put much though into ASTs for Lua yet, but I can imagine a basic outline. Here is an example of what I envision an AST would look like after having been converted from the parse tree:
CHUNK:
statements = (LOOP | BRANCH | CALL | ASSGN | BREAK | RETURN)*
The parsing done before makes sure that a return statement isn’t in the middle of a chunk, and other things like that.
LOOP:
chunk = CHUNK
Every loop in Lua can be converted into a section of code that continuously gets rerun until a break or return is encountered. I’d add assignment and branch nodes to the loop’s chunk node to represent for loops, generic for loops, repeat loops, and while loops. This huge abstraction from the parse tree.
BRANCH:
condition = EXP
chunk = CHUNK
[next = BRANCH]
The EXP for an else branch would be set to true, allowing all parts of an if statement to be abstractly represented as a branch node.
CALL:
prefix = MEM
args = (EXP)*
The call node would contain a MEM node, and links to the argument expression nodes. A method call using : would simply prepend the prefix to the argument list. Don’t ask me about metatables. It seems to me that metatables make almost any static semantic analysis very, very difficult. Just don’t use them 
ASSGN:
locations = (MEM)*
values = (MEM | EXP)*
Assignment nodes besides dealing with =
, also handle function declarations. function foo() end
is simply syntactic sugar for foo = function() end
. They both have the same semantic meaning. It stores a list of variable nodes on the left, and a list of expression nodes on the right.
After that, EXP and MEM nodes get a bit tricky. I’d have to think about it some more. The concept is that MEM nodes are memory locations that can be assigned to or used in an EXP, while an EXP is the result of an operation that cannot be assigned to because it doesn’t represent a memory location. It is similar to the difference between a traditional right-hand and left-hand expression. EXPs would also need to include FUNC, TABLE, STR, and NUM nodes. FUNC nodes recursively call the AST builder and contain a link that can be later used in a CALL, TABLE is a regular table declaration, while STR and NUM are simply inline constant strings and numbers.
Now this is just one example. Generally ASTs are very diverse because they serve very different purposes. Some are for optimization so flow-control decision points are emphasized while other parts of abstracted away and attributes for constants to be used in constant folding and propagation are added, while other are for type analysis and have built in structures or attributes for different types. Basically the goal of an AST is to be the simplest structure (most abstracted) possible while still allowing desired operations/analysis to be performed on it.
Here are some general guidelines that would apply to designing an AST:
- Have a clear definition of the purpose of the tree
- Start with the parse tree and combine nodes that are not important to the purpose
- Take out as much parse/syntax information as possible (abstraction)
- variable names? Just need a link in the symbol table.
- parenthesis
-
:
method calls (syntactic sugar that can be abstracted away)
- Add any semantic information important to the purpose
- attributes like
isConstant
or type
- references to the symbol table or similar structures
- structures to be mutated in operations acting on the AST