Yo! Introducing P+, a parser generator derived from nearley.js!
Before you go any further: you can’t create a full programming language with this alone, but it makes up about 2/3 of the process, the next step being interpretation or execution of some sort. P+ is simply the parser generator; the plugin also provides a lexer.
A while ago I created this topic
And since then, I’ve been really hard at work researching and building a parser. It’s been just over a month now and I’m happy to say it’s finished. Or at least working…
Psst! I’ve also made a lexer, derived from moo.js
Use P+ to create your own parser with informative error messages and an optimized parsing algorithm.
Simply define a grammar and an input, and boom: your own top-down parser. For the convenience of all you lovely developers, I’ve also made a syntax-highlighted grammar editor so you can avoid the ugly regular syntax.
Module and Plugin for the impatient:
Module
Time Complexity
A little background: P+ uses the Earley algorithm, which means it can parse both ambiguous and unambiguous grammars. This means things such as right recursion:
prod -> %token prod
are possible! The only caveat is that it runs in O(n^3) time, a cubic time complexity, in the worst case. Without optimization, it’s good practice to shy away from right recursion in an Earley parser. Yet using Joop Leo’s right-recursion optimization, such grammars can run in O(n) time, a linear time complexity.
The Earley Algorithm
While I can’t give you a full breakdown of parsers, lexers, and the like, I’ll give a brief explanation of how it works.
I’ll be providing a tutorial on the algorithm soon, and I’ll link it here.
For now, refer to the resources provided here:
Parsing Generator using Earley Algorithm
P+
In this section I’ll explain a bit more about the module itself.
Requiring the module returns a dictionary containing Parser and Rule; for typical use you can disregard Rule.
We call the Parser function with a grammar
A grammar is a set of production rules that describes to P+ the syntax of your language. If you’d like some resources explaining this stuff, refer here: Parsing Generator using Earley Algorithm
Calling our Parser function returns a table whose most important method is :Parse(input)
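To make that concrete, here’s a minimal usage sketch. The require path, the grammar variable, and the input string are all illustrative assumptions, not part of P+ itself:

```lua
--Illustrative only: adjust the require path to wherever you placed P+.
local PPlus = require(script.Parent.PPlus)

local parser = PPlus.Parser(grammar) --grammar built as described below
local results = parser:Parse("1 + 2") --parse results (ambiguous grammars may yield several)
```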
“Alright, enough of this, explain the module already!” Alright, got it, got it.
First I’ll explain the grammar. A typical grammar looks like this (unless you’re using the plugin):
{
	{name = "", symbols = {
		...
	}, postProcess = function},
	...
	lexer = lexer or nil,
	startRule = "" or nil
}
For starters, our grammar is a mixed table composed of rules at the numbered indices and extra data at the string keys.
Each rule has a name, a compilation of symbols, and a postProcess.
The name is exactly what it sounds like: the name. The symbols table is composed of string or table entries.
Strings are non-terminal symbols; tables are terminal symbols.
A terminal symbol matches against our input stream, while a non-terminal refers to another rule. A typical symbols table looks like this:
{
	{literal = "!"}, --Checks the current token's value for exactly this value
	{token = "myToken"}, --Checks the current token's *type* for this value
	"yo" --Checks another rule for input validation
}
Next, our postProcess is a function; it receives the data of the completed rule, which is useful for generating semantic derivatives like an abstract syntax tree.
Notice: terminal symbols provide the token as their data, which is an object; refer to my lexer above to learn more about that.
A typical postProcess would look like this:
function(data)
	--Assume the production is: "term", {literal = "!"}
	return {
		word = data[1], --result of the "term" rule
		exclamation = data[2].value --the "!" token's value
	}
end
Symbols defined with a table return their tokens, meaning you will probably have to refer to the lexer documentation. Non-terminal symbols, those defined with a string, return the result of the completed rule’s postProcess; by default, a postProcess returns the data table shown previously.
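Tying the pieces together, here is a sketch of a complete grammar that matches a word followed by “!”. The token type "word" and the myLexer variable are assumptions about how you configured your lexer:

```lua
local grammar = {
	{name = "exclaimed", symbols = {
		{token = "word"},  --terminal: a token of type "word"
		{literal = "!"},   --terminal: a token whose value is exactly "!"
	}, postProcess = function(data)
		return {
			word = data[1].value,        --the "word" token's value
			exclamation = data[2].value, --the "!" token's value
		}
	end},

	lexer = myLexer,         --your lexer instance (see the lexer post above)
	startRule = "exclaimed", --the rule parsing starts from
}
```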
The internals of the parser will be explained later as I’ve said before.
The Plugin
The plugin has three sections: Parser, Lexer, and Loader.
The Parser section uses readable syntax to help write grammars. The syntax is:
prodName -> "string" %token terminal
where prodName refers to the name, the arrow marks the start of the symbols, "string" is a literal, %token checks token types, and a bare name like terminal refers to another rule defined elsewhere (a non-terminal).
Post processes are defined as {code here} after each expression.
Subexpressions are also supported; the “|” character performs the logic of “or”:
prodName -> "string" | %token
This matches either the string or a token of type “token”.
Use parentheses to wrap a subexpression into a single symbol:
prodName -> ("string" | %token) "xolt"
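Putting the plugin syntax together, a small grammar might look like this. The rule and token names are illustrative, and the {…} bodies are a sketch assuming they follow the postProcess convention described earlier:

```
greeting -> ("hello" | "hi") %name "!" {return data[2].value}
name     -> %name
```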
The Lexer section gives you a list of dropdowns to define tokens; refer to the lexer post to learn more. Basically, each token has a match, a name, and an error flag: match is the token’s regular expression, and error tells the lexer whether to error upon encountering this token.
Finally, the Loader section allows you to load generated parsers into the plugin for editing.
Writing this has shown me just how terrible I am at explaining, but I believe this is a great resource, so I wanted to share. For a far better explanation, you can use the nearley explanation: Here
Anyways, hope y’all find this useful, and if you find any bugs or things I could improve, please let me know (nicely). Also, there’s a place to try out the parser: it parses arithmetic expressions using the parser in the flesh. Find it in the Earley Algorithm section.
TODO:
Type/Value transforms in Lexer
Regular lua code syntax highlighting for postProcesses in plugin
Loading post processes in plugin
General code clean up
Memory Optimizations
Fixes: