I would shamelessly self-advertise my GLR parser again, but it doesn’t have semantic analysis yet, so while it could parse your language, you wouldn’t be able to do anything with the parse.
With any parsing, it is easiest to break the work down into three steps: lexical, syntactic, and semantic analysis. Lexical analysis is commonly called tokenizing and can be performed using regular expressions. It groups the stream of characters into keywords, numbers, punctuation, and similar items. Once this is done, syntactic analysis can begin; syntax is described by a context-free grammar, which is more than regular expressions can handle, so a more traditional parser is needed here. It assembles the tokens produced in the previous stage into if statements, loops, assignments, expressions, and so on. Semantic analysis takes that syntax and determines what everything means when put together. It needs still more power, on the order of context-sensitive grammars or a full Turing machine. Luckily, most programming languages (aside from the likes of HTML, XML, and CSS) are Turing complete. This is the most complex stage, but performing the previous steps makes it much, much easier: 5 on its own means nothing, but with context around it, ‘return 5’ or ‘foo = 5’ does.
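To make that concrete, here is roughly how one made-up query (using the grammar below) would move through the three stages:

input:     select 1 thru 5 at 10
lexical:   select(KEYWORD) 1(NUMBER) thru(KEYWORD) 5(NUMBER) at(KEYWORD) 10(NUMBER)
syntactic: STATEMENT(values: RANGE(1 thru 5), at: 10)
semantic:  whatever “select 1 thru 5 at 10” should actually do in your program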
From what I can tell, the grammar that defines your language is very simple and can be handled easily by almost any parsing method you choose. Here is a suitable grammar:
STATEMENT => select VALUES at NUMBER
VALUES => VALUE RELATION VALUES | VALUE
VALUE => NUMBER | RANGE
RANGE => NUMBER thru NUMBER
RELATION => and | or
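For example, the made-up input ‘select 1 thru 5 and 8 at 10’ derives from it like so:

STATEMENT => select VALUES at NUMBER
          => select VALUE RELATION VALUES at NUMBER
          => select RANGE and VALUE at NUMBER
          => select NUMBER thru NUMBER and NUMBER at NUMBER
          => select 1 thru 5 and 8 at 10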
And here is what I would use to tokenize the input (untested):
-- Splits the input into word tokens and classifies each one as a
-- NUMBER (all digits) or a KEYWORD (anything else).
local function tokenize(input)
    local tokens = {}
    local tokenTypes = {}
    for token in input:gmatch '%w+' do
        local i = #tokens + 1
        tokens[i] = token
        tokenTypes[i] = token:match '^%d+$' and 'NUMBER' or 'KEYWORD'
    end
    return tokens, tokenTypes
end

return tokenize -- so the parser below can require this as a ModuleScript
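For example (assuming the function above lives in a ModuleScript named tokenize, as the parser below expects), you would get output like this:

local tokenize = require(script.tokenize)
local tokens, types = tokenize 'select 1 thru 5 at 10'
for i = 1, #tokens do
    print(tokens[i], types[i])
end
--> select KEYWORD, 1 NUMBER, thru KEYWORD, 5 NUMBER, at KEYWORD, 10 NUMBER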
I’d recommend making what is commonly called an LL(1) parser. As far as parsers go, this is one of the easiest to create. It is a recursive descent parser, and while not the most powerful kind, it will work for the grammar described above and, I believe, for most grammars describing query-language-like languages. (One caveat: both alternatives of VALUE start with a NUMBER, and an LL(1) parser can’t back up to retry, so the code below left-factors that production; it reads the NUMBER once and treats the ‘thru NUMBER’ tail as optional.)
Your LL(1) parser should probably end up being about 100-150 lines. I’d recommend reading a bit about them online (parsers are a well-explored field with plenty of material). Here is a basic framework to get you going. You basically create a function for every left-hand side production in the grammar above:
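To give you the flavor, here is a hypothetical hand-written function for the RANGE production, using the is/isA helpers and the success flag from the full parser below; the table-driven framework just automates this pattern:

-- hand-rolled sketch of RANGE => NUMBER thru NUMBER
-- (is, isA, and success are defined in the parser below)
local function RANGE()
    local node = { name = 'RANGE' }
    node.start = isA 'NUMBER'
    if not success then return end
    is 'thru'
    if not success then return end
    node.finish = isA 'NUMBER'
    if not success then return end
    return node
end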
EDIT: turns out I had so much fun making the example framework, and that the framework made it so easy to add the productions, that I just made it all… This parser should work for you (but note that it is untested, I’ll leave that to you):
local tokenize = require(script.tokenize)

-- current parsing state, shared by the functions below
local getToken
local getType
local nextToken
local success

-- Consumes the current token if it equals the given literal.
local function is(literal)
    local token = getToken()
    success = token == literal
    if success then
        nextToken()
        return token
    end
end

-- Consumes the current token if it has the given type.
local function isA(tokenType)
    success = getType() == tokenType
    if success then
        local token = getToken()
        nextToken()
        return token
    end
end

-- Lets a table be "called" with another table to install itself as
-- that table's metatable, e.g. AND { ... } tags a production as AND.
local setMTOnCall = {
    __call = function (mt, t)
        return setmetatable(t, mt)
    end
}
-- op types: OR runs its actions until one succeeds; AND runs its actions
-- in order and requires each to succeed, unless an action is marked
-- 'STOP' (optional: stop here, successfully) or 'CONT' (optional: keep
-- going even on failure).
local OR = {
    default = false; -- ran out of alternatives without a match
    cont = function () return not success end; -- stop at the first success
}
OR.__index = OR
setmetatable(OR, setMTOnCall)

local AND = {
    default = true; -- every action matched
    cont = function (option)
        if success then
            return true
        elseif not option then
            return false
        elseif option == 'STOP' then
            success = true
            return false
        elseif option == 'CONT' then
            return true
        else
            error 'Unrecognized AND operation'
        end
    end;
}
AND.__index = AND
setmetatable(AND, setMTOnCall)
local root
local productions -- forward declaration: branch and the table reference each other

-- Parses one production: creates a parse-tree node, runs the production's
-- actions in order, and stores each result under the key named by the
-- action's third element. Returns the node on success so the caller can
-- attach it to its own node.
local function branch(name)
    local parent = root
    local newRoot = { name = name }
    root = newRoot
    local actions = productions[name]
    for _, action in ipairs(actions) do
        local v = action[1](action[2])
        if action[3] then
            root[action[3]] = v
        end
        -- action[4] is the optional 'STOP'/'CONT' flag for AND productions
        if not actions.cont(action[4]) then
            root = parent
            return success and newRoot or nil
        end
    end
    success = actions.default
    root = parent
    return success and newRoot or nil
end

productions = {
    STATEMENT = AND {
        {is, 'select'};
        {branch, 'VALUES', 'values'};
        {is, 'at'};
        {isA, 'NUMBER', 'at'};
    };
    VALUES = AND {
        {branch, 'VALUE', 'value'};
        {branch, 'RELATION', 'relation', 'STOP'};
        {branch, 'VALUES', 'next'};
    };
    -- VALUE is left-factored: both of the grammar's alternatives start
    -- with NUMBER, so read the number once and treat the range tail as
    -- an optional continuation.
    VALUE = AND {
        {isA, 'NUMBER', 'number'};
        {branch, 'RANGE', 'range', 'STOP'};
    };
    RANGE = AND {
        {is, 'thru'};
        {isA, 'NUMBER', 'finish'};
    };
    RELATION = OR {
        {is, 'or', 'op'};
        {is, 'and', 'op'};
    };
}
return function(input)
    local tokens, types = tokenize(input)
    local i = 1
    function getToken() return tokens[i] end
    function getType() return types[i] end
    function nextToken() i = i + 1 end
    root = { name = 'ROOT' }
    root.statement = branch 'STATEMENT'
    -- only accept if the statement parsed AND consumed every token
    return success and getToken() == nil, root
end
When included as a module and called with the input string, it will spit out two values. The first is whether the input string is part of the language defined by the grammar. The second is the root of the parse tree. It is your job to navigate the parse tree and convert it into the actions that you want (semantic analysis). I’d recommend feeding it a simple test string and then printing out the resulting tree to get a feel for the trees it produces. The third element of each production action controls the key under which the parsed token (or subtree, for branch actions) is stored in the parse tree, if present.
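For example, something along these lines (assuming the module above is a ModuleScript named parse with the tokenizer as its child):

local parse = require(script.parse)
local ok, tree = parse 'select 1 thru 5 and 8 at 10'
if ok then
    -- each VALUES node holds one value and chains onward through 'next'
    local values = tree.statement.values
    while values do
        local value = values.value
        if value.range then
            print('range', value.number, 'thru', value.range.finish)
        else
            print('number', value.number)
        end
        values = values.next
    end
    print('at', tree.statement.at)
end
--> range 1 thru 5
--> number 8
--> at 10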
LMK if you run into problems debugging the parser (I have lots of practice making parsers in Lua, so it should be close to working).
Edit: fixed the option variable in AND.cont. It referred to action[4], which doesn’t exist in that context, but is exposed through the option parameter.