How are Lua VMs possible?

I’ve been thinking about this for a while now: how is it even possible to run code from a string without using loadstring?

An example would be this:

I tried looking through the code but it just looks like gibberish to me :sweat_smile:

I’m just curious about how these work and how it is possible! :grin:


I think you should repost this in #help-and-feedback:scripting-support


I just moved the topic! :grin:


I’m confused tbh. One thing I know is Roblox Studio runs a Lua VM to execute code.

I don’t know for sure, but looking at the code, it seems to create a dummy environment [inside the script], compile the source into bytecode it can read, and then run that bytecode.


I feel so dumb right now, but I have some questions still:

  1. What is a “dummy environment” or a script environment in general?
  2. What is bytecode?
  3. How do you even compile whatever bytecode is without loadstring?

Bytecode is like an array of bytes.
It looks like a string, but its characters stand for instructions. As far as I know, they are just serialized instructions that a Lua VM can deserialize and execute, but I’m not sure.


What I meant by “dummy environment” is that the code being interpreted is confined to a table that is attached to it (similar to how you use the script editor: the script is like the table).
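To make the “environment is a table” idea concrete, here is a rough Python analogue (the thread is about Lua, so treat this as an illustration only). The helper name `run_in_environment` and the `env` dict are made up for this sketch; Python's built-in `exec` accepts a dict to use as the global namespace, which plays a role similar to a Lua function environment. Note that this is not a real security boundary in Python.

```python
# Hypothetical illustration: the "environment" is just a dict of names
# that the executed code is allowed to see.
def run_in_environment(source, env):
    """Run `source` with `env` as its only global namespace."""
    exec(source, env)
    return env


output = []
env = {
    "__builtins__": {},      # hide Python's normal built-ins
    "print": output.append,  # our own stand-in for print
}

# The code below can only reach the names stored in `env`.
run_in_environment('print("hello from inside")\nx = 1 + 2', env)
```

After this runs, `output` holds the "printed" text and `env["x"]` holds the value the code assigned, so everything the code did stayed inside the table it was given.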


You can read this page about bytecode. (Though, some things might be incorrect)


Technically, the script isn’t running any code. It’s interpreting the source being given to it. loadstring probably does something similar. You can do

loadstring("print(\"string\")")

But nothing would happen, because loadstring only compiles the string and returns a function. You have to call that function to actually execute the code.

loadstring("print(\"string\")")() -- executes the given code
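The same two-step split (compile first, run later) exists in other languages too. As a loose Python analogue, assuming nothing beyond the built-in `compile` and `exec` functions, the sketch below shows that compiling alone has no visible effect:

```python
source = 'result.append("string")'
result = []

# Step 1: compile the source. This only produces a code object.
code = compile(source, "<string>", "exec")
ran_before = list(result)  # snapshot: nothing has been appended yet

# Step 2: actually execute the compiled code.
exec(code, {"result": result})
```

Here `ran_before` stays empty, while `result` only receives its item once `exec` is called.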

Don’t take what I say as factual because I myself am not educated much on this side.


I got interested in this topic, so I decided to clarify some things that this Lua “interpreter” does. Maybe it helps to understand how a VM really works.

Concept

Take JavaScript as an example: it only runs because of its VM (V8, in Chrome). But what is a VM? A language VM is a program that reads and executes the code you wrote. Simple, isn’t it? But how exactly do they work?

The parse

A VM won’t really understand your code the way you do. That’s why there is a parsing step, which basically turns all your code into bytecode. But what is bytecode? Bytecode is a form of your code that is easier and faster for the VM to interpret and execute. So all the code you wrote basically turns into numbers (what do I call it) that are not meant for people to read; only the VM was made to understand them.

Note: In Lua, constants, locals, upvalues, and functions (for example the string “Hello world!” or the function name print) are not turned into instructions themselves; they are stored alongside the bytecode (which is why the VM can’t optimize them).

There’s still a lot to say here, but I’ll just leave the general idea

Finally… The Interpreter

Finally we are here: the interpreter. As I said, an interpreter reads your code and executes it, but first your code basically becomes numbers that are easier and more efficient to read.

Let’s take it easy… Each VM has a different way of parsing and interpreting. Let’s look at Lua. Honestly, the Lua interpreter doesn’t have much magic; it’s simpler than the interpreters for JavaScript and many other languages (if you want to create an interpreted language, I would recommend looking at Lua).

When the VM starts executing the code, it reads each instruction in turn. For example, in vLua the opcode for subtraction is SUB, which is the number 6. The VM knows that 6 represents subtraction, so it performs a subtraction and then moves on to the next instruction.
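A minimal sketch of that read-then-execute loop, written in Python for brevity. Only `SUB = 6` comes from the post; the `PUSH` and `HALT` opcodes, their numbers, and the `run` function are invented for this example. The sketch also uses a stack, whereas the real Lua VM works with registers.

```python
# SUB = 6 follows the post; PUSH and HALT are made-up opcodes for this sketch.
PUSH, SUB, HALT = 1, 6, 0


def run(code):
    """Read one opcode at a time and act on it until HALT."""
    stack = []
    pc = 0  # index of the next instruction to read
    while True:
        op = code[pc]
        pc += 1
        if op == PUSH:
            stack.append(code[pc])  # the operand follows the opcode
            pc += 1
        elif op == SUB:
            right = stack.pop()
            left = stack.pop()
            stack.append(left - right)
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")


answer = run([PUSH, 10, PUSH, 4, SUB, HALT])  # computes 10 - 4
```

The whole "VM" is one loop and one chain of conditionals, which is the general shape being described here.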

Of course, vLua doesn’t work exactly like a native interpreter. It processes the source string and emulates an interpreter using ordinary code blocks and conditional statements, and Lua itself does the actual work for it. (But yeah, it’s hard to do: 5342 lines just for that.)

You can see with your own eyes how Lua is parsed here: https://www.luac.nl/ (just a website that turns Lua to bytecode)

Here is the bytecode for a simple print("Hello world!"):
[image]


This is incorrect.
Bytecode is made of numbers.

For example:
If we had the Python code

print("Hello!")

our bytecode would be

  1           0 LOAD_NAME                0 (print)
              2 LOAD_CONST               0 ('Hello!')
              4 CALL_FUNCTION            1
              6 POP_TOP
              8 LOAD_CONST               1 (None)
             10 RETURN_VALUE

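Strip the names from that human-readable listing and only numbers are left. Here are each instruction’s byte offset and argument (in the real bytecode, each instruction name is also stored as a numeric opcode):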

0 0
2 0
4 1
6
8 1
10

or…

0 0 2 0 4 1 6 8 1 10
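If you want to look at the raw numbers yourself, CPython exposes them. This is a small sketch using the standard `compile` built-in and the `dis` module; the exact bytes and instruction names differ between Python versions, so no particular output is shown here.

```python
import dis

# Compile the one-line program without running it.
code = compile('print("Hello!")', "<example>", "exec")

# The bytecode itself: a sequence of integers between 0 and 255.
raw = list(code.co_code)

# The human-readable names that the disassembler assigns to those numbers.
names = [instruction.opname for instruction in dis.get_instructions(code)]

# Constants are stored next to the bytecode, not inside it.
constants = code.co_consts
```

Printing `raw` next to `names` shows how each readable instruction name corresponds to plain numbers in the compiled code.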

bytes

Please take everything I say with a grain of salt. With different VMs come different implementations. Some may parse directly into bytecode, but who knows. That sort of info isn’t very easy to find, afaik.

I agree with a lot of what you said, but you skipped a ton of steps.
First we have lexical analysis (which is done by the lexer).

Lexical analysis converts the code into tokens, like:

5+2

—>

[
    {
        type: 'num',
        value: '5',
    },
    {
        type: 'op',
        value: '+',
    },
    {
        type: 'num',
        value: '2',
    }
]
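A lexer for this tiny example can be sketched in a few lines of Python. The `tokenize` function and the `TOKEN_RE` pattern are hypothetical names for this illustration; the token shape (`type` and `value`) follows the listing above, and only whole numbers and the four basic operators are handled.

```python
import re

# One alternative per token kind; leading whitespace is skipped.
TOKEN_RE = re.compile(r"\s*(?:(?P<num>\d+)|(?P<op>[+\-*/]))")


def tokenize(source):
    """Turn a source string into a list of {type, value} tokens."""
    tokens = []
    pos = 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        kind = match.lastgroup  # "num" or "op", whichever group matched
        tokens.append({"type": kind, "value": match.group(kind)})
        pos = match.end()
    return tokens


tokens = tokenize("5+2")
```

For the input `5+2` this yields three tokens, matching the list shown above.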

Next is the parser, which does syntactic analysis.
Syntactic analysis converts a token ‘stream’ into an object, also known as an ‘Abstract Syntax Tree’, or AST for short.

Pretty much, we take what we have above and convert it to:

{
    type: 'program',
    body: [
        {
            type: 'bin_op',
            operator: '+',
            left: {
                type: 'literal',
                value: 5,
            },
            right: {
                type: 'literal',
                value: 2,
            }
        }
    ]
}
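As a rough sketch of that step, the Python function below turns the token list for `5+2` into the tree shown above. The name `parse` and its internals are assumptions made for this example; it only handles a flat chain of numbers and operators, with no operator precedence or parentheses.

```python
def parse(tokens):
    """Build a small AST from a flat list of num/op tokens."""
    pos = 0

    def literal():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos]["type"] != "num":
            raise SyntaxError("expected a number")
        node = {"type": "literal", "value": int(tokens[pos]["value"])}
        pos += 1
        return node

    node = literal()
    # Fold each "op number" pair into a left-leaning tree.
    while pos < len(tokens) and tokens[pos]["type"] == "op":
        operator = tokens[pos]["value"]
        pos += 1
        node = {
            "type": "bin_op",
            "operator": operator,
            "left": node,
            "right": literal(),
        }
    if pos != len(tokens):
        raise SyntaxError("unexpected token")
    return {"type": "program", "body": [node]}


ast = parse([
    {"type": "num", "value": "5"},
    {"type": "op", "value": "+"},
    {"type": "num", "value": "2"},
])
```

The string values from the tokens become real numbers in the tree, which is why the AST shows `5` rather than `'5'`.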

As you can see, we still have no bytecode.
The bytecode is the next stop.

We have to compile it to bytecode.
Our output should be:

0 0      # GET_CONST 0 (5)
0 1      # GET_CONST 1 (2)
1        # BINARY_ADD

You may have noticed that we’re using some sort of GET_CONST. Many (if not all) VMs use something called a ‘constant pool’. It’s a list of constants that is stored beforehand, and the bytecode refers to them by index. It’s nothing too complicated, so I won’t talk about it. Feel free to search it up though.
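Here is one way the compile step and the constant pool could look, sketched in Python. The opcode numbers (`GET_CONST = 0`, `BINARY_ADD = 1`) are taken from the listing above; the function name `compile_node` and the flat-list bytecode layout are assumptions for this example, and only addition is supported.

```python
GET_CONST, BINARY_ADD = 0, 1  # opcode numbers from the listing above


def compile_node(node, code, constants):
    """Append bytecode for `node` to `code`, filling `constants` as needed."""
    if node["type"] == "literal":
        if node["value"] not in constants:
            constants.append(node["value"])  # store each constant once
        code.extend([GET_CONST, constants.index(node["value"])])
    elif node["type"] == "bin_op" and node["operator"] == "+":
        compile_node(node["left"], code, constants)
        compile_node(node["right"], code, constants)
        code.append(BINARY_ADD)  # operands first, then the operation
    else:
        raise NotImplementedError(f"cannot compile {node['type']}")


expression = {
    "type": "bin_op",
    "operator": "+",
    "left": {"type": "literal", "value": 5},
    "right": {"type": "literal", "value": 2},
}
code, constants = [], []
compile_node(expression, code, constants)
```

The result is the flat number sequence `0 0 0 1 1` plus a pool holding 5 and 2, so the bytecode never contains the values themselves, only their positions in the pool.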

Now, after all of those steps, we can finally interpret the bytecode.
We go step by step, reading each byte and executing something based on the current byte. Many bytecode VMs are stack-based (CPython and the JVM, for example), although some are register-based, including the standard Lua VM since Lua 5.0.

On a stack-based VM, running the bytecode above leaves 7 on top of the stack.
Also, the bytecode I showed is not completely accurate. Some compilers optimize and do the addition beforehand (constant folding), so only the result ends up in the bytecode.
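To see the stack behaviour in action, here is a small Python sketch of a stack-based interpreter for the three-instruction program above. The opcode numbers follow the earlier listing (`GET_CONST = 0`, `BINARY_ADD = 1`); the `execute` function and the flat list encoding are assumptions for this example, and no constant folding is done.

```python
GET_CONST, BINARY_ADD = 0, 1  # opcode numbers from the earlier listing


def execute(code, constants):
    """Run the bytecode and return the final stack."""
    stack = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op == GET_CONST:
            # The next number is an index into the constant pool.
            stack.append(constants[code[pc + 1]])
            pc += 2
        elif op == BINARY_ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
            pc += 1
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack


final_stack = execute([0, 0, 0, 1, 1], [5, 2])
```

The two GET_CONST instructions push 5 and 2, and BINARY_ADD replaces them with their sum, so the stack ends up holding a single 7.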

This topic is very big. I would recommend looking for some tutorials if you’re interested in it. There is also a pretty cool tool which lets you view the assembly (and, for some languages, the bytecode) that a compiler produces: https://godbolt.org/


If you want to make an interpreted language, I’d recommend starting with an AST-Walker not a VM.

An AST-Walker pretty much walks the result of the parse (the AST) directly and evaluates it, rather than compiling it first.
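For comparison with the bytecode route, a tree-walking evaluator for the same kind of AST fits in a few lines of Python. The `evaluate` function and the `OPS` table are hypothetical names for this sketch; the node shapes (`program`, `bin_op`, `literal`) follow the AST shown earlier in the thread.

```python
import operator

# Map operator symbols to the functions that implement them.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(node):
    """Recursively evaluate an AST node and return its value."""
    kind = node["type"]
    if kind == "program":
        result = None
        for statement in node["body"]:
            result = evaluate(statement)  # value of the last statement wins
        return result
    if kind == "literal":
        return node["value"]
    if kind == "bin_op":
        left = evaluate(node["left"])
        right = evaluate(node["right"])
        return OPS[node["operator"]](left, right)
    raise ValueError(f"unknown node type {kind}")


value = evaluate({
    "type": "program",
    "body": [{
        "type": "bin_op",
        "operator": "+",
        "left": {"type": "literal", "value": 5},
        "right": {"type": "literal", "value": 2},
    }],
})
```

There is no instruction pointer, stack, or constant pool to manage here, which is why this approach tends to be the gentler starting point.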

The concept of VMs and bytecode can get very confusing very fast, especially when you have to think about the stack, the call stack, the garbage collector, the instruction pointer, the stack pointer, the variables, and a lot of other interesting things.





I must say, the way that these modules are able to compile to bytecode confuses me. Executing bytecode with loadstring was removed a long while back, and there’s no other way I know of to get an actual result from bytecode in Lua. Interesting topic though.

I don’t know what they used exactly, but since the main purpose of VMs is to create a sandbox environment, you can achieve this in two ways. The first is to use getfenv/setfenv with custom implementations of sensitive globals such as _G or the DataModel. It’s not a VM, but it’s close enough for most use cases. The second, if you want a true VM, is to get a Lua interpreter written in Lua. There are quite a few available.


You are mistaken.
The main purpose of bytecode VMs is not sandboxing. It’s to create an intermediate-level language. Please do some more research.


Do some research yourself before correcting someone. Their real purpose is to abstract a lot of features by serving as an intermediate architecture that doesn’t exist in hardware. This way, many features that are hard to provide with standard compilation, like garbage collection, become practical. Being a virtual architecture also means that your code will be compatible with any machine, as long as the VM supports that machine’s architecture.

They don’t have much use besides sandboxing on Roblox. Since you run Lua in a virtual environment, there is no untracked interaction between the code and the DataModel.


What I said was very dumbed down. I never asked for an explanation. Although you are a very good explainer, not gonna lie.


Custom Lua interpreters can provide even greater sandboxing functionality if you modify them to your needs, but you can also achieve sandboxing with the normal Roblox loadstring, provided you enable it and build a sandbox based on function environments.

Also, I wonder if anyone has implemented typechecking compatibility for the existing Lua parsers/luac (Yueliang, Moonshine, etc.).

The post I made was just to give an idea of how it really works.
Like how I said here:

But your explanation was really good
