Faster Lua VM: Studio beta

Calling objects like they’re functions is the only thing that is not going to work anymore. It wasn’t supposed to work before. Everything that worked before that isn’t calling objects like they’re functions is going to work in the new VM. If you have some example of something that worked before that isn’t calling objects like they’re functions, it’s going to work in the new VM. Nothing that worked before is going to stop working, except for calling objects like they’re functions, which never should have worked in the first place.

10 Likes


Yes, we get that xd

Despite that, we don’t know yet whether calling objects like functions is the only thing that will stop working. There is also the reduced depth limit, plus known bugs and bugs that haven’t been discovered yet.
Redoing the entire VM is a big change that is bound to introduce some bugs, and there are details that haven’t been mentioned which might break a specific piece of code. So in my opinion it’s too early to make a claim like this.

2 Likes

I think the key here is “going to work”. If something is different and it’s not mentioned in the behavior changes section, it’s a bug.

3 Likes

This is speaking completely from the implementation of vanilla Lua.

8 bytes (which isn’t the actual amount) is not a lot to worry about, and it is necessary for the behavior of Lua upvalues. At almost any scale it should be just about negligible, unless many functions are being nested or rapidly instantiated for some reason, which may indicate another problem.

Vanilla Lua defines upvalues here: https://www.lua.org/source/5.1/lobject.h.html#UpVal, and LClosures (Lua functions) keep an array of pointers to them. The UpVal struct is a bit bigger than 8 bytes, so I’m not sure the idea of this being a performance problem is founded in proper research.

Globals are also stored in a table, so they will probably end up using more memory than just the upvalue approach; Lua defines an entry in a hashtable here: https://www.lua.org/source/5.1/lobject.h.html#Node

Upvalue references are also collected as soon as they go out of scope, while a global in the environment lives for as long as that environment does. Still, all of this together doesn’t give a good reason to crunch these things down even smaller. The new codebase Roblox has might even handle it differently already, but the way vanilla Lua does it is already lightweight and out of the way.

I also personally don’t know if it’s possible to easily replicate the behavior of upvalues we have now with less memory.
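
To illustrate the lifetime difference in plain Lua 5.1 (a minimal sketch, not Roblox-specific):

```lua
-- Sketch of the lifetime difference (plain Lua 5.1 semantics).
-- `captured` is an upvalue: it stays alive only while some closure
-- references it. `sharedGlobal` is a key in the environment table and
-- lives as long as the environment itself.
local function makeCounter()
	local captured = 0          -- becomes an upvalue of the closure below
	return function()
		captured = captured + 1
		return captured
	end
end

sharedGlobal = 0                -- stored in the environment's hash part
local function bump()
	sharedGlobal = sharedGlobal + 1
	return sharedGlobal
end

local counter = makeCounter()
counter()
counter()
bump()
```

The closure keeps `captured` reachable; `sharedGlobal` stays reachable through the environment table regardless of which functions reference it.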

5 Likes

Here’s a memory usage comparison of various structures with and without the new VM. The first one is from last month, before the Studio beta was introduced. These were just done in Studio, so the live game may be different.


There are quite a few improvements, but function memory usage is the same across the board.
Looks like I was wrong about how much memory is used on top of the +8 per function.

Here’s the same tests run on the current live game using my “distributed profiler” after about 250m iterations across a few dozen clients (random people who followed me there).


Here are the performance tests with their periods relative to creating a blank function:

“func (1gu)” is a function with 1 “global upvalue” or shared upvalue, and you can see that it is about 1.1x slower than creating a function that has no upvalues.

I didn’t put very much thought into the naming of the tests, so here’s the code that the memory and performance tests both use:

Add("control", function() return false end)
Add("table (0)", function() return {} end)
Add("func (0)", function(v1) return function() end end)
Add("table (1h)", function() return {[1] = nil} end)
Add("table (1a)", function() return {nil} end)
Add("func (1u)", function(v1) return function() return v1 end end)
Add("table (2h)", function() return {[1]=nil,[2]=nil} end)
Add("table (2a)", function() return {nil,nil} end)
Add("func (2u)", function(v1,v2) return function() return v1,v2 end end)
Add("table (3h)", function() return {[1]=nil,[2]=nil,[3]=nil} end)
Add("table (3a)", function() return {nil,nil,nil} end)
Add("func (3u)", function(v1,v2,v3) return function() return v1,v2,v3 end end)
Add("table (4h)", function() return {[1]=nil,[2]=nil,[3]=nil,[4]=nil} end)
Add("table (4a)", function() return {nil,nil,nil,nil} end)
Add("func (4u)", function(v1,v2,v3,v4) return function() return v1,v2,v3,v4 end end)
Add("table (8h)", function() return {[1]=nil,[2]=nil,[3]=nil,[4]=nil,[5]=nil,[6]=nil,[7]=nil,[8]=nil} end)
Add("table (8a)", function() return {nil,nil,nil,nil,nil,nil,nil,nil} end)
Add("func (8u)", function(v1,v2,v3,v4,v5,v6,v7,v8) return function() return v1,v2,v3,v4,v5,v6,v7,v8 end end)

local v1,v2,v3,v4,v5,v6,v7,v8;

Add("func (1gu)", function() return function() return v1 end end)
Add("func (2gu)", function() return function() return v1,v2 end end)
Add("func (3gu)", function() return function() return v1,v2,v3 end end)
Add("func (4gu)", function() return function() return v1,v2,v3,v4 end end)
Add("func (8gu)", function() return function() return v1,v2,v3,v4,v5,v6,v7,v8 end end)

Add("func2 (0gu)", function()
	return (function()
		return function() end
	end)()
end)

Add("func2 (1gu)", function()
	return (function()
		return function() return v1 end
	end)()
end)

Add("func2 (2gu)", function()
	return (function()
		return function() return v1,v2 end
	end)()
end)

Add("func2 (3gu)", function()
	return (function()
		return function() return v1,v2,v3 end
	end)()
end)

Add("func2 (4gu)", function()
	return (function()
		return function() return v1,v2,v3,v4 end
	end)()
end)

Add("func2 (8gu)", function()
	return (function()
		return function() return v1,v2,v3,v4,v5,v6,v7,v8 end
	end)()
end)

The memory tests use collectgarbage("count"), and the performance code is preceded by this:

local function newTest(method)
	return function(count, tick0, tick1, spoof)
		local f = method
		
		tick0 = tick0()
		for i = 1, count do
			f()
		end
		tick1 = tick1()
		
		return tick1 - tick0
	end
end

local profiles = {}
local function Add(name, method)
	profiles[#profiles+1] = {
		Name = name;
		Test = newTest(method);
		TestControl = newTest(function() return false end);
	}
end

This is not always the case.
If a single function has sole access to 1 upvalue, the function will use (88 - 40 = 48) total bytes. According to my tests, 8 bytes will be allocated for each additional instantiated function that references that upvalue.
On the other hand, globals use 40 bytes in the hash table, plus ~length_of_string + 33 bytes for storing a global’s unique string in Lua’s string hash. I didn’t include this in the tests, but accessing a global in a function does not affect the function’s memory usage or creation speed.

If a variable is used once or twice, upvalues will use less memory; if a variable is referenced in hundreds of instantiated functions, globals will use less memory.
This doesn’t account for how much memory the global’s string constant uses internally relative to the script’s data, as I’m not sure how Roblox implements that.

For clarification, I’m trying to suggest features that will make my game run faster without relying on setfenv. As far as I know, no other Roblox game uses a generalized data simplification and compile system like mine does, so my use-case is very unique. This post details my setfenv use-case:
https://devforum.roblox.com/t/do-you-use-setfenv-and-if-so-why/236325/28?u=tomarty

Here’s an interesting paper on the subject of closures:

6 Likes

This only really clarifies the actual sizes of things, such as an upvalue in Roblox being stored in approximately 44 bytes. But that raises the question: what kind of device are you targeting where this is a huge problem and isn’t being caused by something else, such as a decision in the programming paradigm?

So, to be clear, the only things “breaking” are the incorrect call syntaxes, i.e. the “hack”-style syntax?

My goal is to do as much as I possibly can with the Roblox engine. When something in Lua is slow, it means I can do less of that thing. I want my game to have thousands of trees and hundreds of characters. If an API like raycasting is made faster, it means I’ll be able to run a few dozen more characters, and if traversing/creating Lua structures is made faster, my LOD system will be able to run a few hundred more trees before the game lags. Of course bottlenecks are often not Lua-side, but Lua is the only variable I have direct control over, so I try to improve performance as much as I can.

At this point it seems like you are just trolling and purposely asking the same question which has been answered clearly multiple times. If you are still confused I suggest you reread the thread again. Yes, the only difference with namecall will be not being able to “call” instance methods and normal scripts will work fine. If this doesn’t answer your question I suggest you reread.

3 Likes

As pretty much everyone before has explained, twice or thrice now, you should not worry about anything that is documented on the Developer Hub or in the official Lua 5.1 documentation. If there is anything additional we should take care of, it will be mentioned here.

These won’t break:

game:GetService("Players") --This is the official way to get a service and it will work
workspace:FindPartsInRegion3(...) --This is how you would normally get the parts in a region
game:GetService("Players"):GetPlayerFromCharacter(char) --Here just as an example to clear things up even further

These are the alternative incorrect variants of the above that will no longer work and were only possible because of a bug:

game("GetService","Players") --not gonna work
workspace("FindPartsInRegion3",...) --still not working
game("GetService","Players")("GetPlayerFromCharacter",char) --definitely not correct

4 Likes

Alrighty, sorry if it seemed as if I was trolling; it wasn’t my intent.

3 Likes

I recall once having performance issues with reading the mobile gyro/accelerometer. Would this version of Lua help with that sort of thing?

I cannot test this right now as I’m at work. :disappointed:

Does this also mean exploiters would have to completely rework their script injection tools to work with the new VM?

If so double :+1:

I’ll add the test place to the list! Note that we aren’t fully ready to start doing place specific testing - I’ll need to check what the status is, I’ll ping you privately when we’re ready.

1 Like

It’s not very difficult but I don’t think we should. Idiomatic iteration is using pairs/ipairs, and we haven’t seen cases where a call to pairs affects performance enough to care; we are planning to optimize calls to certain builtin functions in other ways.

We wouldn’t expose hashLength like that. I’m not sure what table.find is supposed to do here?

I’m not sure what the instructions would do in this case. Keep in mind that all “special” paths for builtins have to painstakingly handle the setfenv/getfenv case - what if you replace setmetatable with setfenv?

Constant upvalues of primitive types are folded into the functions that need them (and stop being upvalues). We don’t currently optimize locals of complex types such as setmetatable in your example. In general we expect local caching to not be as necessary, and probably won’t go out of our way to make local caching faster.
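
To illustrate the distinction (a sketch; `LIMIT`, `smt`, and `make` are made-up names, and only the comments describe the new VM’s folding behavior):

```lua
-- Two kinds of captured locals (illustrative names):
local LIMIT = 100            -- primitive constant: the new VM folds this
                             -- into closures that capture it, so it stops
                             -- being an upvalue
local smt = setmetatable     -- complex type (a function value): kept as an
                             -- ordinary local/upvalue, not folded

local function make(t)
	return smt(t, {__index = function() return LIMIT end})
end

local obj = make({})
```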

This requires a different mechanism from setfenv. Would injectfenv I noted earlier in this thread work for your usecase?

We plan to optimize the use of upvalues in certain cases but I’m not sure it would significantly impact the memory use.

We have this optimization in our TODO list but it’s very complex to maintain semantics perfectly especially in presence of setfenv - you can mutate the environment of the created function object after the fact, which is how you can observe the difference…
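
A minimal sketch of why that’s observable, in standard Lua 5.1 (`setfenv` is guarded so the snippet also runs where it doesn’t exist):

```lua
-- A closure's global reads go through its environment, which can be
-- swapped after the closure is created - so baking a "global" into the
-- function at creation time would change observable behavior.
x = 1
local function make()
	return function() return x end
end

local f = make()
-- here f() == 1, reading x from the current environment

if setfenv then              -- Lua 5.1 / Roblox only
	setfenv(f, {x = 2})      -- swap the environment after the fact
	-- now f() == 2
end
```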

3 Likes

Both reported bugs - 0/0 misbehaving and very large scripts taking a lot of time to compile - have been fixed in Studio 392 that just went live. Please let us know if you see any other problems with behavior or performance.

We’re getting ready to try this on live games, I expect that we can enable this for some games on server / on desktop next week. Mobile might take a bit longer since we need to make sure all fixes have fully propagated.

1 Like

Some optimization for using next like that would be appreciated, just for the sake of legacy code. It was a stylistic choice before, and with the new VM it actively punishes people who made that decision, despite it not mattering when they made it.

1 Like

Not really. They just have to change the bytecode conversion format to match with the new VM, which some of them have already finished doing.

Code with next doesn’t run any slower than it used to though, it just doesn’t run as fast as pairs/ipairs.

In general our optimization process relies on identifying things we can improve in code that’s part of our benchmark suite (which is a collection of standalone Lua benchmarks, Lua code that we wrote internally as well as Lua code that some members of the community wrote). This is where the focus is, so we’re more likely to improve something that we see affecting the performance in one or multiple representative tests and less likely to improve something that’s a more niche usecase.

We might implement this specific optimization at some point, but it just isn’t a priority.

2 Likes

Understandable, I’ll make the conversion to pairs / ipairs once the new VM is live and my game’s setfenv use-case is sorted out.


Alright, hashLength is confusing and less useful anyway. I included it because of its similarity to the lua_createtable function in the C API.

Creating and populating a table’s array can be more than 2x slower if the table isn’t preallocated first. Here are a few hours of data my profiler collected on the live game today. The results were gathered from a few dozen players/clients that followed me in game:
The tables are created using the specified method, then values are added to the table up to its length. The loop that sets values in the table is also in the control test so that we can better compare reallocation performance.

It would be possible for a developer to try to create an unreasonably huge table with my proposed API, but this is also true for string.rep. In theory string.rep could use a special string representation for massive results, but for table creation the size could be easily capped at some high arbitrary value without complications.

I’ve implemented some crazy functions for preallocating arbitrary-length tables. This auto-generated one does a binary search for the length while omitting lengths that can’t be expressed using standard Lua’s “floating point byte” representation, which it uses to store table sizes in the bytecode. If the length is too high it resorts to unpack, staying mindful of the LUAI_MAXCSTACK limitation and subtracting 3 from it to account for the 3 unpack arguments. The source generator lets me fine-tune how many checks it takes to reach more common values, to make sure creating small tables is still fast.

I drag an implementation like the one above into most of my projects, and I’d much prefer to use an official API and get a slight performance boost while also reducing some dependency boilerplate. It may be simpler to just type {}, but writing code using table.new(size) can improve readability with respect to how the table will be used.
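
To make the idea concrete, here’s a plain-Lua stand-in for the proposed table.new(size) (the real API would preallocate in C; `table_new` and the size cap of 4 are just for illustration):

```lua
-- Plain-Lua stand-in for a hypothetical table.new(size), sizes 0-4 only.
-- A table constructor with n listed slots sizes the array part up front,
-- so filling those slots later causes no reallocation.
local constructors = {
	[0] = function() return {} end,
	[1] = function() return {nil} end,
	[2] = function() return {nil, nil} end,
	[3] = function() return {nil, nil, nil} end,
	[4] = function() return {nil, nil, nil, nil} end,
}

local function table_new(size)
	return constructors[math.min(size, 4)]()
end

local t = table_new(4)
for i = 1, 4 do
	t[i] = i * i   -- fills preallocated slots, no rehash
end
```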


Oops I forgot to add a value argument to my example!
This API would be similar to JavaScript’s indexOf and lastIndexOf functions.
table.find could work like this:

local function table_find(array, value, indexStart, indexEnd, step)
    for i = indexStart, indexEnd, step do
        if rawequal(rawget(array, i), value) then
            return i
        end
    end
end
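
For example (repeating the definition so the snippet stands alone):

```lua
local function table_find(array, value, indexStart, indexEnd, step)
	for i = indexStart, indexEnd, step do
		if rawequal(rawget(array, i), value) then
			return i
		end
	end
end

local list = {"a", "b", "c", "b"}
local firstB = table_find(list, "b", 1, #list, 1)    -- forward: first match
local lastB  = table_find(list, "b", #list, 1, -1)   -- backward: last match
```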

It’s possible someone may try to use this API to find an equal-but-not-rawequal value in a list (like a Vector3, for example), but I think rawequal would be sufficient for the majority of use-cases.

It may be better to omit the ‘step’ argument and instead include a separate API for finding the first object starting from the end of the array, but a step argument would support a wide variety of use-cases. In my use-cases, the value is most commonly near the top of the array.

My use-cases for a `table.find` API

Stack-based object replication

With respect to the server, I’ve found that the most memory-efficient way to replicate changes to thousands of custom Lua objects is to store the currently replicated objects in an array that is mirrored by the client; When an object is changed or removed, the server simply needs to send the corresponding positions in the array. The performance problems surface when this approach is used to replicate thousands of interactive objects like foliage.

My level-of-detail implementation

The worst performance hit I’ve seen is when I disconnect objects like foliage from my LoD system.
Objects like trees are stored in lists based on their distance to a camera or physics observer. When the observer’s position changes by 2^n studs, it refreshes the corresponding list. Objects gradually move between lists when the camera moves, so that they know when to update their model’s quality level. This is optimized to check the currently updating object first so it’s generally quite fast, but these lists can have hundreds of objects in them, and searching for objects when they need to be removed can cause frame spikes.
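
A rough sketch of the bucketing idea, reconstructed from the description above (not the actual implementation; `bucketIndex` is a made-up name):

```lua
-- Objects grouped by floor(log2(distance)): a bucket's list only needs a
-- refresh when the observer moves by roughly 2^n studs.
local function bucketIndex(distance)
	if distance <= 1 then
		return 0
	end
	return math.floor(math.log(distance) / math.log(2))
end

local near = bucketIndex(10)      -- small bucket, refreshed often
local far  = bucketIndex(1000)    -- large bucket, refreshed rarely
```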

I share some of the implementation details and design philosophy here:
https://devforum.roblox.com/t/how-would-one-go-about-making-a-lod-system/29511/9?u=tomarty

Custom signals/events

My game has quite a few signal/event implementations to serve various use-cases. The client creates tens of thousands of signals like this (the server even more) and most of them are implemented using a list of functions. The performance issues arise when disconnecting methods from these custom events. In at least 80% of cases, the method is at the top of the list and is trivial to remove; In other cases a signal may be used many times, and the game can spend a lot of time iterating backwards through long arrays trying to find the method that needs to be disconnected.

A different approach with a fast disconnection time may be to use a hash table where "lookup[method] = true".

  • Uses 40 bytes per method (compared to 16 for arrays.) This adds up in a Lua-heavy game.
  • Firing the event is slower than an array. (Although the new VM may improve this.)
  • Not ideal for the bulk of cases that only have 1 function connected.
  • Unpredictable call order.
  • Has problems when firing the event and a connected function disconnects another connected function that is yet to be called. (This is a very important consideration when creating a custom event)
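
That last point can be sketched for the array-backed approach; this is an illustration with one possible mitigation (snapshot plus membership re-check), not my game’s actual implementation:

```lua
-- Minimal array-backed signal showing the disconnect-during-fire hazard.
local Signal = {}
Signal.__index = Signal

function Signal.new()
	return setmetatable({handlers = {}}, Signal)
end

function Signal:Connect(fn)
	local h = self.handlers
	h[#h + 1] = fn
end

function Signal:Disconnect(fn)
	local h = self.handlers
	for i = #h, 1, -1 do        -- the linear search discussed above
		if h[i] == fn then
			table.remove(h, i)
			return
		end
	end
end

function Signal:Fire(...)
	-- Snapshot first, then re-check membership, so a handler that
	-- disconnects another pending handler prevents it from running.
	local h = self.handlers
	local snapshot = {}
	for i = 1, #h do snapshot[i] = h[i] end
	for i = 1, #snapshot do
		local fn = snapshot[i]
		for j = 1, #h do
			if h[j] == fn then
				fn(...)
				break
			end
		end
	end
end
```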

A simpler approach may be to instead simply use BindableEvents.

  • Bindables do not support tables that use the __call metamethod.
  • “Connection” objects are always created. (Custom events can have :RawConnect(foo) and :RawDisconnect(foo) methods to avoid the need for creating connection objects)
  • Bindables serialize their arguments. This makes them great for facilitating safe script-to-script interactions, but makes them unusable for mass use in a game.

A recurring theme among my use-cases is that I need to remove objects from a list as part of a disconnection/cleanup process, and this can cause a frame spike when many things need to disconnect at once (like when a player closes a menu.) These use-cases may benefit slightly more from an API that simultaneously table.remove's the index that it finds, but my main performance concern is with potentially-expensive table searches that need to be done.

My fastest array search implementation uses unpack to test up to 32 values at once starting at the top of the array (array searching should be so much faster than this.) I use my function in 100+ scripts, and I’m sure other devs have uses for this API too.
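
A hedged reconstruction of that unpack-based search (8 values per step here for brevity rather than 32; `findFromTop` is a made-up name):

```lua
local unpack = unpack or table.unpack  -- Lua 5.1 / 5.3 portability

-- Test 8 slots per iteration from the top of the array, where the value
-- is most likely to be; fall back to a plain loop for the remainder.
local function findFromTop(array, value)
	local i = #array
	while i >= 8 do
		local a, b, c, d, e, f, g, h = unpack(array, i - 7, i)
		if h == value then return i end
		if g == value then return i - 1 end
		if f == value then return i - 2 end
		if e == value then return i - 3 end
		if d == value then return i - 4 end
		if c == value then return i - 5 end
		if b == value then return i - 6 end
		if a == value then return i - 7 end
		i = i - 8
	end
	for j = i, 1, -1 do
		if array[j] == value then return j end
	end
end
```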

The Roblox implementation of Lua may be slightly different, but the standard setmetatable function does type-checking which is redundant if it’s expressed explicitly in the source:

static int luaB_setmetatable (lua_State *L) {
  int t = lua_type(L, 2);
  luaL_checktype(L, 1, LUA_TTABLE);
  luaL_argcheck(L, t == LUA_TNIL || t == LUA_TTABLE, 2,
                    "nil or table expected");
  if (luaL_getmetafield(L, 1, "__metatable"))
    luaL_error(L, "cannot change a protected metatable");
  lua_settop(L, 2);
  lua_setmetatable(L, 1);
  return 1;
}

This code does redundant tests too. It also tests whether the object is a userdata, which isn’t possible through the current API (even with newproxy).

LUA_API int lua_setmetatable (lua_State *L, int objindex) {
  TValue *obj;
  Table *mt;
  lua_lock(L);
  api_checknelems(L, 1);
  obj = index2adr(L, objindex);
  api_checkvalidindex(L, obj);
  if (ttisnil(L->top - 1))
    mt = NULL;
  else {
    api_check(L, ttistable(L->top - 1));
    mt = hvalue(L->top - 1);
  }
  switch (ttype(obj)) {
    case LUA_TTABLE: {
      hvalue(obj)->metatable = mt;
      if (mt)
        luaC_objbarriert(L, hvalue(obj), mt);
      break;
    }
    case LUA_TUSERDATA: {
      uvalue(obj)->metatable = mt;
      if (mt)
        luaC_objbarrier(L, rawuvalue(obj), mt);
      break;
    }
    default: {
      G(L)->mt[ttype(obj)] = mt;
      break;
    }
  }
  L->top--;
  lua_unlock(L);
  return 1;
}

I’m not sure if this table assertion is debug-only.

#define hvalue(o)       check_exp(ttistable(o), &(o)->value.gc->h)

In fact quite a few library functions could be optimized based on what the compiler knows about the inputs, although at this point we may be compromising on readability and complexity within the Lua source.
I may be entering micro-optimization territory with some of my suggestions, but my game creates a lot of new tables that have metatables, and even a 10% improvement means I can create up to 10% more tables like this before causing a frame spike.


I don’t think so. My use-case needs a fast way to access a shared table where my game’s data is stored. I use __index with the environment so that the data initializes once it’s accessed in a script (so it’s lazy.) The game needs to access this data as fast as possible given a numeric referenceId; Decoding this reference and initializing the data, then caching the value in the environment results in very fast access.
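
A minimal sketch of that lazy pattern (`decode`, `gameEnv`, and the ids here are hypothetical stand-ins; my real system decodes referenceIds into modules, animations, and so on):

```lua
-- Stand-in for expensive initialization (requiring ModuleScripts,
-- allocating tables, etc.)
local function decode(referenceId)
	return {id = referenceId}
end

-- Lazy, environment-style cache: the first access decodes and caches;
-- later accesses are plain table hits.
local gameEnv = setmetatable({}, {
	__index = function(env, referenceId)
		local value = decode(referenceId)
		rawset(env, referenceId, value)  -- cache in the table itself
		return value
	end,
})

local first = gameEnv[42]  -- decoded and cached on first access
```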

If I used injectfenv, I would need to initialize the data before the script needs it, which would cause a frame spike whenever the client receives a batch of uninitialized game data. To clarify, data initialization in my game may involve expensive operations like requiring ModuleScripts or allocating tables; this enumerated game data is sent on a need-to-know basis, and is comprised of anything from custom animations, to ModuleScripts, to language data.

setfenv is also only used once the game is compiled/simplified, which I do right before I publish. Thus it is safe for all ModuleScripts in my game to share the same environment, because the globals in my game’s environment are all auto-generated and decoded to a numeric referenceId. It would be very easy for me to update the game if setfenv support were removed, as the game/source simplification compile process is automated and usually takes less than 15 seconds.

I detail my use case a bit more in this post:
https://devforum.roblox.com/t/do-you-use-setfenv-and-if-so-why/236325/28?u=tomarty

It’s not obvious what API I would need, considering my case is ridiculously optimized for my massive codebase, but I think something similar to Lua 5.2’s environment implementation would facilitate what I need.

Perhaps the global environment could be treated like an upvalue in the VM (and implemented alongside fast GETTABUP/SETTABUP opcodes like in Lua 5.2), so that I could localize my game’s data system, reference its upvalue directly, and get the same performance as accessing the global environment. Functions that don’t access their global environment could omit this upvalue to save memory, which would generally improve function memory usage and creation performance. This may result in unexpected getfenv behavior when the function has no environment, however, and setfenv may need to reallocate the function with a new upvalue so it can reference a different environment (I’m not sure if this is possible).

The only time I ever use getfenv on a function is when I’m debugging a complicated issue and need to know what script created an arbitrary function, using print(getfenv(foo).script). Regardless, I would be very open to this change even if it meant the removal of getfenv/setfenv support, like in Lua 5.2.

5 Likes