Faster Lua VM: Studio beta

The issue with table.create in general is that it has odd semantics - it creates a table that has a length of 0 with internal storage up to size elements. What are the cases when you need a function like this?

Re: table.find - it does seem odd that Lua doesn’t have a builtin function like this, although it would probably need a simpler interface. I’m curious, what does your current implementation look like? I’d expect that in the new VM, something like for i,v in ipairs(tab) do if v == target then return i end end is going to be faster than anything else you can write in Lua yourself. A C version can be faster of course.

Re: setmetatable, a lot of the checks you mentioned are basically free compared to some other bits, and the bottleneck tends to be object creation, not metatable setup. We actually have a micro-benchmark for object creation in our test suite that measures the execution time of Number.new call:

local Number = {}
Number.__index = Number

function Number.new(v)
    local self = {
        value = v
    }
    setmetatable(self, Number)
    return self
end

and the bulk of the cost is the garbage collection, followed by object allocation, followed by the check for protected metatables (TIL), and the type checks are only a few % of the total cost.

6 Likes

I do not understand

local t = {}
for i = 1, 10^6 do
    t[math.random()] = math.random()
end

local t0 = tick()
for i, v in pairs(t) do
end
print("a thing:", tick() - t0)

local t0 = tick()
local a, b, c = pairs(t)
for i, v in a, b, c do
end
print("same exact thing:", tick() - t0)

--> a thing: 0.0049343109130859
--> same exact thing: 0.071928024291992

…A good captain goes down with his ship!

4 Likes

The new VM detects pairs in a for loop and generates unique instructions. It’s functionally the same code but it compiles different. See:

4 Likes

It breaks the standard order of operations in Lua. I don’t like this. Thank you for explaining.

I think what I don’t understand is why it would be so much harder to detect if the given iterator function is pure Lua (slow) or not. Examples are pairs and string.gmatch returning next and a C iterator function respectively. Neither are pure Lua (or at least they don’t have to be) and could be optimized.

Additionally, and this is irrelevant but I have to say it anyway, I am not particularly fond of pairs in general as it muddies the meaning of the iterator for loop. Case in point: there are like 5 people who actually know how to use for ... in f, t, i do because pairs, while very idiomatic, hides the meaning.

1 Like

I don’t particularly like it either, especially since it’s an easily broken optimization (at least at the moment). To give you an idea…

local t = {}
for i = 1, 1e6 do
	t[i+1] = i
end

local s0 = tick()
for k, v in pairs(t) do
	
end
print("pairs: ", tick()-s0)

local pairs = pairs

local s1 = tick()
for k, v in pairs(t) do
	
end
print("local pairs: ", tick()-s1)

pairs: 0.0083878040313721
local pairs: 0.043699741363525

Simply localizing pairs makes the code slower by an order of magnitude. You’ll probably want to do quite a bit of refactoring before releasing a game that’s running under the new VM because of things like this.

2 Likes

As I noted above, we optimize based on real-world usecases and we have not seen a single script so far that localized pairs. Tracking assignments through locals in this way, and especially in case @AxisAngle posted, is more tricky so we have not done it.

Code that does localize will not be as much faster but also won’t be any slower than it was, you don’t have to rewrite anything.

5 Likes

If you’ve not encountered any that’s fair enough, I suppose. I’ve seen some scripts that did things like that (and, embarrassingly, written some in the past) so I thought I would mention it since it’s relevant to what AxisAngle posted (wrt to functionally identical code not performing nearly as fast). I’m fully aware that the new VM is being optimized based on real-world use cases and I didn’t mean to rehash that.

If it becomes an issue we can definitely fix basic localization cases - I don’t know if we will be able to always guarantee this. We are very focused on maintaining compatibility and improving performance; for builtins any magical optimizations like this are complicated because they need to preserve behavior in presence of global substitution through getfenv. Having an extra constraint on the optimizations being active in all cases when you’d expect makes some of them impractical.

Having said this, this feedback is useful - we’ll try to preserve the optimization behavior in cases like this in the future although I don’t know to what extent it will be possible (e.g. I’m working on improving the cost of certain builtins like math.sqrt atm and it might be more challenging there compared to the pairs optimizations); if we can’t do it this won’t stop us from improving performance in non-localized cases.

3 Likes

While improving math.sqrt and the more general math.pow, will you also be improving the operator cases as well, such as a^b?

Joke: If not, do you plan on adding math.add?

3 Likes

We do plan to improve the performance of ^ specifically in certain common cases; right now ^0.5 is measurably slower than math.sqrt and x^2 is catastrophically slower than x*x which isn’t great.

I’m afraid we’ve optimized + as much as we can :stuck_out_tongue: although there’s one big change that hasn’t happened yet and might only ship with v2 of new VM later this year, which will make Vector3 math much faster by making Vector3 a basic non-heap allocated type.

10 Likes

When adding to tables without table.create, Lua will often allocate more memory than needed (because it allocates in powers of two.) In most cases I just want to save a bit of memory.
This isn’t a bottleneck in the same way table searching is, but I’ll search my codebase and give a few examples:

  • Creating nested tables of arbitrary size for use with terrain:WriteVoxels()
  • I have a list of parts, and now to build the object later I need to store a corresponding list of cframe offsets.
  • I now want to make a clone of the parts above, so I need to create a table of identical length and move cloned parts into the table. (This seems to be my most common use-case)

There are quite a few other cases, but most of them are only used during my game’s compile process, so they never see the live game.


I found this implementation was fastest, especially for long lists:

return function(list, v, p)
	-- v is the value we're searching for
	-- p is where to start the search (generally the length of the table)
	--  p is decremented as we search

	if list[p] == v then -- Check the top of the list (most common case)
		return p
	end
	p = p - 1 -- We just checked the top of the list, so we subtract 1
	
	-- This code is auto-generated
	
	while p>32 do
		local b,c,d,e,f,g,h,i,j,k,l,m,n,o,q,r,s,t,u,w,x,y,z,a1,a2,a3,a4,a5,a6,a7,a8,a9 = unpack(list,p-31,p)--$const
		if a9==v then return p elseif a8==v then return p-1 elseif a7==v then return p-2 elseif a6==v then return p-3
		elseif a5==v then return p-4 elseif a4==v then return p-5 elseif a3==v then return p-6 elseif a2==v then return p-7
		elseif a1==v then return p-8 elseif z==v then return p-9 elseif y==v then return p-10 elseif x==v then return p-11
		elseif w==v then return p-12 elseif u==v then return p-13 elseif t==v then return p-14 elseif s==v then return p-15
		elseif r==v then return p-16 elseif q==v then return p-17 elseif o==v then return p-18 elseif n==v then return p-19
		elseif m==v then return p-20 elseif l==v then return p-21 elseif k==v then return p-22 elseif j==v then return p-23
		elseif i==v then return p-24 elseif h==v then return p-25 elseif g==v then return p-26 elseif f==v then return p-27
		elseif e==v then return p-28 elseif d==v then return p-29 elseif c==v then return p-30 elseif b==v then return p-31 end
		p=p-32
	end
	if p>16 then
		local b,c,d,e,f,g,h,i,j,k,l,m,n,o,q,r = unpack(list,p-15,p)--$const
		if r==v then return p elseif q==v then return p-1 elseif o==v then return p-2 elseif n==v then return p-3
		elseif m==v then return p-4 elseif l==v then return p-5 elseif k==v then return p-6 elseif j==v then return p-7
		elseif i==v then return p-8 elseif h==v then return p-9 elseif g==v then return p-10 elseif f==v then return p-11
		elseif e==v then return p-12 elseif d==v then return p-13 elseif c==v then return p-14 elseif b==v then return p-15 end
		p=p-16
	end
	
	-- Check the remaining values
	for i = p, 1, -1 do
		if list[i] == v then
			return i
		end
	end
	
	-- Return false so the result can't be used with table.remove
	return false
end

ipairs in the new VM is most likely fast enough to out-compete the unpack batches I’m doing here, but you can’t use ipairs to iterate backwards. I try to set everything up to modify the top of the list more than the bottom to reduce the need to shift everything. I also try to always disconnect things in the reverse order I connected them to reduce the need to search the whole list.

2 Likes

Mhm. Maybe table.create(count, value) would make sense, and if value is nil you will get what you need.

Our general policy for extending Lua libraries is that if a library feature exists in later versions of Lua we’ll just take it but otherwise we will need to evaluate the pros vs cons. We’ll take a look at create/find.

5 Likes

Oohh my gosh yes! I am super excited.

1 Like

@zeuxcg
Are there plans to make rawset(a, b, c), rawget(a, b), or rawequal(a, b) faster than their respective a[b] = c, a[b], and a == b counterparts in the new VM? Right now it’s quite a bit slower.

I almost don’t want this to be the case, because I don’t want to be tempted to write that everywhere.

Here’s the snippet I used to get a rough performance comparison:

local a, b = {nil}, {}

wait()
local t0 = tick()
for i = 1, 1e6 do
	a[1] = true
	a[1] = nil
end
print("newindex (normal)", tick() - t0)

wait()
local t0 = tick()
for i = 1, 1e6 do
	rawset(a, 1, true)
	rawset(a, 1, nil)
end
print("newindex (raw)", tick() - t0)


wait()
local t0 = tick()
for i = 1, 1e6 do
	if a[1] then
		assert(false)
	end
end
print("index (normal)", tick() - t0)

wait()
local t0 = tick()
for i = 1, 1e6 do
	if rawget(a, 1) then
		assert(false)
	end
end
print("index (raw)", tick() - t0)


wait()
local t0 = tick()
for i = 1, 1e6 do
	if a == b then
		assert(false)
	end
end
print("equal (normal)", tick() - t0)

wait()
local t0 = tick()
for i = 1, 1e6 do
	if rawequal(a, b) then
		assert(false)
	end
end
print("equal (raw)", tick() - t0)
newindex (normal) 0.012691259384155
newindex (raw) 0.12145662307739

index (normal) 0.015598297119141
index (raw) 0.066607236862183

equal (normal) 0.013378620147705
equal (raw) 0.058367013931274
2 Likes

While rawset/rawget/rawequal can in theory be faster, in practice:

  1. For most types, they won’t actually be faster; our instructions in VM heavily specialize these operations for all common cases, e.g. tables without metatables
  2. We haven’t seen performance heavy code that relies on this so far
  3. We currently don’t have the infrastructure to optimize builtin calls like this in general (pairs/ipairs is the only example so far where we specialize builtins).

While we do plan to introduce optimizations for builtins, these functions weren’t on the radar so far; we’ll probably start with a small set and extend it over time. It’s unlikely that these operations will be faster than their non-raw equivalents in common cases in the future as well.

4 Likes

I love this. I turned it on and my game ran so much smoother. When I was looking at a lot of polygons my game would lag but for some reason with this turned on it doesn’t anymore. I am working on a RTS, and with this turned on all the units move a lot smoother. Like 30 tanks got laggy on my old mac but they are no longer laggy with this, but 30 tanks and 50 planes is a little bit laggy. (It was extremely laggy with the old VM.) This really improves my games performance a ton.

Using body velocity and body movers is performance heavy, idk if you can make it less heavy since its physics but considering the new vm makes it so much better, I guess theres some way to make body movers perform better. Also constantly looping through a hundred units every second so that a unit can find its closest target could have room for improvement. Thats pretty much a hundred units each looping through a hundred units a second which is like 10000 loops a second? Maybe that is something to improve.

1 Like

I think you may have gotten such a massive performance increase because of the optimizations they have been doing to loops and such. Glad to see it helped with your game.

1 Like

Will table.insert(tbl, value) be getting any special optimizations like for k, v in pairs(tbl) do has? Right now, tbl[#tbl + 1] = value is faster than table.insert(tbl, value), even on the new VM.

Some might argue that tbl[#tbl + 1] = value should be faster, but by the same logic, next, tbl should be faster than pairs(tbl). I think that a fast “table insert” instruction – triggered by the table.insert(tbl, value) syntax – would be a nice optimization.

11 Likes

Suppose I have two scripts, where the first script uses getfenv/setfenv on the environment of the second script. In the new VM, will both scripts have the relevant optimizations disabled? Or perhaps only the first, where getfenv/setfenv are actually used?

Also, would this behavior change at all when using ModuleScripts?

1 Like

I think that’s why because when I turned off the unit script meaning the loops stop, performance increased a ton so I guess that most laggy part of my game is the looping.

1 Like