Faster Lua VM: Studio beta

As I noted above, we optimize based on real-world use cases, and we have not seen a single script so far that localized pairs. Tracking assignments through locals in this way, and especially in the case @AxisAngle posted, is trickier, so we have not done it.

Code that does localize won’t get as much of a speedup, but it also won’t be any slower than it was; you don’t have to rewrite anything.
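For reference, the pattern in question looks roughly like this (a minimal sketch; per the above, the fast path applies to the direct global pairs call in the for statement, while the localized form currently does not get it):

local t = { a = 1, b = 2, c = 3 }

-- Fast path: the pairs global is called directly inside the for statement
for k, v in pairs(t) do
	print(k, v)
end

-- Localized form: the iterator is cached in a local first; the new VM does
-- not currently track this assignment, so this loop falls back to the
-- generic (pre-existing) iteration path rather than the new fast path
local localPairs = pairs
for k, v in localPairs(t) do
	print(k, v)
end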

5 Likes

If you’ve not encountered any, that’s fair enough, I suppose. I’ve seen some scripts that did things like that (and, embarrassingly, written some in the past), so I thought I would mention it since it’s relevant to what AxisAngle posted (with regard to functionally identical code not performing nearly as fast). I’m fully aware that the new VM is being optimized based on real-world use cases and I didn’t mean to rehash that.

If it becomes an issue we can definitely fix basic localization cases, though I don’t know if we will always be able to guarantee this. We are very focused on maintaining compatibility and improving performance; for builtins, any magical optimizations like this are complicated because they need to preserve behavior in the presence of global substitution through getfenv. Adding the extra constraint that an optimization must stay active in every case where you’d expect it makes some of them impractical.
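To illustrate the getfenv constraint, here is a contrived sketch of the kind of global substitution that has to keep working; it’s not something you should write, but it’s legal, so a builtin optimization can’t be allowed to break it:

local function hypot(x, y)
	return math.sqrt(x * x + y * y)
end

-- Global substitution through the function's environment: the substituted
-- function must still be called, so the VM can't blindly assume math.sqrt
-- refers to the builtin at this call site.
getfenv(hypot).math = { sqrt = function(n) return "not a number: " .. n end }

print(hypot(3, 4)) --> not a number: 25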

Having said this, this feedback is useful. We’ll try to preserve the optimization behavior in cases like this in the future, although I don’t know to what extent that will be possible (e.g. I’m working on improving the cost of certain builtins like math.sqrt at the moment, and it might be more challenging there compared to the pairs optimizations). If we can’t do it, that won’t stop us from improving performance in non-localized cases.

3 Likes

While improving math.sqrt and the more general math.pow, will you also be improving the operator cases, such as a^b?

Joke: If not, do you plan on adding math.add?

3 Likes

We do plan to improve the performance of ^ specifically in certain common cases; right now ^0.5 is measurably slower than math.sqrt and x^2 is catastrophically slower than x*x which isn’t great.
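In the meantime, a rough way to see the gap for yourself (just a micro-benchmark sketch using the same tick() timing as elsewhere in this thread; exact numbers will vary by machine and VM build):

local x = 123.456

local t0 = tick()
for i = 1, 1e6 do
	local y = x ^ 2 -- currently much slower than the multiply below
end
print("x ^ 2", tick() - t0)

t0 = tick()
for i = 1, 1e6 do
	local y = x * x
end
print("x * x", tick() - t0)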

I’m afraid we’ve optimized + as much as we can :stuck_out_tongue: although there’s one big change that hasn’t happened yet and might only ship with v2 of the new VM later this year, which will make Vector3 math much faster by making Vector3 a basic, non-heap-allocated type.

10 Likes

When adding to tables without table.create, Lua will often allocate more memory than needed (because it allocates in powers of two). In most cases I just want to save a bit of memory.
This isn’t a bottleneck in the same way table searching is, but I’ll search my codebase and give a few examples:

  • Creating nested tables of arbitrary size for use with terrain:WriteVoxels()
  • I have a list of parts, and now to build the object later I need to store a corresponding list of cframe offsets.
  • I now want to make a clone of the parts above, so I need to create a table of identical length and move cloned parts into the table. (This seems to be my most common use case; sketched below.)

There are quite a few other cases, but most of them are only used during my game’s compile process, so they never see the live game.
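As a concrete example, the cloned-parts case above looks roughly like this; the sketch assumes something like the table.create(count) discussed below and falls back to a plain table otherwise:

local function cloneParts(parts)
	-- Preallocate when a create builtin is available; otherwise the table
	-- grows (and over-allocates in powers of two) as it's filled
	local clones = table.create and table.create(#parts) or {}
	for i = 1, #parts do
		clones[i] = parts[i]:Clone()
	end
	return clones
end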


I found this implementation was fastest, especially for long lists:

return function(list, v, p)
	-- v is the value we're searching for
	-- p is where to start the search (generally the length of the table)
	--  p is decremented as we search

	if list[p] == v then -- Check the top of the list (most common case)
		return p
	end
	p = p - 1 -- We just checked the top of the list, so we subtract 1
	
	-- This code is auto-generated
	
	while p>32 do
		local b,c,d,e,f,g,h,i,j,k,l,m,n,o,q,r,s,t,u,w,x,y,z,a1,a2,a3,a4,a5,a6,a7,a8,a9 = unpack(list,p-31,p)--$const
		if a9==v then return p elseif a8==v then return p-1 elseif a7==v then return p-2 elseif a6==v then return p-3
		elseif a5==v then return p-4 elseif a4==v then return p-5 elseif a3==v then return p-6 elseif a2==v then return p-7
		elseif a1==v then return p-8 elseif z==v then return p-9 elseif y==v then return p-10 elseif x==v then return p-11
		elseif w==v then return p-12 elseif u==v then return p-13 elseif t==v then return p-14 elseif s==v then return p-15
		elseif r==v then return p-16 elseif q==v then return p-17 elseif o==v then return p-18 elseif n==v then return p-19
		elseif m==v then return p-20 elseif l==v then return p-21 elseif k==v then return p-22 elseif j==v then return p-23
		elseif i==v then return p-24 elseif h==v then return p-25 elseif g==v then return p-26 elseif f==v then return p-27
		elseif e==v then return p-28 elseif d==v then return p-29 elseif c==v then return p-30 elseif b==v then return p-31 end
		p=p-32
	end
	if p>16 then
		local b,c,d,e,f,g,h,i,j,k,l,m,n,o,q,r = unpack(list,p-15,p)--$const
		if r==v then return p elseif q==v then return p-1 elseif o==v then return p-2 elseif n==v then return p-3
		elseif m==v then return p-4 elseif l==v then return p-5 elseif k==v then return p-6 elseif j==v then return p-7
		elseif i==v then return p-8 elseif h==v then return p-9 elseif g==v then return p-10 elseif f==v then return p-11
		elseif e==v then return p-12 elseif d==v then return p-13 elseif c==v then return p-14 elseif b==v then return p-15 end
		p=p-16
	end
	
	-- Check the remaining values
	for i = p, 1, -1 do
		if list[i] == v then
			return i
		end
	end
	
	-- Return false so the result can't be used with table.remove
	return false
end

ipairs in the new VM is most likely fast enough to out-compete the unpack batches I’m doing here, but you can’t use ipairs to iterate backwards. I try to set everything up to modify the top of the list more than the bottom to reduce the need to shift everything. I also try to always disconnect things in the reverse order I connected them to reduce the need to search the whole list.

2 Likes

Mhm. Maybe table.create(count, value) would make sense, and if value is nil you will get what you need.

Our general policy for extending Lua libraries is that if a library feature exists in later versions of Lua we’ll just take it but otherwise we will need to evaluate the pros vs cons. We’ll take a look at create/find.
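For reference, a plain-Lua stand-in for a find builtin would look something like the sketch below; the name and signature are just a guess at what such an API might be, and a real builtin could of course do better:

-- Hypothetical forward search over the array part of a table; returns the
-- first index holding `needle`, or nil if it isn't present
local function find(haystack, needle, init)
	for i = init or 1, #haystack do
		if haystack[i] == needle then
			return i
		end
	end
	return nil
end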

5 Likes

Oohh my gosh yes! I am super excited.

1 Like

@zeuxcg
Are there plans to make rawset(a, b, c), rawget(a, b), or rawequal(a, b) faster than their respective a[b] = c, a[b], and a == b counterparts in the new VM? Right now they’re quite a bit slower.

I almost don’t want this to be the case, because I don’t want to be tempted to write that everywhere.

Here’s the snippet I used to get a rough performance comparison:

local a, b = {nil}, {}

wait()
local t0 = tick()
for i = 1, 1e6 do
	a[1] = true
	a[1] = nil
end
print("newindex (normal)", tick() - t0)

wait()
local t0 = tick()
for i = 1, 1e6 do
	rawset(a, 1, true)
	rawset(a, 1, nil)
end
print("newindex (raw)", tick() - t0)


wait()
local t0 = tick()
for i = 1, 1e6 do
	if a[1] then
		assert(false)
	end
end
print("index (normal)", tick() - t0)

wait()
local t0 = tick()
for i = 1, 1e6 do
	if rawget(a, 1) then
		assert(false)
	end
end
print("index (raw)", tick() - t0)


wait()
local t0 = tick()
for i = 1, 1e6 do
	if a == b then
		assert(false)
	end
end
print("equal (normal)", tick() - t0)

wait()
local t0 = tick()
for i = 1, 1e6 do
	if rawequal(a, b) then
		assert(false)
	end
end
print("equal (raw)", tick() - t0)

Results (seconds):

newindex (normal) 0.012691259384155
newindex (raw)    0.12145662307739

index (normal)    0.015598297119141
index (raw)       0.066607236862183

equal (normal)    0.013378620147705
equal (raw)       0.058367013931274
2 Likes

While rawset/rawget/rawequal can in theory be faster, in practice:

  1. For most types, they won’t actually be faster; our VM instructions heavily specialize these operations for all common cases, e.g. tables without metatables
  2. We haven’t seen performance heavy code that relies on this so far
  3. We currently don’t have the infrastructure to optimize builtin calls like this in general (pairs/ipairs is the only example so far where we specialize builtins).

While we do plan to introduce optimizations for builtins, these functions weren’t on the radar so far; we’ll probably start with a small set and extend it over time. Even then, it’s unlikely that these raw operations will end up faster than their non-raw equivalents in common cases.

4 Likes

I love this. I turned it on and my game ran so much smoother. When I was looking at a lot of polygons my game would lag, but for some reason with this turned on it doesn’t anymore. I am working on an RTS, and with this turned on all the units move a lot smoother. 30 tanks used to get laggy on my old Mac but they are no longer laggy with this, though 30 tanks and 50 planes is still a little bit laggy. (It was extremely laggy with the old VM.) This really improves my game’s performance a ton.

Using BodyVelocity and other body movers is performance heavy. I don’t know if you can make it less heavy since it’s physics, but considering the new VM makes everything else so much better, I guess there’s some way to make body movers perform better too. Also, constantly looping through a hundred units every second so that each unit can find its closest target could have room for improvement. That’s roughly a hundred units each looping through a hundred units a second, which is about 10,000 iterations a second. Maybe that is something to improve.

1 Like

I think you may have gotten such a massive performance increase because of the optimizations they have been doing to loops and such. Glad to see it helped with your game.

1 Like

Will table.insert(tbl, value) be getting any special optimizations like for k, v in pairs(tbl) do has? Right now, tbl[#tbl + 1] = value is faster than table.insert(tbl, value), even on the new VM.

Some might argue that tbl[#tbl + 1] = value should be faster, but by the same logic, next, tbl should be faster than pairs(tbl). I think that a fast “table insert” instruction – triggered by the table.insert(tbl, value) syntax – would be a nice optimization.
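For reference, the comparison I’m describing looks roughly like this (just a sketch; exact timings will vary by machine and VM):

local N = 1e6

local t1 = {}
local t0 = tick()
for i = 1, N do
	t1[#t1 + 1] = i -- length operator + direct assignment
end
print("tbl[#tbl + 1] = value", tick() - t0)

local t2 = {}
t0 = tick()
for i = 1, N do
	table.insert(t2, i) -- library call, currently slower
end
print("table.insert(tbl, value)", tick() - t0)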

11 Likes

Suppose I have two scripts, where the first script uses getfenv/setfenv on the environment of the second script. In the new VM, will both scripts have the relevant optimizations disabled? Or perhaps only the first, where getfenv/setfenv are actually used?

Also, would this behavior change at all when using ModuleScripts?
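To make the scenario concrete, I mean something like this (purely illustrative; the function is shared through _G here just so the first script can reach the second one’s environment):

-- ScriptB: exposes a function that reads one of its globals
_G.report = function()
	print(score)
end

-- ScriptA: swaps out the environment of ScriptB's function
local original = getfenv(_G.report)
setfenv(_G.report, setmetatable({ score = 42 }, { __index = original }))
_G.report() --> 42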

1 Like

I think that’s why: when I turned off the unit script (so the loops stop), performance increased a ton, so I guess the laggiest part of my game is the looping.

1 Like

In order to reduce the traffic on this thread, please only use it to report issues with the new VM. We’re also starting to collect a list of popular places for a player-facing beta test, so if you’d like the new VM to be enabled on desktop clients and servers for your place, please let me know which place it is; we expect to enable the VM for these individual places next week, after the release.

If you have feature requests for Lua, you can use the feature request forum category.

14 Likes

Could you enable it for this place?
https://www.roblox.com/games/1291731876/Arrival-Testing

Idk if this is off-topic or not, but the new VM has been enabled in the first game: https://www.roblox.com/games/1720015937/Space-Mining-Simulator

Thought I’d let everyone know /shrug

(fflags log useful)

5 Likes

I tested the new Lua VM with an implementation of SHA-256, utilizing the upcoming bit32 library.
Using an input of 260 KB, this is how long each VM took to compute the hash:

  • Old Lua VM: 1.2972564697266 seconds
  • New Lua VM: 0.17563724517822 seconds

These results are quite astonishing! I can’t wait for this to go live!


SHA-256 Implementation

This is just a cleaned-up version of an implementation I found on GitHub Gist.

local band = bit32.band
local bnot = bit32.bnot
local bxor = bit32.bxor

local rrotate = bit32.rrotate
local rshift = bit32.rshift

local primes = 
{
	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
	0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
	0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
	0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
	0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
	0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
	0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
	0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
	0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
}

local function toHex(str)
	local result = str:gsub('.', function (char)
		return string.format("%02x", char:byte())
	end)
	
	return result
end

local function toBytes(value, length)
	local str = ""
	
	for i = 1, length do
		local rem = value % 256
		str = string.char(rem) .. str
		value = (value - rem) / 256
	end
	
	return str
end

local function readInt32(buffer, index)
	local value = 0
	
	for i = index, index + 3 do
		value = (value * 256) + string.byte(buffer, i)
	end
	
	return value
end

local function digestBlock(msg, i, hash)
	local digest = {}
	
	for j = 1, 16 do 
		digest[j] = readInt32(msg, i + (j - 1) * 4) 
	end
	
	for j = 17, 64 do
		local v = digest[j - 15]
		local s0 = bxor(rrotate(v, 7), rrotate(v, 18), rshift(v, 3))
		
		v = digest[j - 2]
		digest[j] = digest[j - 16] + s0 + digest[j - 7] + bxor(rrotate(v, 17), rrotate(v, 19), rshift(v, 10))
	end
	
	local a, b, c, d, e, f, g, h = unpack(hash)
	
	for i = 1, 64 do
		local s0 = bxor(rrotate(a, 2), rrotate(a, 13), rrotate(a, 22))
		local maj = bxor(band(a, b), band(a, c), band(b, c))
		
		local t2 = s0 + maj
		local s1 = bxor(rrotate(e, 6), rrotate(e, 11), rrotate(e, 25))
		
		local ch = bxor(band(e, f), band(bnot(e), g))
		local t1 = h + s1 + ch + primes[i] + digest[i]
		
		h, g, f, e, d, c, b, a = g, f, e, d + t1, c, b, a, t1 + t2
	end
	
	hash[1] = band(hash[1] + a)
	hash[2] = band(hash[2] + b)
	hash[3] = band(hash[3] + c)
	hash[4] = band(hash[4] + d)
	hash[5] = band(hash[5] + e)
	hash[6] = band(hash[6] + f)
	hash[7] = band(hash[7] + g)
	hash[8] = band(hash[8] + h)
end

local function sha256(msg)
	do
		local extra = 64 - ((#msg + 9) % 64)
		local len = toBytes(8 * #msg, 8)
		
		msg = msg .. '\128' .. string.rep('\0', extra) .. len
		assert(#msg % 64 == 0)
	end
	
	local hash = 
	{
		0x6a09e667,
		0xbb67ae85,
		0x3c6ef372,
		0xa54ff53a,
		0x510e527f,
		0x9b05688c,
		0x1f83d9ab,
		0x5be0cd19,	
	}
	
	for i = 1, #msg, 64 do 
		digestBlock(msg, i, hash)
	end
	
	local result = ""
	
	for i = 1, 8 do
		local value = hash[i]
		result = result .. toBytes(value, 4)
	end
	
	return toHex(result)
end

------------------------------------------------------------------

local input = string.rep(".", 26e4)

local now = tick()
local result = sha256(input)

print(tick() - now)
print(result)

------------------------------------------------------------------
15 Likes

Where did you enable the bit32 library? I want to use it for something I made a while back, but it was too clunky and slow to be of any use. Asking here in case others want to know how to enable it too.