Lua Signal Class Comparison & Optimal `GoodSignal` Class

This is incorrect, and actually the opposite on Roblox! Using an array as the backing for your object instead of a map of fields is not only wildly unreadable but also slower!

This is because the Roblox Luau VM pre-caches the slot in the hash part that a field access will need, so most field accesses effectively cost just a single lookup. A numeric array lookup is actually slower, because the VM first has to check whether the index is a number or some other type to decide whether to index the array part at all, whereas a field lookup immediately knows to look in the hash part.
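To make the two layouts concrete, here's a minimal plain-Lua sketch (the names are made up, not taken from GoodSignal) of a field-backed object versus an array-backed one:

```lua
-- Field-backed connection object: each access is a constant field
-- lookup that the Luau VM can cache per call site.
local fieldConn = {
	signal = "someSignal",
	fn = print,
	connected = true,
}
print(fieldConn.connected) -- field lookup straight into the hash part

-- Array-backed equivalent: the reader has to remember that [3] means
-- "connected", and the VM must check the index's type at runtime
-- before it knows to look in the array part.
local arrayConn = { "someSignal", print, true }
print(arrayConn[3]) -- numeric lookup into the array part
```

Both do the same thing; the point of the thread is that on Luau the field version is both more readable and (contrary to vanilla-Lua intuition) not slower.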


I was wondering whether it's possible to add :ConnectParallel properly? I tried, and I think it worked, but I'm not sure. Trying to use unsafe functions gave me an error, but I honestly can't tell if it's even running in parallel.

I tried it like this:

Just want to clarify a few things:

This is because the Roblox Luau VM pre-caches the index into the hashmap that will be needed for field accesses,

    • This is nothing special, actually. In both Luau and Lua, string hashes are computed and cached when the string is interned, so a hash-map lookup doesn't need to recalculate the hash for a given key.

because it needs to check if the type of the index is a number or different type to tell whether to index the array part or not

    • Strictly speaking of the implementation of tables in Lua, an array will be considered a dictionary by Lua if it has explicit keys like the ones below:
local t = {[1] = 1, [2] = 2} -- Lua thinks this is a dictionary, whereas
-- in reality, it's an array.
    • The type of the index has to be checked in both array and hashmap lookups, not only in array lookups.


What I said is mostly relevant to Lua; however, we don't know much about Luau's internals. It's safe to say that the first point applies to both, but the second and third may not.


Hiya, could you elaborate on what you mean by this? Running this benchmark

local array = {}
local dict = {}

for i = 1, 1e5 do
	table.insert(array, i)
	dict['a' .. i] = i
end

return {
	ParameterGenerator = function()
		return
	end,
	Functions = {
		array = function()
			for _ = 1, 1e5 do
				local _ = array[1e5]
			end
		end,
		dict = function()
			for _ = 1, 1e5 do
				local _ = dict.a1
			end
		end
	}
}

results in this:


Your benchmark needs some work. There are a few factors confounding the result:

  1. Your example is contrived because of the large number of fields in the object. No object in real code actually has 100,000 unique fields, and this matters because field access is optimized for a small number of fields, so you're hitting the worst case of field access time when you have that many.

  2. You're not caching the upvalue reference, and accessing that upvalue is going to be more than 50% of the total work the loop iterations are doing, so it will pollute the results with a lot of noise.

  3. You need to do way more work in each iteration of the loop. Simply running the loop itself is taking a lot of the time right now, again skewing the measurement.
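Putting those three fixes together, a corrected micro-benchmark might look something like this plain-Lua sketch (using os.clock rather than the forum benchmarking module; sizes and names are illustrative, not the poster's actual code):

```lua
-- A small 4-field struct vs a 4-element array: the table is cached in
-- a local (not re-read through an upvalue each iteration), the field
-- count is realistic, and each iteration does several accesses so the
-- loop overhead doesn't dominate the measurement.
local function benchDict()
	local item = { a = 1, b = 2, c = 3, d = 4 }
	local sum = 0
	local start = os.clock()
	for _ = 1, 1e6 do
		sum = sum + item.a + item.b + item.c + item.d
	end
	return os.clock() - start, sum
end

local function benchArray()
	local item = { 1, 2, 3, 4 }
	local sum = 0
	local start = os.clock()
	for _ = 1, 1e6 do
		sum = sum + item[1] + item[2] + item[3] + item[4]
	end
	return os.clock() - start, sum
end

print("dict:", benchDict())
print("array:", benchArray())
```

The sum is returned so the optimizer can't discard the loop body; which variant wins will depend on whether you run this under vanilla Lua or the Luau VM, which is exactly the point of the discussion.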

Here's my version of your benchmark, which shows a dict-based 4-element vector (4 because my connection class has 4 fields) being significantly faster than an array-based one (with a transposed version of the access to show that sequential vs. non-sequential access is not a relevant factor at this small struct size):

Though I should also note that with such a small object size and an operation as fast as field access, the results are extremely sensitive to the details of how field access works, so you have to be pretty careful with how you benchmark.


Just to extinguish any doubt, I converted the module over to use arrays internally instead of fields and used the same benchmarking tool on it with this benchmark code:

The dict-based signal wins over the array-based signal in both cases, just like it did in my benchmarking when I was initially writing it, so the result from the above benchmark carries over to practical scenarios as well.

Okay, in detail, what Roblox does is slightly more than this. It actually figures out ahead of time which hash bucket is the most likely one to look in and stores that information directly in the VM instruction; it doesn't simply have the string hash pre-computed like normal Lua. Sorry for the not-very-accurate explanation.

The fast-path stuff like expected-hash-bucket indexing all happens right in the core VM loop, before things ever get down into the core slow-path functions like luaH_get.

Okay, in detail, what Roblox does is slightly more than this. It actually figures out ahead of time which hash bucket is the most likely one to look in and stores that information directly in the VM instruction; it doesn't simply have the string hash pre-computed like normal Lua. Sorry for the not-very-accurate explanation.

This is incorrect. In the case of Lua, it doesn't store any of that information in a VM instruction, and nowhere in Luau's documentation does it say the information is stored in a VM instruction. The only thing different is that Luau inline-caches tables for field lookups. Take a look at the header for string objects:

The hash bucket is literally the hash of the key, and each hash bucket is a linked list. When luaH_get calls luaH_getStr (for a string key), it already has the precomputed hash of the key; it then iterates through the linked list, checking whether the key stored in each node is a string equal to the key being looked up.

The fast-path stuff like expected-hash-bucket indexing all happens right in the core VM loop, before things ever get down into the core slow-path functions like luaH_get.

This is incorrect: all of the actual work happens in these two functions, and luaH_get calls one of them depending on the type of the key. You are most likely assuming information on your own at this point.


PS: This is getting very off-topic here; it'd be much better to continue in PMs if you'd like.


That's for vanilla Lua; everything we're talking about here is specific to the Roblox Luau VM, which has diverged significantly from the vanilla one. It goes as far as having modified and additional opcodes in the bytecode format, as well as a completely rewritten core interpreter loop.

Yes, in vanilla Lua using an array is basically always faster than a struct with actual fields, but as you can see from the benchmark above that’s not true for the Roblox Luau VM because of the things I mentioned.

That is correct, but you never cited any sources for the claim that Luau stores that information inside a VM instruction (which seems redundant).

Yes, in vanilla Lua using an array is basically always faster than a struct with actual fields, but as you can see from the benchmark above that’s not true for the Roblox Luau VM because of the things I mentioned.

Yup, what I cited was in terms of Lua. Luau does some voodoo magic for table field lookups, as well as inline caching.

Sorry for the lack of citations; that's me being a bit lazy. The info is buried deep inside many Luau Recap threads / release notes, so it's pretty hard to dig up.

I thought it would be useful info even without a citation, because many people are still mucking up their code under the misconception that it will be faster if they use arrays where they should be using proper structs.


I see! So, to clarify: if I, for instance, have something like this:

for _, Data in next, SomeTable do
     local Math = Data[1] * Time + Data[2] / Data[3] -- etc
end

would it be better to convert Data into a dictionary, then? I'm iterating through ~20,000 elements and indexing at least 2-3 properties for each one, so would that be worthwhile here?

Depends on what your needs are.

If what you care about is perf, then for a 3-5 element item, fields should beat an array. (IIRC, for a 2-element item an array still wins for some reason; somehow the bucket logic must not work out in that case.)

If what you care about above all else is memory, and you have specifically 3 or fewer numeric fields, then you should actually put the three values into a Vector3: since Vector3 is now a native type, you avoid allocating a table to group them, and the numeric data is stored directly in the containing table. This also works for compressing an array of plain numbers by a factor of 3, at the expense of slower access.
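As a rough Roblox-only sketch of the Vector3 packing idea (the field names here are invented for illustration):

```lua
-- Table version: one heap allocation per item, with each numeric
-- field stored through the table's hash part.
local asTable = { x = 1.5, y = 2.5, z = 3.5 }

-- Vector3 version: Vector3 is a native Luau value type, so the three
-- numbers live inline in whatever table holds the value, with no
-- separate per-item table allocation.
local asVector = Vector3.new(1.5, 2.5, 3.5)
print(asVector.X, asVector.Y, asVector.Z)
```

The trade-off described above applies: less memory and fewer allocations, but component access (`.X`, `.Y`, `.Z`) is slower than a plain table field read.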


Doesn’t coroutine.wrap do the same? Where does the performance gain come from?

Yes… but there are other reasons not to use coroutine.resume (if you want to re-use a coroutine you have to use a separate coroutine.resume, not coroutine.wrap): you can't fully correctly implement event:Wait() using coroutine.resume, and it also doesn't integrate as well with the output window and debugger when the code in the coroutine encounters an error.


I am a bit confused on why you said we “can’t fully correctly implement event:Wait()” with coroutines. Could you elaborate on why this is the case?

Using coroutine.resume makes errors disappear.
It also has some other hard-to-explain issues which I haven't gotten around to understanding yet. And I have no reason to, since task.spawn is now so much better for everything that coroutine.resume is only used in really specific scenarios.

He didn't say you can't implement it with coroutines; the task library is one of the main ways to work with coroutines, it's just different and has fewer problems.
Just because it isn't inside the coroutine library doesn't mean it's not coroutine-related.

task.spawn(coroutine, ...) resumes a coroutine (better described as a thread), but has none of the issues that coroutine.resume has.
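To illustrate the difference being described, here's a Roblox-only sketch of the error-reporting behavior:

```lua
-- coroutine.resume swallows the error: it just returns false plus the
-- message, which is easy to ignore, and nothing appears in the output
-- window unless you propagate it yourself.
local co = coroutine.create(function()
	error("handler failed")
end)
local ok, err = coroutine.resume(co)
print(ok, err) -- false, plus the error message

-- task.spawn resumes the work on a thread managed by the engine, so
-- the same error is reported to the output window with a full stack
-- trace, and the debugger can break on it.
task.spawn(function()
	error("handler failed")
end)
```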

I have been using your GoodSignal class for a while now and it's been a breeze to use. However, it seems to keep a strong reference to values passed as arguments after firing, and the reference is only kept if the signal has a connection. Here's some code that displays this behavior.

local ref = setmetatable({},{__mode="v"})

local Signal = require(script.Parent.GoodSignal)

local Object = Instance.new("Part")
table.insert(ref,Object)

local Event = Signal.new()
Event:Connect(function()end)

Event:Fire(Object)
Object = nil

print(ref[1])
wait(2)
print(ref[1])

Output: (screenshot: the Part is still alive at both prints)


To compare, I modified the code slightly to make use of SimpleSignal instead and the Part was able to be GC’ed as expected.

local ref = setmetatable({},{__mode="v"})

local Signal = require(script.Parent.SimpleSignal)

local Object = Instance.new("Part")
table.insert(ref,Object)

local Event = Signal.new()
Event:Connect(function()end)

Event:Fire(Object)
Object = nil

print(ref[1])
wait(2)
print(ref[1])

Output: (screenshot: the Part has been collected by the second print)


Is this behavior related to how the coroutines that run the handlers are recycled? Is this intended?

Oh boy, that is one subtle bug, nice catch! It is indeed unintentional, and here's why it happens:

-- Coroutine runner that we create coroutines of. The coroutine can be 
-- repeatedly resumed with functions to run followed by the argument to run
-- them with.
local function runEventHandlerInFreeThread(...)
    acquireRunnerThreadAndCallEventHandler(...)
    while true do
        acquireRunnerThreadAndCallEventHandler(coroutine.yield())
    end
end

Can you spot the bug? Hint: It’s in the ....

If none of your handlers ever actually yields, the same runEventHandlerInFreeThread thread keeps getting reused... and the ... argument of that thread remains on the stack through all subsequent handler calls, so anything in the ... can't be garbage collected until one of the handlers yields, which forces the creation of a new runEventHandlerInFreeThread thread and eventually lets the old thread (and its arguments) be collected.

So basically, the first set of arguments you pass to a :Fire call after a handler yields will be held onto until the next time any handler yields.

I'm not sure what to do about this. It's a pretty obscure edge case that I don't think will ever actually come up in practice, and even when it does, it would only leak a small fixed amount of memory. Since solving it incurs a slight performance penalty, I'm hesitant to do it when there's so little chance of it mattering. If you want to solve it, you can change runEventHandlerInFreeThread to this:

local function runEventHandlerInFreeThread()
	while true do
		acquireRunnerThreadAndCallEventHandler(coroutine.yield())
	end
end

And the creation of the freeRunnerThread to this:

freeRunnerThread = coroutine.create(runEventHandlerInFreeThread)
coroutine.resume(freeRunnerThread) --> Add this

This basically "clears" the initial step of the coroutine, so handlers are only ever invoked via the yield.

Kinda unfortunate that we can’t load LVM bytecode… if we could I would be able to solve this via LVM assembly at no perf cost.


I’ve actually fixed this on my Lua implementation, and I think it’s worth it to change the behaviour for this on GoodSignal.

It only comes at a really, really small cost, which is even more insignificant given that it only costs more when creating a new thread, which in turn only happens when a handler errors or yields. And if you're yielding, you probably don't care about the performance cost anyhow.

This little aspect might cause confusion for users; even if it's a rare edge case, they could lose a lot of time trying to debug such a hidden issue.

Edit: Made a pull request to make your life easier :)


Also, the GoodSignal links here are not pointing to the repo and are still pointing to the gist. Would you mind fixing that?
