I’m hoping to better understand some of the best practices around Parallel Luau, primarily with respect to SharedTables.

One thing I’ve noticed while using SharedTables is that they take significantly longer to write to than regular tables, which is understandable given that there is likely some form of memory replication across the various VMs involved in parallel execution. In my experience, writing to any part of a SharedTable takes a minimum of about 0.003ms per write, and up to 0.01ms. That may not sound like much, but it adds up quickly when you’re trying to perform heavy processing across several threads. I’ve found that writing to a standard temporary table and then overwriting a larger chunk of the SharedTable in one assignment is a bit faster, though there’s some added overhead from “cloning” the data out of the SharedTable. For example:
```lua
local st = SharedTable.new()
st.test = {}
st.test.x = 1
st.test.y = 2
st.test.subTable = {
    x2 = 1,
    y2 = 1,
} -- This gets converted into a SharedTable automatically, from what I can tell.

local st_clone = {
    subTable = {},
}

while task.wait() do
    debug.profilebegin('shared_table_test')
    st.test.x += 1
    st.test.y += 1
    st.test.subTable.x2 += 1
    st.test.subTable.y2 += 1
    debug.profileend()

    debug.profilebegin('shared_table_overwrite_test')
    st_clone.x = st.test.x + 1
    st_clone.y = st.test.y + 1
    st_clone.subTable.x2 = st.test.subTable.x2 + 1
    st_clone.subTable.y2 = st.test.subTable.y2 + 1
    st.test = st_clone
    debug.profileend()
end
```
The MicroProfiler then shows the following:

[MicroProfiler capture: shared table direct write test]
[MicroProfiler capture: shared table clone and overwrite test]
The first capture shows that cloning the data and overwriting the SharedTable in a single assignment is faster than updating it directly. It also correlates with my finding that each write takes approximately 0.003ms: in the direct-write test we write to the SharedTable 4 times, so we’d expect ~0.012ms (0.013ms measured), while the clone-and-overwrite test comes in at 0.006ms because we only write to the SharedTable once, plus some overhead from cloning the data.

When increasing the number of writes and values to clone, this “optimization” does continue to yield faster results, but the “cloning” overhead doesn’t seem to scale all that well; it ends up being about a 25% improvement over direct writes in my experience. However, in cases where you need to write to the same value potentially many times (really, just more than once), this optimization shines. Writing to a standard table is likely tens of times faster than writing to a SharedTable, so in this example:
```lua
local st = SharedTable.new()
st.test = {}
st.test.x = 1
st.test.y = 2
st.test.subTable = {
    x2 = 1,
    y2 = 1,
} -- This gets converted into a SharedTable automatically, from what I can tell.

local st_clone = {
    subTable = {},
}

while task.wait() do
    debug.profilebegin('shared_table_test')
    st.test.x += 1
    st.test.y += 1
    st.test.x += 1
    st.test.y += 1
    st.test.x += 1
    st.test.y += 1
    st.test.x += 1
    st.test.y += 1
    st.test.subTable.x2 += 1
    st.test.subTable.y2 += 1
    st.test.subTable.x2 += 1
    st.test.subTable.y2 += 1
    st.test.subTable.x2 += 1
    st.test.subTable.y2 += 1
    st.test.subTable.x2 += 1
    st.test.subTable.y2 += 1
    debug.profileend()

    debug.profilebegin('shared_table_overwrite_test')
    st_clone.x = st.test.x + 1
    st_clone.y = st.test.y + 1
    st_clone.x += 1
    st_clone.y += 1
    st_clone.x += 1
    st_clone.y += 1
    st_clone.x += 1
    st_clone.y += 1
    st_clone.subTable.x2 = st.test.subTable.x2 + 1
    st_clone.subTable.y2 = st.test.subTable.y2 + 1
    st_clone.subTable.x2 += 1
    st_clone.subTable.y2 += 1
    st_clone.subTable.x2 += 1
    st_clone.subTable.y2 += 1
    st_clone.subTable.x2 += 1
    st_clone.subTable.y2 += 1
    st.test = st_clone
    debug.profileend()
end
```
We see far greater gains from cloning the SharedTable’s data and overwriting it at the end. See the MicroProfiler results:

[MicroProfiler capture: shared table direct write test]
[MicroProfiler capture: shared table clone and overwrite test]
This shows that cloning the table in this case yields nearly 325% faster results than writing to the SharedTable directly. It doesn’t quite agree with my previous finding that a write to a SharedTable takes approximately 0.003ms, though; I’m not entirely sure why that is, and perhaps someone might have some insight into it.
I’m sure these times vary quite a bit across different CPUs, but they’ve been extremely consistent in my testing on my own PC.
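If anyone wants to reproduce the per-write timing on their own machine, here’s a minimal sketch of the kind of micro-benchmark I mean, using `os.clock`. The iteration count and variable names are my own, and the absolute numbers will of course differ per CPU:

```lua
-- Rough micro-benchmark: average cost of a single SharedTable write
-- versus a regular table write. Run inside a Script in Roblox.
local st = SharedTable.new()
local plain = {}
local ITERATIONS = 100_000 -- arbitrary; large enough to smooth out noise

local startTime = os.clock()
for i = 1, ITERATIONS do
    st.value = i -- repeated SharedTable writes
end
local sharedElapsed = os.clock() - startTime

startTime = os.clock()
for i = 1, ITERATIONS do
    plain.value = i -- repeated plain-table writes for comparison
end
local plainElapsed = os.clock() - startTime

print(("SharedTable write: %.6f ms avg"):format(sharedElapsed / ITERATIONS * 1000))
print(("Plain table write: %.6f ms avg"):format(plainElapsed / ITERATIONS * 1000))
```

This only measures serial write cost, not contention between parallel threads, so take it as a lower bound.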
Perhaps the Actor Messaging API is faster? I really doubt it is, but I have yet to test this.
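For anyone who wants to test that comparison, here’s a rough, untested sketch of what the Actor-messaging approach might look like, where each Actor keeps its own plain table instead of sharing state through a SharedTable. The topic name and payload shape here are made up for illustration:

```lua
-- Inside a Script parented to an Actor instance.
local actor = script:GetActor()

-- Each Actor owns its own plain Luau table; no cross-VM writes needed.
local state = { x = 0, y = 0 }

-- Runs in parallel when a message arrives on this topic.
actor:BindToMessageParallel("Increment", function(dx, dy)
    state.x += dx
    state.y += dy
end)

-- Elsewhere, e.g. a coordinator script holding a reference to this Actor:
-- someActor:SendMessage("Increment", 1, 1)
```

The tradeoff is that messages are deferred and copied, so this likely wins only when writes vastly outnumber the messages needed to batch them.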
If any of you have more experience with Parallel Luau than I do (and I don’t have much) then please feel free to chime in and help drive this conversation!
I’m really curious to learn more about Parallel Luau as it is today in Roblox and how others are using it.