Release Notes for 435

Tomarty · June 25, 2020, 9:01pm

I was actually going to use a global table for all parts, call BulkMoveTo at the end of the frame, then clear the table. This approach seems best for updating unknown quantities of arbitrary parts within small/medium assemblies, as well as within my foliage system where hundreds of simple animated objects are stored in an intrusive-doubly-linked-list and updated/throttled with a fixed time budget.

Here’s what the system I’m ready to test looks like:

local GlobalBulkMoveToCount = {0}
local GlobalBulkMoveToPartList = {}
local GlobalBulkMoveToCFrameList = {}

game:GetService("RunService").RenderStepped:Connect(function()
	
	-- [Update characters and UI here]
	
	-- End of frame:
	local count = GlobalBulkMoveToCount[1]--$const
	if count >= 1 then
		GlobalBulkMoveToCount[1] = 0 -- Reset counter
		
		workspace:BulkMoveTo(GlobalBulkMoveToPartList, GlobalBulkMoveToCFrameList, Enum.BulkMoveMode.FireNoEvents)
		
		for i = 1, count do -- Clear tables
			GlobalBulkMoveToPartList[i] = nil
			GlobalBulkMoveToCFrameList[i] = nil
		end
	end
end)

Here’s how a part is added:

local count = GlobalBulkMoveToCount[1] + 1
GlobalBulkMoveToCount[1] = count
GlobalBulkMoveToPartList[count] = part
GlobalBulkMoveToCFrameList[count] = partCFrame

Here’s an overview of how my skeletal animation system works and how it would use it:

local GetComponents = CFrame.new().GetComponents

-- This creates a fast bone update function for its specific state. This is cheap compared to how many times it will be called on average.
local createFastBoneUpdateFunction = function(bone)
	local getBoneOffset = bone[2] -- This is a function that returns the bone's current offset cframe.
	local partList = bone[3] -- The list of parts connected to this bone and their corresponding offsets. Stored {part1, offset1, part2, offset2, ...}
	local staticFunctionList = bone[4] -- A list of functions connected to this bone, that only need to be updated when the bone's cframe changes. Primarily used by bones with a fixed offset.
	local mobileFunctionList = bone[5] -- A list of functions connected to this bone that need to be called even if the bone's cframe doesn't change. Primarily used by bones that are currently animating.
	
	-- There are a few hundred common cases with auto-generated source code. Here's a documented example:
	if #partList == 2 and #staticFunctionList == 1 and #mobileFunctionList == 1 then
		
		local part1, offset1, staticFunction1, mobileFunction1 = partList[1], partList[2], staticFunctionList[1], mobileFunctionList[1]
		
		return function(cframe, positionThreshold, rotationThreshold)
			-- 'cframe' is the parent bone's resulting CFrame, with the matrix transformed by CFrame.new(0,0,0, scaleFactor,0,0, 0,scaleFactor,0, 0,0,scaleFactor)
			
			cframe = cframe * getBoneOffset() -- Animate the cframe
			
			-- The thresholds are used to reduce the number of part updates, especially for characters that are far away or recently off-screen.
			--  positionThreshold = (basePositionThresholdInStuds * scaleFactor)^2
			--  potationThreshold = math.cos(rotationThresholdInRadians) * (scaleFactor^2)
			
			-- Here we test to see if the cframe has changed enough to update.
			--  This is huge optimization for characters that are standing still, but would benefit an API like 'cframe:FuzzyEq(other, positionThreshold, matrixThreshold)'
			
			local px0, py0, pz0, xx0, yx0, zx0, xy0, yy0, zy0 = GetComponents(bone[1]) -- bone[1] is the bone's last updated cframe
			local px1, py1, pz1, xx1, yx1, zx1, xy1, yy1, zy1 = GetComponents(cframe)
			if (px1 - px0)^2 + (py1 - py0)^2 + (pz1 - pz0)^2 > positionThreshold or xx0*xx1 + yx0*yx1 + zx0*zx1 < rotationThreshold or xy0*xy1 + yy0*yy1 + zy0*zy1 < rotationThreshold then
				bone[1] = cframe -- Set the last updated cframe
				
				staticFunction1(cframe, positionThreshold, rotationThreshold) -- Update staticFunctionList 
				
				-- Update partList
				local count = GlobalBulkMoveToCount[1] + 1
				GlobalBulkMoveToCount[1] = count
				GlobalBulkMoveToPartList[count] = part1
				GlobalBulkMoveToCFrameList[count] = cframe * offset1
			end
			
			mobileFunction1(cframe, positionThreshold, rotationThreshold) -- Update mobileFunctionList
		end
	end
end

Tomarty · June 25, 2020, 9:34pm

I did a recent benchmark and found that foo[#foo + 1] = bar is still faster than table.insert(foo, bar), even when foo has a metatable:

table.insert(foo, bar) can take up to 70% longer than len += 1; foo[len] = bar, so incrementing the length yourself is still fastest.

EDIT: I accidentally left my distributed profiler open all day and now I have 3 billion iterations for each test spread across dozens of clients that followed me into the server:

This is what the tests look like if someone’s curious:

-- reallocate/4/table.insert
Test = function(count, spoof)
	local v = spoof() -- This is nil
	local v1 = spoof(true) -- This is true
	local tick0 = os.clock()
	for _ = 1, count do
		local t = {}
		if v then -- This never runs
			t = spoof(t, v)
		end
		table.insert(t, v1)
		table.insert(t, v1)
		table.insert(t, v1)
		table.insert(t, v1)
	end
	local tick1 = os.clock()
	return tick1 - tick0
end,
TestControl = function(count, spoof)
	local v = spoof() -- This is nil
	local tick0 = os.clock()
	for _ = 1, count do
		local t = {}
		if v then -- This never runs, but ensures the above expression isn't compiled-away because it has no side-effects
			t = spoof(t, v) 
		end
	end
	local tick1 = os.clock()
	return tick1 - tick0
end

tnavarts · June 25, 2020, 9:36pm

FYI: I had to remove the FireNoEvents option in v437, it was not stable enough. It turns out that in some edge cases the rendering system does need CFrame changed to fire in order to not get graphics artifacts. Fortunately you should still be able to capture enough performance improvement to make it worth it even with FireCFrameChanged.

		for i = 1, count do -- Clear tables
			GlobalBulkMoveToPartList[i] = nil
			GlobalBulkMoveToCFrameList[i] = nil
		end

That’s the part I’m talking about. Try changing this to:

	GlobalBulkMoveToPartList = table.create(512)
	GlobalBulkMoveToCFrameList = table.create(512)

With whatever number comfortably fits your typical part load. I suspect that reallocating like that will actually be faster than looping through the whole table to clear it.

Hexcede · June 25, 2020, 10:33pm

I haven’t heard anywhere that it invokes __len I just know that it invokes a length check and I assumed it involves using the same internal mechanism that the # operator uses (in the docs it states that the default value is equivalent to “#table” which is why I’d assumed this in the first place) and it’d use __len in that case. When I had asked @zeuxcg prior he had said tables don’t store an array length counter internally already which I’d suggested would likely be an optimization for table.insert because I’d assumed my efficiency issue was related to the length mechanism being used by the function (If I’m entirely wrong about the slight efficiency issue I’ve experienced with table.insert being related to the length mechanism lmk).

I also believe that it isn’t the case that it’s O(1) to make a table.insert call because again in my benchmarks I noticed table.insert was adding a (relatively) significant slowdown at a few hundred thousand calls or so. I just know that the length variable method I suggested had strongly improved my benchmarks at high iterations. and thus, I try to avoid table.insert where I can at this point. It’s likely overkill when you aren’t doing it on a mass scale but at the same time, it’s also not painful to implement or unreadable or anything, especially with += (thank you for this by the way this is my favorite syntactic change so far).

Additionally, if you refer to @Tomarty’s benchmark above he states that table.insert is about 70% slower. Again, I can’t be 100% sure this has to do with the length mechanism, Zeuxcg will probably be the one to confirm or deny that, but, that’s my assumption.

Also @tnavarts when I saw FireNoEvents get removed I had the big sad because I wanted to be able to desync parts from the server without doing mega hacky stuff. (Mainly because I can keep my terrain translation server side and per player I can translate terrain and players around them to massively mitigate deadlands artifacting, as well as being able to prevent wallhacks through making players appear in other locations)

Autterfly · June 25, 2020, 11:19pm

It seems that it’s partly right-ish. From looking at the source it’s unclear whether table.insert would append to the array or hash part, but it looks like for non-prealllocated tables it will prefer the hash part. It doesn’t call or search for __len, but it does count up how many items are present with the same binary search behavior as #. It could very well be part of the overhead you see, but at least it’s not linear in time. I reckon most of the overhead is from the call to table.insert itself, since Lua function calls are relatively expensive.

The part about tables not storing array length internally though doesn’t seem right. The 5.1 source code has a field called sizearray under the table struct which is used. When using # or getting the length of a table with only an array part, this value is read, which is why it’s O(1) behavior.

Tomarty · June 25, 2020, 11:28pm

I still really like the idea of BulkMoveTo from a performance perspective, but I feel like the Position/Orientation/Rotation performance problem could be solved through internal design changes.

I remember it looking something like this:

void BasePart::setCFrame(CFrame value)
{
	if (value != this->getCFrame())
	{
		Math::Orthonormalize(value);
		
		updatePhysics(value);
		updateGraphics(value);
		
		propertyChanged(property_CFrame);
		
		// This is almost always unnecessary for demanding use-cases
		propertyChanged(property_Position);
		propertyChanged(property_Orientation);
		propertyChanged(property_Rotation);
	}
}

Would be possible for these events be non-existent, but then piggy back off of CFrameChanged when they are requested or connected to? The Position/Orientation/Rotation updates are already unnecessary in almost all cases.

I’d be very interested in the possibility of BulkMoveMove.GraphicsOnly that doesn’t orthonormalize the cframe or update physics. One would be able to cheaply animate the resizing of a part/mesh graphically by transforming its cframe by CFrame.new(0,0,0, scaleX,0,0, 0,scaleY,0, 0,0,scaleZ). This would be useful for efficient character and foliage graphics animations that don’t require physics or raycasting, as well as for efficient model resize previews (although it’s likely material scale wouldn’t update, but would still be great for character resizing.) Cheap scale/shear graphical transformations are pretty standard in most engines.

tnavarts:

That’s the part I’m talking about. Try changing this to:
	GlobalBulkMoveToPartList = table.create(512)
	GlobalBulkMoveToCFrameList = table.create(512)
With whatever number comfortably fits your typical part load. I suspect that reallocating like that will actually be faster than looping through the whole table to clear it.

The problem is that the global tables are referenced by various modules.

I like this implementation because it uses fewer globals/upvalues, but it does require 2 extra __index operations. I may need to profile against some of these to see if there’s any difference, but I suspect it’s insignificant relative to the BulkMoveTo operation.

local GlobalBulkMoveToState = {0, {}, {}}

game:GetService("RunService").RenderStepped:Connect(function()
	
	-- ...
	
	-- End of frame:
	local len = GlobalBulkMoveToState[1]
	if len >= 1 then
		GlobalBulkMoveToCount[1] = 0
		workspace:BulkMoveTo(GlobalBulkMoveToState[2], GlobalBulkMoveToState[3], Enum.BulkMoveMode.FireCFrameChanged)
		
		len += 32 -- Add padding to reduce the chance of reallocation
		GlobalBulkMoveToState[2] = table.create(len)
		GlobalBulkMoveToState[3] = table.create(len)
	end
end)

Usage:

local len = GlobalBulkMoveToState[1] + 1
GlobalBulkMoveToState[1] = len
GlobalBulkMoveToState[2][len] = part
GlobalBulkMoveToState[3][len] = partCFrame

Does it matter if the part has any relation to the WorldRoot? If so I may need to come up with a different solution for ViewportFrame characters.

This is great, but I’m hoping vertex deformation will eventually allow me to seriously cut back on the number of parts I’m using. I’m curious if updating bones could be optimized in a similar way.

Definitely take your time with this. There’s a lot of potential for great optimizations, but it’s also a slippery slope for API bloat long-term.

tnavarts · June 26, 2020, 12:03am

I tried really hard to figure out a way to change this event stack in a general way before resorting to this API and my conclusion was that it would take a truly massive refactor to make this work in an acceptable way. TL;DR: signals are shared by Lua and C++ right now in an opaque way (firing a signal may invoke an arbitrary mix of Lua and C++ code), and that would have to change to support a generic performance improvement here.

Maybe someone braver than me will tackle it in the future, but there’s also the fact that simply crossing the reflection boundary to access a property a ton of times is slower than the table approach, and there’s no way to fix that safely on the horizon.

The problem is that the global tables are referenced by various modules.

Oh, yeah, I see what’s going on now. Different suggestion though: If you’re already passing around the count… why not never separately clear the table? You could just clear the extra left over from the last frame (if there’s fewer this frame than the last one) right before you invoke BulkMoveTo to make the lists the right size.

Does it matter if the part has any relation to the WorldRoot?

No. I wanted to support exactly that case case. If you try it out you will see that the LuaDraggers actually let you drag stuff which is inside of a ViewportFrame.

zeuxcg · June 26, 2020, 12:05am

sizearray is not the length of the table. It can be both smaller and larger.

As a result, # can be:

O(1) when sizearray is precise (true for arrays created through table constructors or table.create with value argument)
O(logN) when table has been expanded (true for arrays that grow using table.insert/et al, even if they start with table.create())
O(N) in some crazy contrived cases that I don’t know off the top of my head how to replicate easily.

I would not advise to treat this as a permanent fact since it will likely change. It might even be the case that appending will become as fast as assigning to known indices, although right now there’s indeed a fundamental difference between these, see guidance in https://roblox.github.io/luau/performance.html#creating-and-modifying-tables

Autterfly · June 26, 2020, 12:15am

I didn’t quite word my post right. At the end I was referring to the fact that sizearray is used when the table does not have a hash part, but also the last element of array isn’t nil.

As for the O(n) case it looks like you might purposefully need to go out of your way for it as the Lua comment just states “table was built with bad purposes: resort to linear search”.

Tomarty · June 26, 2020, 12:36am

That’s too bad. Have you considered ignoring Position/Rotation/Orientation by default until then? So long as it’s documented that the event behavior may change in the future I don’t think it would be a problem. The enum does have possible applications beyond events, like ignoring physics or skipping orthonormalization, but I’m not sure if FireAllEvents and FireCFrameChanged should be two separate enums, at least not while a refactor is possible. I’d be worried that preserving FireCFrameChanged’s behavior might get in the way if a refactor like that is ever done.

That’s a really great idea. I was initially worried about some parts/cframes failing to GC, but the reference would only hang around until the next frame when the end is cleared. Would there be any harm if the end of the cframeList wasn’t cleared and #cframeList > #partList? The part list could clear its end every frame before calling BulkMoveTo, while the cframe list could clear the end every N frames to clear references to cframes from old busy frames.

This has a lot more potential than one would think. My game calculates the cframes to do smooth animations for lots of parts, but I find myself cutting back because setting part.CFrame takes up half of my used frame time and I want it to run well on mobile. At the very least I’ll greatly increase animation smoothness and likely be able to run 50% more characters.

tnavarts · June 26, 2020, 12:54am

This would be a huge departure to make: Everything fires changed events right now. Having one case which doesn’t would be very surprising. If we were going to do that it would make more sense to do that for all “derived” properties. However, you do want changed events for most other derived properties, for instance, AbsoluteSize and Position on GUI objects.

I put a lot of thought into this and I’m pretty sure there’s no nice elegant solution to it. All the solutions are either kind of inelegant API wise or very difficult implementation wise.

Would there be any harm if the end of the cframeList wasn’t cleared and #cframeList > #partList?

The harm would be: it will throw an error if you do that! I had to make this case throw an error, because it would be too easy to get a small unnoticed edge case breakage if this were silently allowed.

zeuxcg · June 26, 2020, 1:59am

The cost of table.insert comes in two parts:

The cost of calling a C function from VM. table.insert isn’t currently using the fastcall mechanism that we use for optimizing builtins; we have experimented with this and doing this puts it at the same performance or a touch better than #t assignment, so this will happen at some point after a few other optimizations land.
The cost of computing the length of the table, which for expanding arrays usually involves a binary search. We have plans to eliminate this cost in the future, but this is a somewhat complex change so it’s on the back burner atm.

Tomarty · June 26, 2020, 3:09am

The difference is only that high when the table is preallocated which isn’t a common use-case. The difference between reallocate and preallocated in my tests is that reallocate starts with {} and preallocated starts with {nil, nil, nil, nil} before inserting, so it really just shows that you should preallocate your tables. It’s also possible that __len is just slower when the table is preallocated with lots of empty fields. These results would also likely vary a lot when testing different table sizes (I’m just testing 1-4).

According to new data from my reply, when filling an empty array with 4 values (a more common use-case):

table.insert(t, v) takes 24% more time than p += 1; t[p] = v on average.
table.insert(t, v) takes 14% more time than t[#t + 1] = v on average.

table.insert is likely to be optimized down the road though. If you really need performance go with incrementing, otherwise just stick with your preference for now.

Hexcede · June 26, 2020, 3:17am

This is a little curiosity I have about Changed events… Is there a particular reason for them being as slow as they are? I.e. does it involve some sort of weird “we have to create extra lua threads even if we don’t connect from lua” situation? I almost wonder if it’d be a good idea that some properties can have unique mechanisms which replace changed events, for example, since Orientation, CFrame, etc (afaik) don’t need to fire .Changed and don’t need to fire :GetPropertyChangedSignal() events would it be possible to convert these into some sort of “psuedo-event call?”

Also, @zeuxcg, that stuff is super awesome to hear, I was hoping table.insert would soon be able to replace the 40 or so extra lines I have haha. (To be honest, the more I can rely on a builtin to do stuff the better). Additionally, obviously I’m sure there’s a good reason for not doing this that I’m unaware of, but, is there any particular reason for not storing table length on the table itself? I.e. how lua was using getn and setn where a .n property got set (obviously there are a billion reasons why this in particular is ew like how the n index gets taken up and you can effectively spoof table lengths, which, would be genuinely terrible)

Lastly, @Tomarty thanks for pointing this out to me, I misunderstood haha. With the lowest, 14%, that’s a quite decent improvement if you take into account where I experienced a lot of slowdown to begin with… Over a hundred thousand iterations you’d save about 14000 iterations overall in that case and 24000 iterations with the better option. In my case I’d get that sweet 70% benefit since I heavily preallocate everything hence why the difference is so much for me saving me nearly 70000 iterations worth of time if I were to use table.insert.

Tomarty · June 26, 2020, 3:47am

When I say 70% longer, I mean that it takes 1.7x the processing time to complete compared to the fast option. Performance percentages can be notoriously confusing to communicate. Often someone doesn’t understand that the reciprocal will get you a completely different percentage (41% faster = 70% slower), but I think we just confused it with a ratio here. It also looks like it’s closer to 1.33x the processing time according to the average over the afternoon.

Hexcede · June 26, 2020, 4:00am

Ah this is a good point. So then, looking at the actual algebra of this, it should take roughly 1.7x the amount of iterations to reach the same amount of time, and if it’s 70% slower, that’ll be numberOfFasterIterations * 1.7 = 100000 so instead of 70k iterations saved (equivalent to 30k) it’s 100000/1.7 or ~58k iterations, resulting in about 41176 iterations saved.

( Let’s hope this time I haven’t did oops again)

And lastly, if I’m understanding correctly, it’s unfortunate to hear that it is actually less of a boos on average.

tnavarts · June 26, 2020, 4:51am

Check out game.ItemChanged. If you consider that that exists you might not be as surprised that there’s actually more work than you expected going on when you modify a property.

Tomarty · June 26, 2020, 5:08am

This is still active? There are so many things fundamentally wrong with that event. How would it even behave for CoreGui’s? Does it special-case for that by traversing the game tree?

We must gather stats on it’s usage, notify devs, and ensure its swift destruction. I will personally recode all of the old games manually if it’s the only way.

Hexcede · June 26, 2020, 6:06am

I’ve used the event in the past for detecting certain noobsploits (haha I just made up a word for poorly written exploits). Afaik it fires for any object despite context level and even objects in nil. It was (might still be, not sure) possible to do something like CoreGui:IsAncestorOf(object) as well and thus you could probably detect a whole number of things you probably shouldn’t (especially when you could tostring locked objects because then you could likely narrow it down to things like detecting the click of a button or the scroll of a window, maybe even identifying tab switching in settings for example).

tnavarts · June 26, 2020, 6:10am

Oh no… you’re dumpstering your performance if you do that. Nothing’s actually listening on that event normally in a live game (it’s only used for stuff in studio), so the engine can skip firing it (though it still has to go down the relevant call chains to the point where it checks it). If you listen on it now it has to fire it for all those prop changes, ouch.