Ability to spawn threads with proper error handling

As a Roblox developer, it is currently too hard to spawn threads with proper error handling, and spawn/delay cannot take additional arguments.

If Roblox is able to address this issue, it would improve my development experience because it would allow me to cleanly spawn a function with arguments but still get proper error handling.

Problem

spawn and delay don’t accept additional arguments
coroutine.resume eats errors
xpcall gives you the correct stack trace, but there is no way to output it as a real error

Ideally I would want to have full control over the arguments given to whatever thread I am spawning without making it impossible to get proper error messages.
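Here is a small plain-Lua sketch of the first two problems. It is written for Lua 5.2+; spawn and delay are Roblox-only, so they are not called here. bindArgs is a hypothetical helper showing the usual closure workaround for spawn/delay not forwarding arguments. The coroutine.resume call shows an error coming back as a quiet false plus a message, with the stack trace only recoverable as a string via debug.traceback.

```lua
-- Hypothetical helper: capture arguments in a closure, the usual workaround
-- for spawn/delay not forwarding extra arguments.
local function bindArgs(func, ...)
	local args = table.pack(...)
	return function()
		return func(table.unpack(args, 1, args.n))
	end
end

local function greet(name, punctuation)
	return "hello " .. name .. punctuation
end

local bound = bindArgs(greet, "world", "!")
print(bound()) --> hello world!

-- coroutine.resume does not output errors; it just returns them.
local thread = coroutine.create(function(value)
	error("bad value: " .. tostring(value))
end)
local ok, err = coroutine.resume(thread, 42)
-- Nothing has been printed at this point: ok is false and err holds the message.
-- The stack trace can still be recovered from the dead thread, but only as a string.
local trace = debug.traceback(thread, err)
```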

Potential solutions

There are a bunch of potential solutions I can come up with. #6 is the most general and most useful one, and it’s the one I personally prefer.

  1. Add an option or direct alternative to coroutine.resume that doesn’t eat errors (Not a clean solution but decent)
  2. Stop eating errors from coroutine.resume (Might be annoying, and could cause lag due to error spam)
  3. Pass extra arguments through spawn/delay (Doesn’t fix most issues, keeps two useless arguments)
  4. Add an option or direct alternative to them that lets you override arguments (Not a clean solution)
  5. Add a new way to spawn threads or functions with the two core problems solved (Fixes most of the problems just fine)
  6. Add a way to output “fake” errors (e.g. as debug.error) or allow error to take threads too. (Fixes all the issues, generalized, adds new uses, but most complex)

It could accept a level or a thread, just like error accepts a level, but it wouldn’t kill the caller.

Example:

local success, err = coroutine.resume(thread, ...)

if not success then
	debug.error(err, thread) -- Outputs the error like it came from the thread
end

This would also let you return the thread’s results, assuming the body doesn’t yield, which could be very useful.
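debug.error doesn’t exist, so here is a rough plain-Lua (5.2+) stand-in to show the intended behaviour. reportThreadError and resumeAndReport are hypothetical names, and the output table plus print stand in for the real output window. The message is formatted with the thread’s stack via debug.traceback(thread, message), the caller keeps running, and a body that finishes without yielding hands its results straight back.

```lua
-- Collected reports; stands in for the output window in this sketch.
local output = {}

-- Hypothetical stand-in for the proposed debug.error(err, thread): it formats
-- the message with the thread's stack (not the caller's) and returns
-- normally instead of raising, so the caller keeps running.
local function reportThreadError(err, thread)
	local text = debug.traceback(thread, tostring(err))
	output[#output + 1] = text
	print(text)
end

-- Resume once, report an error if there is one, and pass every result through.
local function resumeAndReport(thread, ...)
	local results = table.pack(coroutine.resume(thread, ...))
	if not results[1] then
		reportThreadError(results[2], thread)
	end
	return table.unpack(results, 1, results.n)
end

-- A body that finishes without yielding hands its return values straight back.
local ok, sum = resumeAndReport(coroutine.create(function(a, b) return a + b end), 2, 3)

-- A body that errors is reported, and execution continues on the next line.
local ok2 = resumeAndReport(coroutine.create(function() error("oops") end))
```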

(P.S. You can confirm that threads remember the error location with the new debug.info function: pass the thread, 1 for the first level, and "l" to get the thread’s current line number.)
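In standard Lua 5.2+, debug.getinfo(thread, level, "l") is the closest equivalent to Roblox’s debug.info(thread, 1, "l"), so the check can be sketched without Roblox. It relies on an implementation detail I’m assuming here: after a thread dies from an error, its stack is left in place, with the C function error at level 0 and the erroring Lua function at level 1.

```lua
-- Filled in by the thread itself so the check doesn't depend on file layout.
local ERROR_LINE

local thread = coroutine.create(function()
	ERROR_LINE = debug.getinfo(1, "l").currentline + 1
	error("failure inside the thread")
end)

local ok = coroutine.resume(thread)

-- The dead thread still remembers where it failed (level 1 = the Lua function
-- that called error; level 0 is error itself).
local info = debug.getinfo(thread, 1, "l")
print(ok, info.currentline == ERROR_LINE)
```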

Uses

Here are some of the examples of why this would be useful.

You would never have to catch an error and then hackily output it with warn or print. It could also eliminate the need for xpcall, which is bulky.

If you’re using LogService to collect errors in your games, you don’t need extra code paths for the above.

It makes it possible to jump to the correct location by clicking the stack trace.

Having cleaner errors when spawning threads makes this a lot easier to debug:

-- Assuming #6, this is how you could define betterSpawn (with #5, a built-in would simply replace this):
local function betterSpawn(func, ...)
	local thread = coroutine.create(func)
	local success, err = coroutine.resume(thread, ...)

	if not success then
		debug.error(err, thread) -- Would output the error perfectly (But execution continues below)
	end
	return success, err
end

local function addChild(child)
	-- Stuff
end

-- I use this pattern a lot for things like Players:GetPlayers() and Players.PlayerAdded especially
for _, child in ipairs(something:GetChildren()) do
	betterSpawn(addChild, child) -- Spawn it with one of the solutions
end
something.ChildAdded:Connect(addChild)

Assuming you can output custom errors as in #6, you could use this for better debugging in general.

It would also make it easier to reverse engineer obfuscated code with tools, meaning developers could patch exploit scripts faster. This is really useful with loadstring and/or for spawning “fake” scripts, as in my prototype script sandboxing tool (which is designed for a use case like this).


A prior proposal that addresses the same problem:

I’ve also considered a similar solution, where a function like debug.error emits an error without killing the caller. The issue with this and betterSpawn is that it doesn’t resolve the question, “who owns the thread?”. When you resume an arbitrary thread, it could yield to the engine (via wait, a yielding function, etc.), which assumes control of the thread. Or, it could just call coroutine.yield, or throw an error without ever touching the engine.

The resumer must do nothing for the first case, and let the engine take care of the thread. But the resumer must also handle errors for the second case, or they will be “eaten”. This contradiction can only be solved by cooperation from the thread (so it can no longer be arbitrary), or careful sandboxing of yielding functions (the simplest would be to wrap coroutine.yield to return a sentinel value that causes the resumer to avoid giving ownership to the engine).
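The sentinel approach mentioned above could be sketched like this in plain Lua 5.2+. Everything here is hypothetical: sandboxYield is the wrapped coroutine.yield that would be handed to sandboxed code, and a bare coroutine.yield() stands in for a yield to the engine (e.g. wait), which looks exactly the same to the resumer. The resumer inspects the first yielded value to decide who owns the thread afterwards.

```lua
-- Unique marker; only the wrapper below produces it.
local SENTINEL = {}

-- Hypothetical replacement for coroutine.yield given to sandboxed code.
local function sandboxYield(...)
	return coroutine.yield(SENTINEL, ...)
end

-- Resume a thread once and classify what happened.
local function step(thread, ...)
	local results = table.pack(coroutine.resume(thread, ...))
	if not results[1] then
		-- Error before yielding: the resumer must report it.
		return "error", results[2]
	elseif coroutine.status(thread) == "dead" then
		-- Body returned normally.
		return "returned", table.unpack(results, 2, results.n)
	elseif results[2] == SENTINEL then
		-- Sandboxed yield: the resumer keeps ownership and must resume it later.
		return "user-yield", table.unpack(results, 3, results.n)
	else
		-- Anything else is treated as a yield to the engine: hand ownership over.
		return "engine-yield"
	end
end

local kindA, payloadA = step(coroutine.create(function() sandboxYield("hi") end))
local kindB = step(coroutine.create(function() coroutine.yield() end)) -- stands in for wait()
local kindC, errC = step(coroutine.create(function() error("bad") end))
local kindD, x, y = step(coroutine.create(function() return 1, 2 end))
```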

Because of the additional work and complexity required to resolve this issue, debug.error cannot be considered a good solution for the presented problem. The resumeThread of my proposal would resolve the issue by giving ownership of the thread entirely to the engine in the first place. The implementation would actually be very similar to betterSpawn, except that it wouldn’t return any values.

I agree that there would be utility in having a debug.error function, but not for the stated problem.


It’s funny that the first reply on that post is from me; I must have forgotten about it. I wasn’t able to find it when searching for existing posts on this.

After reading what you said, I have to agree that debug.error isn’t a great solution to this specific problem; it just doesn’t make much sense there. But it’s definitely still a solution.

Who owns the thread is already defined in every case; the difference between spawn and the specific example I gave is who owns the thread at the start. With spawn the engine owns the thread, but with the Lua betterSpawn the caller owns the thread, just like with coroutine.resume (since that’s what’s going on already).

Currently, when you call wait from a coroutine, the resumer gets true back with no extra data, and if the thread errors after the wait resumes, the engine outputs that error. When you call coroutine.yield(), the resumer gets true back along with any arguments passed to coroutine.yield(). When the thread throws an error before yielding, the resumer gets back false with the error message. This is the only case where you would want to handle the error yourself with debug.error, and the effect is equivalent to spawn, except that you can pass arguments.

The benefit of the way I’ve defined it is that if the thread body doesn’t error, and calls coroutine.yield or returns, you get the results back. If you are the thread doing the resumption, you are the owner, and you get back whatever the thread passes to coroutine.yield (or returns, if it dies before then). You just need to handle every resume call you make, the same way you would already handle this case.

Basically, whatever context does the resumption is the owner, which is how it already works with spawn.

You do bring up good points. If the thread yields, the betterSpawn function I defined would miss any error raised after that yield, so it would still get “eaten”. A built-in betterSpawn would have a similar issue with coroutine.yield. With wait or other generic yields, where the engine is in control, errors would not be returned to betterSpawn, and for a built-in you have the problem of deciding whether an error should be thrown or not. With the former (coroutine.yield), I would expect whichever thread later resumes it via coroutine.resume to take control, just like how it already works with spawn.

Example:

local t = coroutine.create(function(...)
	wait() -- Passes off ownership
	error(...) -- Error is outputted
end)
print(coroutine.resume(t, "abc")) -- Prints true (Thread didn't error)

But that is not really any different from spawn, other than who does the handling. With my example betterSpawn, the script doing the spawning replaces the engine. In other words, the way I wrote it, Lua code takes over the engine’s error-handling work.

I think solution #5, which is basically what you proposed, does make more sense. debug.error is not a direct solution to this problem; it’s what you’d want if you need much finer control over what’s going on.

I would think something like debug.error could just “pretend” to create an error without touching the real thread at all, but I suspect that’s not something the engine can do easily. In other words, it wouldn’t kill the caller or the target; it would just create a pretend error message and output it to the console. Basically, it would behave like print/warn, but produce a pseudo error message instead.


Thinking about it more, you don’t need to do anything special if betterSpawn is to behave like spawn (minus the delay). Within an engine-owned thread, calling coroutine.yield basically kills the thread; the engine does nothing, and won’t make any attempt to resume it later. betterSpawn, wanting to behave like spawn, can apply this behavior as well, which simplifies the implementation back down to what you already had in the OP. Only if you wanted coroutine.yield to do things like resume the thread later would you have to get into complicated things with sentinel values and whatnot.

My only gripe then is that betterSpawn returns the error. It implies that the caller will always have control over the thread, and that betterSpawn will somehow always return the error even if it occurs after yielding to the engine, instead of just after the first resume. This sort of thing is something that confuses a lot of people, which is why coroutine.resume gets misused so often. That was another part of the reason why I originally ruled out a debug-error-like solution.
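With that gripe in mind, a betterSpawn variant that returns nothing (close in spirit to the no-return resumeThread described earlier) might look like the plain-Lua 5.2+ sketch below. spawnWithArgs and reportError are hypothetical names, and print plus the reported table stand in for the output window. It only reports errors raised before the first yield; a thread that does a bare coroutine.yield is simply dropped, mirroring how spawn treats coroutine.yield.

```lua
-- Hypothetical error sink; a real engine would write to the output window.
local reported = {}
local function reportError(thread, err)
	local text = debug.traceback(thread, tostring(err))
	reported[#reported + 1] = text
	print(text)
end

-- Hypothetical spawn-like helper: runs func(...) on a new thread right away,
-- reports an error raised before the first yield, and deliberately returns
-- nothing, so callers can't mistake a result for the thread's final outcome.
local function spawnWithArgs(func, ...)
	local thread = coroutine.create(func)
	local ok, err = coroutine.resume(thread, ...)
	if not ok then
		reportError(thread, err)
	end
end

local seen = {}
local function addChild(child)
	if child == "bad" then
		error("cannot add " .. child)
	end
	seen[#seen + 1] = child
end

for _, child in ipairs({ "a", "bad", "b" }) do
	spawnWithArgs(addChild, child) -- the loop keeps going after "bad" fails
end
```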

Interestingly, implementations that are almost exactly the same as betterSpawn have been written over and over again. Here’s me with one last year:

And here’s another one sometime later:

https://devforum.roblox.com/t/what-are-bad-programming-habits-that-you-see-often-by-scripters/761033/234

And another all the way from 5 years ago:


Yep. The whole point of resuming a thread is that the thread runs separately. It feels weird that you only sometimes get results from it, but it makes perfect sense once you consider that, to always return results, it would have to yield the caller until the thread finished. It feels like it’s implying you’ll always have control, but at the same time it makes perfect sense; it’s just counter-intuitive no matter how you implement it.

With spawn, if you export the thread somehow and then call coroutine.yield, you can later call coroutine.resume externally and resume a thread that would appear to be dead. The error is then no longer tracked by the engine; it’s tracked by the thread that called resume.

Also, fun fact: if you coroutine.yield a thread and don’t hold a reference to it, it can actually be garbage-collected, since there is nothing left to resume it. I thought that was very neat; it means that permanently yielded threads (with the exception of things like WaitForChild or wait) can actually be collected.
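The garbage-collection observation can be checked in standard Lua 5.2+ with a weak-valued table. This is a sketch under assumptions: it calls collectgarbage("collect"), which as far as I know Roblox scripts can’t do, and it says nothing about threads parked by the engine (wait, WaitForChild), which the scheduler keeps referenced. parkThread and tracker are hypothetical names.

```lua
-- Weak-valued table: its entries do not keep their values alive.
local tracker = setmetatable({}, { __mode = "v" })

-- Park a thread with coroutine.yield and keep only a weak reference to it.
-- Returns the thread's status right after parking.
local function parkThread()
	local thread = coroutine.create(function()
		coroutine.yield()
		-- Never reached: nothing holds a reference that could resume it.
	end)
	coroutine.resume(thread)
	tracker.parked = thread
	return coroutine.status(thread)
end

local statusBefore = parkThread() -- "suspended"
collectgarbage("collect")
-- With no strong references left, a full collection clears the weak entry.
local collected = tracker.parked == nil
print(statusBefore, collected)
```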

I actually have to spawn the sandboxed code in a thread to basically “reset” the call stack, since the call stack above it can be pretty much arbitrary and I can’t account for that. So I’m basically doing a regular function call, but wrapped in a thread. I also do this because otherwise it could produce arbitrary stack traces.
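A plain-Lua (5.2+) sketch of the “reset the call stack” trick: a traceback taken on a brand-new thread only shows that thread’s frames, not the caller’s. captureStack, runOnFreshStack and outerCaller are hypothetical names used for illustration.

```lua
-- Capture a traceback of whatever stack this function is running on.
local function captureStack()
	local trace = debug.traceback("stack", 1)
	return trace
end

-- Hypothetical helper: run fn(...) on a brand-new thread so its stack starts
-- empty instead of inheriting the caller's frames.
local function runOnFreshStack(fn, ...)
	local thread = coroutine.create(fn)
	local results = table.pack(coroutine.resume(thread, ...))
	return table.unpack(results, 1, results.n)
end

local function outerCaller()
	local direct = captureStack()                     -- includes outerCaller's frame
	local ok, isolated = runOnFreshStack(captureStack) -- does not
	return direct, ok, isolated
end

local direct, ok, isolated = outerCaller()
```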

Allow me to horrify you with a snippet from how I am currently doing error handling:

-- Spawn a sub thread for running the callback
local subThread = coroutine.create(function(...)
	return (function(success, ...)
		--wait()
		-- Save the results
		local results = table.pack(...)
		if parentThread then
			-- Check if the parent thread is waiting for our results and resume it if it is
			if coroutine.status(parentThread) == "suspended" then
				coroutine.resume(parentThread, success, results)
				return
			end
		end
		-- Return the results
		return success, results
	end)(xpcall(callback, function(err, ...)
		local success, msg = pcall(debug.traceback, err, 3)

		if success then
			return msg
		else
			return debug.traceback("Error occurred, no output from Lua.", 2)
		end
	end, ...))
end)

-- Resume the sub thread and pass arguments
local resumed, success, results = coroutine.resume(subThread, table.unpack(args))

-- If the thread failed to be ran
if not resumed then
	results = {success}
	success = resumed
end

-- If the sub thread isn't dead we want to wait for it to complete
if coroutine.status(subThread) ~= "dead" then
	-- Wait for results
	success, results = coroutine.yield()
end

This accounts for yields by running the function in the body of the thread: if the thread doesn’t return results immediately, the parent yields, and the thread resumes the parent once it finishes. When the sandboxed code yields, it basically looks like this:

  1. Sandboxed code is spawned
  2. Sandboxed code yields, runner thread yields
  3. Sandboxed function eventually exits, spawned thread resumes runner

It looks like this when the sandboxed code doesn’t yield or errors immediately:

  1. Sandboxed code is spawned
  2. Runner thread doesn’t yield because the spawned thread has already finished and returned its results
  3. Sandboxed function has already exited

When the sandboxed function errors, the error is caught by the xpcall along with the stack trace (the trace isn’t handled inside an extra function body; it’s simply what xpcall returns as its second result).
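For anyone who wants to run the idea, here is a simplified plain-Lua rewrite of the runner pattern (Lua 5.2+, where xpcall forwards extra arguments and can be yielded across). It is not the actual module: runSandboxed, parked and outcomes are hypothetical, parentThread comes from coroutine.running(), and a manual coroutine.resume stands in for the engine waking the sandboxed code.

```lua
-- runSandboxed must be called from inside a coroutine (the "runner"),
-- because it yields that coroutine when the callback doesn't finish at once.
local function runSandboxed(callback, ...)
	local parentThread = coroutine.running()

	local subThread = coroutine.create(function(...)
		-- xpcall captures the traceback at the point of the error.
		local packed = table.pack(xpcall(callback, function(err)
			return debug.traceback(tostring(err), 2)
		end, ...))
		local success = packed[1]
		local results = table.pack(table.unpack(packed, 2, packed.n))

		-- If the runner parked itself waiting for us, hand it the results.
		if coroutine.status(parentThread) == "suspended" then
			coroutine.resume(parentThread, success, results)
			return
		end
		-- Otherwise the runner is still inside its first resume call.
		return success, results
	end)

	local resumed, success, results = coroutine.resume(subThread, ...)
	if not resumed then
		-- The sub-thread failed outside the xpcall; success holds the message.
		return false, { success }
	end

	if coroutine.status(subThread) ~= "dead" then
		-- The callback yielded: wait until the sub-thread resumes the runner.
		success, results = coroutine.yield()
	end
	return success, results
end

local parked -- stands in for the engine's list of sleeping threads
local outcomes = {}

local runner = coroutine.create(function()
	-- Finishes immediately: no yield needed.
	outcomes.immediate = table.pack(runSandboxed(function(a, b) return a + b end, 2, 3))
	-- Errors immediately: the traceback comes back in the results table.
	outcomes.failed = table.pack(runSandboxed(function() error("boom") end))
	-- Yields: the runner parks until the "engine" wakes the sandboxed code.
	outcomes.yielded = table.pack(runSandboxed(function(x)
		parked = coroutine.running()
		local extra = coroutine.yield()
		return x * extra
	end, 10))
end)

coroutine.resume(runner)    -- runs until the third callback yields
coroutine.resume(parked, 4) -- simulated engine wake-up; the runner then finishes
```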

To horrify you even further, that’s in a module script, and that snippet actually runs in another coroutine, which runs in the module’s main body and gets resumed by a function returned from the module. You might be wondering why on Earth I’d do that; the reason is that if I ran it in the function returned by the module, the sandboxed code would still have a way to escape via getfenv(0).

The most horrifying part is that none of my code is hacky if you look at any one part of it in isolation; it all uses features and patterns exactly as they were intended.