PSA: Please don't rely on the format of `debug.traceback` results

zeuxcg · May 13, 2019, 8:55pm

During testing of the early version of our new Lua VM, we found a few games that relied on the precise format returned by the debug.traceback function. This is a PSA asking developers to change any code that relies on it: be aware that we reserve the right to change the format without notice.

The changes are partly a cleanup of the output format to match stock Lua more closely, and partly due to fundamental differences in how the new VM works and what it can support.

Here’s the debug.traceback output for a semi-complex example:

function foo()
    print(debug.traceback())
end

local function bar()
    foo()
end

local Moo = {}
Moo.baz = function()
    bar()
end

function Moo:test()
    self:baz()
end

Moo:test()

Lua 5.1 prints this:

stack traceback:
        test.lua:2: in function 'foo'
        test.lua:6: in function 'bar'
        test.lua:11: in function 'baz'
        test.lua:15: in function 'test'
        test.lua:18: in main chunk

Roblox Lua prints this today:

Stack Begin
Script 'Workspace.Script', Line 2 - global foo
Script 'Workspace.Script', Line 6 - upvalue bar
Script 'Workspace.Script', Line 11 - method baz
Script 'Workspace.Script', Line 15 - method test
Script 'Workspace.Script', Line 18
Stack End

Our work in progress VM currently prints this (this may change!):

Workspace.Script:2 function foo
Workspace.Script:7 function bar
Workspace.Script:12
Workspace.Script:16 function test
Workspace.Script:18

… you get the idea. We have found games that looked at the output of debug.traceback and, for example, expected it to always start with “Stack Begin”, or to have methods annotated with “method” instead of a generic “function”. If you have code that does this today, please change it - debug.traceback should be used for debug diagnostics and error analytics exclusively.
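A minimal sketch of the usage recommended above, written as stock Lua 5.1-compatible code: the traceback is captured as an opaque string and only stored or printed, never matched against. The names `errorLog`, `recordError`, `runLogged` and `printErrorLog` are hypothetical, not part of any Roblox API.

```lua
-- Treat debug.traceback() output as an opaque, human-readable string:
-- capture it, store it, show it. Never match on its contents.
local errorLog = {}

-- Message handler for xpcall: prepends the error message to a traceback.
-- Whatever format the VM produces after that is kept verbatim.
local function recordError(err)
  local trace = debug.traceback(tostring(err), 2)
  errorLog[#errorLog + 1] = trace
  return trace
end

-- Runs fn in protected mode; on failure the trace is logged and returned.
local function runLogged(fn)
  return xpcall(fn, recordError)
end

-- Displays the stored traces as-is, e.g. for a debug console.
local function printErrorLog()
  for i, trace in ipairs(errorLog) do
    print(("error %d:\n%s"):format(i, trace))
  end
end
```

Nothing here depends on a "Stack Begin" header or on the word "method" appearing anywhere, so a change in the traceback format only changes what a human reads.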

We are also updating the DevHub documentation, which unfortunately treated the output format as a contract, to note that the format isn’t stable.

ThatTimothy · May 13, 2019, 9:23pm

This makes sense. Glad to see that the new Lua VM is coming along. Hopefully, the error messages will be nicer.

Also, it’s nice to see a public PSA about stuff like this without it being directly changed.

posatta · May 13, 2019, 9:27pm

Out of curiosity, why are the line numbers different in the new VM?

zeuxcg · May 13, 2019, 9:30pm

Work in progress. There’s sort of a good reason for this, but we will fix it.

H_mzah · May 13, 2019, 10:02pm

Thanks for the warning. My scripts would’ve errored, since I was planning to rely on its current layout. I’ll hold off on extracting specific info from it and just print the whole thing.

Excited to see what the future has in store.

Noble_Draconian · May 13, 2019, 10:55pm

Semi-unrelated, but are there any plans to support accurate debug tracebacks with errors that occur inside spawned/coroutine’d threads and pcall()'d threads?

Corecii · May 13, 2019, 11:13pm

The only code I use that relies on stack traces does so for a similar reason: getting a traceback from a protected call (pcall). By spawning a new thread using a BindableEvent, then listening for an error in the output log using a unique module name, you can actually get a full traceback for a “protected” function call. Basically, a traceback pcall.

I realize this is super hacky, but it’s the only way we can have traceback + pcall + yielding right now. Using xpcall gives you tracebacks but no yielding. This method will only be affected if the script name is truncated or if modules named and cloned at runtime show inaccurate names.
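For the xpcall route mentioned above, here is a hedged sketch of a pcall-style wrapper that also returns a traceback on failure. It closes over the arguments because Lua 5.1's xpcall does not forward extra arguments to the function. `tracebackPcall` is a hypothetical name, and the sketch does nothing about the yielding limitation described above.

```lua
local unpack = table.unpack or unpack

-- Like pcall(fn, ...), but on failure the second result is the error message
-- followed by a traceback captured where the error was raised.
local function tracebackPcall(fn, ...)
  -- Lua 5.1's xpcall takes no extra arguments, so close over them,
  -- keeping the count so that nil arguments are preserved.
  local n, args = select("#", ...), { ... }
  return xpcall(function()
    return fn(unpack(args, 1, n))
  end, function(err)
    return debug.traceback(tostring(err), 2)
  end)
end
```

The returned traceback is meant to be printed or logged as a whole, in line with the original post.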

Noble_Draconian · May 13, 2019, 11:15pm

I don’t rely on the format of the error at all; it’s just a hassle to debug code when the error traceback only leads you to the pcall or thread spawn, and not to the actual line where the error occurred.

Anaminus · May 13, 2019, 11:53pm

I was about to make a request related to this. Any chance we could get traces that are more semantic? That way we don’t have to worry about the format at all.

zeuxcg · May 14, 2019, 12:15am

These two situations are somewhat different.

For pcall, we currently create a thread in the pcall implementation to be able to yield inside of it. This is a significant performance problem. There’s a desire to fix this by reimplementing pcall but no concrete plan yet (aka we know we want to do it but we don’t yet know how best to approach it).

For coroutines, they are semantically disjoint from the resuming thread in a way, so it would need to be handled differently. Can you explain the situation where you want a callstack to span several coroutines?

Maximum_ADHD · May 14, 2019, 12:54am

Sorry about that. I didn’t anticipate any changes being made to the structure of the stack trace.

Do you guys plan to make any changes to the debugger instances? They’re documented on the DevHub as well: https://developer.roblox.com/api-reference/class/ScriptDebugger

Agreed. It’d be nice to have a structured dictionary instead of having people try to parse the stack trace.

Noble_Draconian · May 14, 2019, 1:46am

With the way my code is structured, I tend to call the Start functions of special modules via coroutine.wrap(). Those Start functions in turn call other module functions inside coroutine.wrap().

My code is structured in a “Service/Controller” (modified MVC) format; that is my framework loads, initializes, and automatically starts services when the server runs. Each service handles a different aspect of the game, e.g. DataService handles the loading/saving of player data.
On the client I have something similar, called “Controllers”. Controllers behave just like “Services” do, except they run on the client instead of the server.

A specific example would be the MarketController in my game. It handles sending requests to the market service (server side) and it also loads up the market/inventory UIs.

The code in question roughly looks like this (irrelevant code was removed):

local MarketController={}

local ShopUI;
local InventoryUI;

function MarketController:Init() --This is called when my framework loads the controller
    ShopUI=require(script.ShopUI)
    setmetatable(ShopUI,{__index=MarketController})
    InventoryUI=require(script.InventoryUI)
    setmetatable(InventoryUI,{__index=MarketController})

    ShopUI:Init()
    InventoryUI:Init()
end

function MarketController:Start() --This is called when the framework has loaded all controllers
    coroutine.wrap(ShopUI.Start)(ShopUI)
    coroutine.wrap(InventoryUI.Start)(InventoryUI)
end

return MarketController

In the ShopUI and InventoryUI modules (which handle UI state and interaction for their respective UIs), I can call various methods via self, such as self:SortItems() or self:PurchaseItem() (a MarketController method that is exposed to the shop UI module via __index).

If there is an error inside of the ShopUI or InventoryUI modules, the stack trace isn’t accurate/only traces to the coroutine.wrap(). This makes debugging embedded modules a hassle, as a lot of the systems in my game use this method.
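One way to keep the failing coroutine's own stack, sketched in stock Lua (where debug.traceback accepts a thread as its first argument): resume the coroutine manually and, if it dies, build the traceback from the dead coroutine rather than from the caller. `wrapTraced` is a hypothetical drop-in for coroutine.wrap, not a Roblox API, and it assumes the VM leaves a dead coroutine's stack inspectable, as stock Lua does.

```lua
local unpack = table.unpack or unpack

local function pack(...)
  return { n = select("#", ...), ... }
end

-- Hypothetical drop-in for coroutine.wrap: if the coroutine dies with an error,
-- the error is re-raised with a traceback of the coroutine's own stack,
-- instead of one that ends at the point where it was resumed.
local function wrapTraced(fn)
  local co = coroutine.create(fn)
  return function(...)
    local results = pack(coroutine.resume(co, ...))
    if not results[1] then
      -- The dead coroutine's stack can still be walked here.
      error(debug.traceback(co, tostring(results[2])), 2)
    end
    return unpack(results, 2, results.n)
  end
end
```

In the MarketController example this would mean calling wrapTraced(ShopUI.Start)(ShopUI) instead of coroutine.wrap(ShopUI.Start)(ShopUI). The combined message is still only meant for a human to read.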

IdiomicLanguage · May 14, 2019, 2:11am

Since we shouldn’t rely on the format of stack traces, could we receive an actual interface to detect errors and handle them game wide? I’ve personally used the LogService.MessageOut and ScriptContext.Error events in a live game to detect errors which would then be parsed and stored in a database and sent to me via text. It would be wonderful if there was a better method to be on the lookout for errors and record them.

zeuxcg · May 14, 2019, 3:18am

Assuming you’re asking for an API to give a structured callstack representation, I’m not sure we should make one. It’s easy to do, but it seems like a trap.

A callstack is an array of call frames, where each frame is currently identified by a script, a line number and a function name. However:

Callstack entries can arbitrarily disappear and reappear due to inlining and changes in inlining heuristics
Callstack entries can arbitrarily disappear due to tail calls (we don’t have guaranteed tail calls but may introduce restricted tail calls for optimization in the future)
Line numbers can arbitrarily change due to changes in code generation (for example, in a multiline function call which line to associate with the call itself is ambiguous)
Function names can arbitrarily change due to changes in compiler (for example, which name to assign to function Foo.Bar:Baz() is unclear)
Function names can arbitrarily disappear and reappear due to changes in naming heuristics when names aren’t specified (see Moo.baz example from the original post)

Effectively, we can make an API that produces a callstack, but every single bit of information returned by this API will be fragile. At which point you’re probably better off not having an API in the first place.

zeuxcg · May 14, 2019, 3:19am

ScriptContext.Error is still the recommended way to detect errors in a live game (and log them via a third-party analytics service). We do need better first-class support for this on the platform level, but that’s not directly tied to the format and mechanics of error generation.
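A sketch of the receiving side of that kind of logging, where stack traces are grouped and sampled but never parsed. `ErrorReporter` and its methods are hypothetical. In Roblox they could presumably be fed from ScriptContext.Error, whose handler receives the message, the stack trace string and the script; that wiring is left out so the sketch runs in stock Lua.

```lua
-- Hypothetical error aggregator: counts errors per (source, message) pair and
-- keeps a few sample stack traces verbatim. Traces are stored, never parsed.
local ErrorReporter = {}
ErrorReporter.__index = ErrorReporter

local function keyFor(message, source)
  return tostring(source) .. "|" .. tostring(message)
end

function ErrorReporter.new(maxSamples)
  return setmetatable({
    counts = {},
    samples = {},
    maxSamples = maxSamples or 1,
  }, ErrorReporter)
end

function ErrorReporter:record(message, stackTrace, source)
  local key = keyFor(message, source)
  self.counts[key] = (self.counts[key] or 0) + 1
  local samples = self.samples[key]
  if not samples then
    samples = {}
    self.samples[key] = samples
  end
  -- Keep only the first few traces for each kind of error.
  if #samples < self.maxSamples then
    samples[#samples + 1] = stackTrace
  end
end

function ErrorReporter:count(message, source)
  return self.counts[keyFor(message, source)] or 0
end

function ErrorReporter:samplesFor(message, source)
  return self.samples[keyFor(message, source)] or {}
end
```

Grouping on the message and source rather than on the trace keeps the counts meaningful even if the traceback format changes.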

Anaminus · May 14, 2019, 4:25am

It seems like all of those problems would also occur with the current string-based traceback.

As pointed out by @Maximum_ADHD, we already have a way to get a full featured callstack. It’s limited to the studio debugger, and rightfully so, being chunky and expensive to lug around as most debugging stuff is.

On the other hand, all I’m looking for is a table with some fields containing the same information already present in the traceback string:

Source (as a LuaSourceContainer if possible, just the string otherwise)
Line number
Variable name/type (if available)

If the information can be put into a string, then surely it can be put into a table.
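For comparison, stock Lua already exposes roughly this information through debug.getinfo. This sketch assumes that function is not available in Roblox Lua, so it only illustrates the shape being asked for, and every field is subject to the caveats zeuxcg lists above. `captureFrames` is a hypothetical helper.

```lua
-- Hypothetical helper: walks the stack with stock Lua's debug.getinfo and
-- returns one table per frame. Field contents are best-effort, not a contract.
local function captureFrames(startLevel)
  local frames = {}
  -- Level 1 is captureFrames itself; start one level above the requested one.
  local level = (startLevel or 1) + 1
  while true do
    local info = debug.getinfo(level, "Sln")
    if not info then
      break
    end
    frames[#frames + 1] = {
      source = info.short_src, -- chunk name, shortened for display
      line = info.currentline, -- -1 when unknown (e.g. C functions)
      name = info.name,        -- nil when the VM can't infer a name
    }
    level = level + 1
  end
  return frames
end
```

Note that `name` is exactly the kind of field that can come back nil or different, as with the Moo.baz case in the original post.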

I’m trying out an errors-are-values approach, where a function returns the error rather than throwing it. Usually, this doesn’t require anything more than a single stack frame, if that. The fewer the frames, the cheaper it is to create errors. If needed, the error can be wrapped in another error one level up, containing the next frame, and so on.

My theoretical error-creating function might look like this:

function NewError(includeStackFrame, ...)
	local err = {
		message = pack(...),
	}
	if includeStackFrame then
		err.frame = getStackFrame(2) -- frame of caller
	end
	return err
end
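A runnable variant of this theoretical function in stock Lua, with debug.getinfo standing in for the hypothetical getStackFrame and a small local `pack` helper standing in for pack (table.pack only exists from Lua 5.2 on). Both substitutions are assumptions of this sketch.

```lua
-- Stand-in for pack(...): keeps the argument count so nil values survive.
local function pack(...)
  return { n = select("#", ...), ... }
end

local function newError(includeStackFrame, ...)
  local err = {
    message = pack(...),
  }
  if includeStackFrame then
    -- Level 1 is newError itself, level 2 is its caller.
    -- debug.getinfo (stock Lua) stands in for the hypothetical getStackFrame.
    local info = debug.getinfo(2, "Sl")
    if info then
      err.frame = { source = info.short_src, line = info.currentline }
    end
  end
  return err
end
```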

Currently I use debug.traceback to get the full trace, parse out just the first frame, and attempt to locate the referenced script. All assuming the arbitrary script name isn’t trying to sabotage the parser.

zeuxcg · May 14, 2019, 5:26am

Correct, but importantly, with a string it’s more obvious that there’s no promise of stability. Exposing a “nicer” API doesn’t seem valuable if the API can’t be relied upon.

Currently I use debug.traceback to get the full trace, parse out just the first frame, and attempt to locate the referenced script.

Why parse the first frame out instead of keeping the entire trace around? FWIW debug.traceback is substantially faster in the new VM.

IdiomicLanguage · May 14, 2019, 6:25am

This seems to be fundamentally backward to me. The structured / parsed data should be the original object and if needed to be displayed in a human-readable format then it can be easily stringified in any desired format.

The main advantage to having access to the structured (parsed) error data is that it can be manipulated without human intervention. Error counts and statistics for files and functions can be generated which would be helpful even if not always complete. The data may be useful for game-wide error handling as well.

Anaminus · May 14, 2019, 6:27am

If stack traces aren’t stable or reliable, then there’s really no point in having them at all. In fact, exposing them in any way whatsoever would do more harm than help.

Mainly because it adds noise. Since I have errors being returned rather than thrown, they must be handled all the way down the stack. Consider this example:

function Add(a, b)
	if type(a) ~= "number" or type(b) ~= "number" then
		return 0, WrapError(nil, "value must be a number")
	end
	return a + b, nil
end

function AddEach(...)
	local a = 0
	for i, b in ipairs({...}) do
		local c, err = Add(a, b)
		if err ~= nil then
			return 0, WrapError(err, "bad argument #" .. i)
		end
		a = c
	end
	return a
end

local total, err = AddEach(7, 17, "37", 47)
if err ~= nil then
	print("ERROR:", err)
	return
end
print("RESULT:", total)

The WrapError function wraps an error around another error. Code can then unwrap or inspect the error and decide what to do (usually it’s propagating the error). When the error is converted to a string, it can step through the chain of wrapped errors to construct a readable result:

bad argument #3: value must be a number
Stack trace:
	script:13: function AddEach
	script:3: function Add

Such errors could also be constructed to exclude the stack frame entirely. This might be useful for patterns like Promises, which otherwise produce bloated stack traces that are hard to read.
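WrapError itself isn’t shown in the post. A minimal sketch of what it might look like in stock Lua, assuming debug.getinfo is available to capture the frame: each error records its message, the inner error it wraps and a single frame, and `__tostring` walks the chain to produce output in the format shown above. This is a hypothetical implementation, not the poster’s.

```lua
-- Hypothetical WrapError: each error holds a message, the error it wraps
-- (or nil), and the single stack frame of whoever created it.
local ErrorMeta = {}

function ErrorMeta.__tostring(err)
  local messages, frames = {}, {}
  local e = err
  while e do
    messages[#messages + 1] = e.message
    if e.frame then
      frames[#frames + 1] = string.format("\t%s:%d: function %s",
        e.frame.source, e.frame.line, e.frame.name or "?")
    end
    e = e.inner
  end
  -- Outermost message first, e.g. "bad argument #3: value must be a number".
  return table.concat(messages, ": ") .. "\nStack trace:\n" .. table.concat(frames, "\n")
end

local function WrapError(inner, message)
  -- Level 2 is the function that called WrapError. A tail call such as
  -- `return WrapError(...)` would lose that frame in stock Lua; the
  -- `return 0, WrapError(...)` form used in the post is not a tail call.
  local info = debug.getinfo(2, "Sln")
  return setmetatable({
    inner = inner,
    message = message,
    frame = info and { source = info.short_src, line = info.currentline, name = info.name } or nil,
  }, ErrorMeta)
end
```

Creating an error without a frame (for the Promise case) would just mean skipping the debug.getinfo call.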

zeuxcg · May 14, 2019, 6:33am

The point of stack traces is to convey information about the location and circumstances of the error to a human. They are not stable in that you can’t rely on them having specific properties that persist indefinitely - for example, given a hypothetical structured API, debug.getstackframe(1).function == "foo" may or may not return true depending on various factors mentioned above in the thread. You should use them to be able to capture stack information and later display it or log it.

For example, a very valid use of debug.traceback is to build generic constructs, like promises, that can record the error location and provide debugging utilities in other cases; see lib/init.lua in LPGhatguy/roblox-lua-promise on GitHub for an example. Note that in all uses of the backtrace, the result is saved to be fed to a formatting/printing function later.
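A stripped-down illustration of that pattern, not taken from the roblox-lua-promise code: a hypothetical `Deferred` object captures the traceback at the moment it is rejected, while the relevant stack still exists, and only hands it to a formatter later.

```lua
-- Hypothetical, stripped-down deferred value: on rejection it records a
-- traceback immediately and only formats it when a description is requested.
local Deferred = {}
Deferred.__index = Deferred

function Deferred.new()
  return setmetatable({ status = "pending" }, Deferred)
end

function Deferred:resolve(value)
  if self.status ~= "pending" then return end
  self.status = "resolved"
  self.value = value
end

function Deferred:reject(reason)
  if self.status ~= "pending" then return end
  self.status = "rejected"
  self.reason = reason
  -- Stored as an opaque string; nothing below ever looks inside it.
  self.trace = debug.traceback(tostring(reason), 2)
end

function Deferred:describe()
  if self.status == "rejected" then
    return "rejected: " .. self.trace
  end
  return self.status
end
```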