Luau Interpreter: Add locking primitives to the task library

Problem

As a Roblox developer, it is currently too hard to guarantee exclusive access to a resource. Because Luau is an interpreted language, there is no way to protect critical sections of code from interference when another thread needs the same resource, since Lua-level access controls are not atomic.

Use Cases

Consider the following code:

local totalSize = 1000000
local blockSize = 10000
local blockCount = math.floor(totalSize / blockSize)
local nextBlock = 0
local mutex = false
local threadCount = 4 -- Number of worker "threads" to spawn (any value works here).

-- Acquires the lock for exclusive access.
local function lock()
	while mutex == true do
		task.wait()
	end
	mutex = true
end

-- Releases the lock.
local function unlock()
	mutex = false
end

-- Performs the required task, calculating
-- the start and end block positions based
-- on the next block of items to process.
local function doTask()
	while nextBlock < blockCount do
		lock()
		-- Start Critical Section
		local start = nextBlock * blockSize
		nextBlock += 1
		unlock()
		-- End Critical Section

		-- Calculate the end position (inclusive) for this block.
		local finish = start + blockSize - 1

		-- Perform Task
		for i = start, finish, 1 do
			-- Perform your task here.
		end

		-- Breaks the execution up to prevent
		-- script timeout errors.
		task.wait()
	end
end


for i = 1, threadCount, 1 do
	task.spawn(function()
		doTask()

		-- Introduces a slight delay to help prevent a race
		-- condition and to reduce lock contention.
		task.wait()
	end)
end

Within the function doTask(), the critical section uses the value of nextBlock twice. It is expected that the value does not change between accesses, which is why I have it wrapped with lock() and unlock(). However, even with that, exclusivity is not guaranteed because the operation is not atomic. Because of this, the value of nextBlock can be overwritten between statements, leading to a race condition that is difficult to diagnose. The only true way to guarantee atomicity is to use the following CPU instruction (Intel/AMD CPUs):

LOCK CMPXCHG [Address], Register

The above CPU instruction is fully atomic even in multi-threading/multi-processor environments because of the LOCK prefix. As such, I suspect the Lua interpreter controls access to shared resources for a single Lua statement, but if multiple statements are required to access or update a value, then locks are needed in Lua code to designate critical sections for exclusive access while the Lua instruction pointer is inside the critical section.

Conclusion

If Roblox addresses this issue, it would make it easier to designate a block of Lua code as a critical section so exclusive access can be guaranteed. This would help eliminate a class of multi-threading issues, such as race conditions, which can be very difficult to pin down.

On a personal note, all I need is access to spin lock primitives. From there, I can build semaphores, mutex locks, and read/write locks.
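To illustrate what I mean, here is a minimal sketch of the kind of primitive I am asking for, written in plain Luau. The helper name testAndSet is hypothetical, not an existing API; it only works reliably today because nothing between the read and the write can yield under cooperative scheduling:

```lua
-- Hypothetical test-and-set helper: reads the flag and sets it to
-- true in one step, returning the previous value. Safe under
-- cooperative scheduling because there is no yield between the
-- read and the write.
local function testAndSet(state)
	local old = state.flag
	state.flag = true
	return old
end

-- A spin lock built on top of it, in the style of lock() above.
local function spinLock(state)
	while testAndSet(state) do
		task.wait() -- yield so other "threads" can run and unlock
	end
end

local function spinUnlock(state)
	state.flag = false
end
```

Semaphores and read/write locks would then be thin layers over these two functions.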

4 Likes

You seem to be misunderstanding how code is executed on Roblox. Luau code is still completely single-threaded (except when intentionally parallelized using Actors which is not what you are talking about here).

The coroutines / fibers / “threads” we get on Roblox come from being able to execute other pieces of code on the same CPU thread while another piece of code is yielding. For example, inserting a task.wait() in a block of code allows other code to run while the original code is waiting. Roblox Luau specifically uses a cooperative scheduling approach, which is worth reading up on if you want to learn how this works conceptually.

As a result, we do not need to care about locking and multithreading at the instruction level. We can always guarantee everything will be executed safely as long as the code does not yield (or go over the script execution timeout which is ~10 seconds iirc).

Locking is occasionally useful at higher levels of abstraction, but it’s not very difficult to implement once you understand how Luau’s scheduling works: store a queue of yielded “threads” waiting to be executed (captured with coroutine.running() before yielding), then resume them one after another from an “executor” “thread” using coroutine.resume()/task.spawn(). Do not implement this recursively, as larger queues will hit C stack limits.
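The queue-based approach described above can be sketched as a minimal mutex. This is a sketch assuming the Roblox task library, not production code:

```lua
local Mutex = {}
Mutex.__index = Mutex

function Mutex.new()
	return setmetatable({ locked = false, waiters = {} }, Mutex)
end

function Mutex:lock()
	if not self.locked then
		self.locked = true
		return
	end
	-- Park the current "thread" until unlock() hands us the lock.
	table.insert(self.waiters, coroutine.running())
	coroutine.yield()
end

function Mutex:unlock()
	-- Resume waiters iteratively, never recursively, so large
	-- queues do not hit C stack limits.
	local nextThread = table.remove(self.waiters, 1)
	if nextThread then
		task.spawn(nextThread) -- ownership passes directly to the waiter
	else
		self.locked = false
	end
end
```

Note that unlock() hands the lock directly to the next waiter instead of clearing the flag, which avoids starvation from a newcomer grabbing the lock before a parked thread resumes.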

If you want to get a better understanding of how code executes on Roblox, I recommend you observe how it behaves in the microprofiler. e.g. create a script with some code that executes every frame, using debug.profilebegin() and debug.profileend() to get readable labels to identify your code, then take a microprofiler dump and look at how the code executes.

To demonstrate, here are two cases:

Case 1: Completely safe, code is executed in sequence. (Result is 20000, which is correct.)

local counter = 0
task.spawn(function()
   for _ = 1, 10000 do
      counter += 1
   end
end)
task.spawn(function()
   for _ = 1, 10000 do
      counter += 1
   end
end)

Case 2: Not safe, as the temp variable may be out of date. (Result is 60, which is half of what we should get.)

local counter = 0
task.spawn(function()
   for _ = 1, 60 do
      local temp = counter
      task.wait()
      counter = temp + 1
   end
end)
task.spawn(function()
   for _ = 1, 60 do
      local temp = counter
      task.wait()
      counter = temp + 1
   end
end)
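Case 2 can be made safe simply by keeping the read-modify-write together and yielding outside it, for example:

```lua
local counter = 0
task.spawn(function()
	for _ = 1, 60 do
		-- No yield between the read and the write, so the
		-- increment cannot be interleaved with the other loop.
		counter += 1
		task.wait()
	end
end)
task.spawn(function()
	for _ = 1, 60 do
		counter += 1
		task.wait()
	end
end)
-- After both loops finish, counter is 120 as expected.
```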
5 Likes

Lua’s single-threaded cooperative multitasking model on Roblox does allow many operations to be “atomic” by ensuring they don’t yield, but I’m not completely convinced that OP’s point about wanting lower-level primitives is moot. There are definitely more complex scenarios where having robust, low-level primitives available would make higher-level synchronization mechanisms easier to implement and, ultimately, more reliable.

A couple examples that may warrant an advanced synchronization mechanism off the top of my head:

Resource pooling – Managing a shared pool of resources like connections, game objects, or other reusable entities requires resources to be checked out, modified, and returned to the pool atomically.

Caching Mechanisms – Implementing a cache needs to handle scenarios where multiple routines may try to populate or invalidate a cache entry simultaneously.

Task Queues – Tasks that need to be processed in a certain order or have dependencies on each other require enforcing that order or those dependencies, more-so when tasks are spawned in different routines.

Ordered Operations – A sequence of tasks that need to happen in a particular order across multiple different routines. Each task might be non-atomic by itself but, needs to happen in a particular sequence with other tasks.

If you’re dealing with simple, non-yielding operations, the existing model is probably fine. If you’re tackling more complex, stateful, and interactive systems, advanced synchronization mechanisms make more sense.

2 Likes

Yeah, I currently work in a codebase that deals with many of those use cases. There are a variety of ways to implement each of those specific things which do not require additional features, and each specific scenario tends to have specific requirements. And if you don’t want to do something yourself, you could just use a library that someone made. So I’m not sure I would get much value out of having built-in versions of them.

2 Likes

So it executes like the old single-core non-preemptive multitasking then? Perhaps I am reading too much into this, but from my background, which is operating systems, thread safety is a key requirement when you write software on the other side of the system call tree and have to initiate a 0x88 interrupt call gate to switch from ring 3 (user space) to ring 0 (kernel space). I write kernel-level code, so these are issues I’m keenly aware of.

Now the thing about actors is interesting.

2 Likes

One thing your list of use-case examples is missing, and one that I’m using (I guess resource pooling might qualify), is the worker thread model (also known as thread pooling). I have a mechanism within a primary module that takes all network requests and adds them to a queue. The worker threads then extract from the pool to work on each request.

For this to work, each worker thread has to lock the queue so it can safely remove the top item.
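As a sketch (assuming Roblox’s task library, with the worker count chosen arbitrarily): under cooperative scheduling the lock can actually be dropped, because a table.remove on a path that never yields cannot be interleaved with another worker:

```lua
local queue = {}

-- Producer side: network requests get appended to the queue.
local function enqueue(request)
	table.insert(queue, request)
end

local function worker()
	while true do
		-- Dequeue without yielding: no other "thread" can run
		-- between this remove and the check below.
		local request = table.remove(queue, 1)
		if request then
			-- Handle the request here.
		end
		task.wait() -- yield so producers and other workers can run
	end
end

for _ = 1, 4 do
	task.spawn(worker)
end
```

If the handler itself yields, the queue stays consistent because each item was already removed before the yield.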

2 Likes

Yep that’s right. You have definitely been thinking too much into this haha. One of the big benefits of Lua is that it is a very simple language. And that is partially because Lua does not do multithreading at all (except via external APIs such as Roblox’s Actors).

As for Actors, I would hazard a guess and say that, based on the API and its limitations, multithreading is supported basically by spawning a bunch of independent Luau VMs and having them communicate with serialized messages (instead of directly sharing the memory of tables, strings, etc.), i.e. each Actor is its own VM with its memory completely isolated from all other VMs. I dunno how accurate this is, but that’s my hunch anyway.

1 Like

This is a fair point, but libraries often become deprecated, go unmaintained, or carry too much bloat. Built-in versions would be highly optimized and provide a consistent baseline to build your approach on. That matters a lot in production and upkeep.

1 Like

Yes. The Luau (Roblox) environment is reminiscent of old single-core non-preemptive multitasking systems, which can feel like a significant downgrade. In traditional systems with preemptive multitasking, you have to ensure thread safety at a deeper, more granular level.

This is largely correct in that transitioning from user space to kernel space involves intricate operations (e.g. a 0x88 interrupt call gate). While these operations are powerful and come with a great deal of control, I don’t think it’s wise to dismiss the fact that you’d be adding complexity to a platform that deliberately simplifies things. Context matters here.

The biggest roadblock would be that Luau’s cooperative multitasking does a great job at simplifying these concerns because the engine serializes execution paths that yield. This naturally eliminates the need for low level locking mechanisms to ensure thread safety but, you’re correct in that it introduces new challenges around when and how tasks yield control.

I would consider this a decent alternative because the actor model inherently avoids issues related to shared state by encapsulating both state and behavior within actors. Each actor processes messages from its own queue one at a time, so each actor’s context is close to a single-threaded environment. This minimizes the need for locks or other low-level synchronization mechanisms.
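For reference, here is a minimal sketch of that message-per-actor pattern using Roblox’s Actor messaging API. The topic name "DoWork" and the worker structure are illustrative; check the current API reference for exact availability:

```lua
-- Inside a Script parented to an Actor:
local actor = script:GetActor()

-- Each actor processes messages from its own queue one at a time,
-- so no locks are needed around this handler's state.
local processed = 0
actor:BindToMessage("DoWork", function(payload)
	processed += 1
	-- Work on payload here; state is private to this actor's VM.
end)

-- From elsewhere (e.g. another script), hand work to the actor:
-- someActor:SendMessage("DoWork", payload)
```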

Ultimately, as with everything, it’s another tool in the toolbox that offers its fair share of trade-offs, but it can simplify some complexities inherent in both traditional and cooperative multitasking environments. That being said, there are use cases that would warrant a legitimate need for these mechanisms, but it’s unlikely the use case is widespread enough to justify them.

Sidenote: I work in embedded systems. Lots of low-level, OS-related stuff. Rare to see someone else around here who shares similar experiences, given that this is about as high-level as it gets, other than Python (lol).

2 Likes

I avoid Python like the plague. My go-to language is C, C++, or ASM. However, when it comes to web, I am partial to PHP. I’ve written a couple of implementations of the pthreads library. My specialty is kernel internals. You start a new kernel with the timer interrupt and build from there. Once you have a proc struct with job control and a scheduler, it’s on to implementing the various system calls and device drivers.

But memory handling, VM paging, and management of the cache and TLB are critical for performance.

I have done some work on microcontrollers and digital signal processors. Bare metal programming with no RTOS in most cases. It’s basically libraries built on libraries to build abstractions to simplify the functional logic.

2 Likes

Python is inescapable. It’s great for prototyping and automation. Not as good as using a shell script though. C is my go-to. Rust is not bad. I like that I don’t have to implement things from scratch and can avoid cmake. Cargo is quite nice too.

Bare metal is the best but extremely challenging. It’s straight to the point. No fiddling around with libraries or higher-level semantics. Just hardware and code. RTOS environments can get annoying, but a stripped-down GPOS (usually some flavor of Linux) helps. Otherwise, I’m usually working on serial communication protocols, so UART, SPI, I2C. Sometimes TCP. I want to get more into the USB or ADC/DAC side of things.

1 Like

Eh… bare metal isn’t all that challenging. But then again, I’m used to working in that environment. However, when doing bare metal, the software is specific to the particular device you are using. That’s why I like to build micro-device drivers first: build a driver catalogue of all the hardware on the device, then work on the functional logic. That way, if you need to migrate to a different device in the same family, you just change a few constants.

What devices do you like to use? I’m partial to Atmel chips, but I have been known to use Dallas Semiconductor, Harris, and National Semiconductor parts. PIC is a joke. I remember when Microchip tried to buy Atmel. An engineer said something like the following:

So you're going to take a superb device, make it subpar with less hardware and memory, charge more for it, and tell us it's good for us because of your prominence in the industry? I think not.

I remember falling out of my chair I was laughing so hard at that. The problem is, it’s true.

I’ve also had the opportunity to write software for engine management systems used in vehicles (also known as the Powertrain Control Module). That’s about as embedded as you can get. Lots of crazy stuff goes on in there, and there’s the CAN bus as well. It’s very interesting because you have different strategies for different operating conditions. When you start the engine from cold, there’s a strategy for that (open loop mode). Then when the engine warms up a little (between 90 and 120 seconds), you switch to a different strategy for closed loop mode, where it starts taking inputs from all the sensors and calculating the fuel load. For distributorless ignition systems, you also have to have an ISR for the crankshaft position sensor. In some systems, that signal just resets a timer to 0 and starts counting up again. Lots of timers in that environment. Ignition timing, spark advance, when to open the fuel injectors, how long to keep them open, etc… Fun stuff.

2 Likes

The language that Roblox uses deviates enough from Lua that it’s a fork (Luau), and there are docs at https://luau-lang.org/. You could also read the base Lua docs. Consider giving those a read; they have lots of interesting content.

Also, one very big difference is that Lua (and Luau by extension) is very high level and runs in a sandbox. It’s very far away from bare metal. For a fairer comparison, JavaScript also runs in a sandbox and is high level. You should try Luau stand-alone sometime.