CPU caching, and why buffers & native codegen are awesome

This is a complex post. It's not meant for beginners: it describes and explains multiple hardware-level & very low-level software concepts, and how they relate & apply to Roblox.

Goals

By the end of this post, you should be able to understand:

  • What the concept of “memory locality” is, and why it’s beneficial for performance
  • Why making your intensive code use less memory (for the most part) directly results in a speed increase
  • Why buffers & native codegen are the best thing to happen to Roblox

so let’s get started


starting off: what exactly even is RAM, and how does it relate to memory?

To really understand these low-level concepts, you need to be familiar with what RAM actually is. RAM stands for Random Access Memory, but that's a bad name for it. A better way to think about it would be "Quick Access Memory" (because it's much faster than reading from your SSD/HDD). You can think of RAM like a massive array where each element is one byte, and each element's index is its address, conventionally written in hexadecimal.
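If it helps, here's a tiny sketch of that mental model using a Luau buffer (buffers get covered in depth later in this post). The offset plays the role of the address, and each slot holds one byte; the size and address here are made up purely for illustration:

```lua
-- mental model only: "RAM" as one big byte array, where the offset is the
-- address (conventionally written in hex) and every slot holds one byte
local ram = buffer.create(16)

buffer.writeu8(ram, 0x0A, 255) -- store the byte 255 at "address" 0x0A
print(buffer.readu8(ram, 0x0A)) --> 255
```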

In this post, I won't be treating RAM and memory as the same thing. "Memory" will mean any piece of data, regardless of where it physically lives, that has an address and a value.

I am not sure how technically correct that terminology is, but that’s what I will be using in this post.


Accessing RAM is slooowwwww

Imagine that you are an accountant. To do your number-crunching work, you need the company's financial information. But it's stored in a warehouse far away from you, and the warehouse workers are lazy. When you ask for a box of papers from column 12 in aisle 4, it takes the slow workers a full day to drive a forklift over, grab that single box, and deliver it to you.

That is the interaction that happens when the CPU accesses memory from RAM. It is slow: a trip to main memory can cost hundreds of clock cycles on a modern chip. While CPU processing power has increased at a jaw-dropping rate, memory speeds have not kept up.


CPU caching

The CPU is smart. It can tell what you're likely to access next: when you loop over an array, it can predict what you'll need. It knows that asking RAM for memory is slow, and it knows that a fairly large amount of memory can be transferred at once. So instead of fetching only the bytes your code asked for, the CPU pulls in a whole block (a "cache line", typically 64 bytes) and stores it in what's called the "CPU cache". There are 3 *levels to this cache, with the "Level 1" (L1) cache being the smallest & fastest to use, and the L3 cache being the largest but slowest level.

*note: On modern Intel chips it's not quite that simple anymore, because they use a clever caching technique they dubbed "Smart Cache", but that doesn't really change how you should think about CPU caching.
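To make that concrete, here's a minimal sketch of the arithmetic, assuming a 64-byte cache line (typical on modern x86/ARM chips, but not guaranteed everywhere):

```lua
-- assuming a 64-byte cache line
local CACHE_LINE = 64
local F32_SIZE = 4

-- one fetched line covers this many consecutive f32 values:
print(CACHE_LINE / F32_SIZE) --> 16

-- so when you scan f32s sequentially, only 1 out of every 16 reads has to
-- go out past the cache; the other 15 hit the line that was already pulled in
```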


How CPU caching impacts your code

This is where the concept of "memory locality" comes in. When you iterate through an array, its elements sit next to each other in memory, so the cache line the CPU fetched for one element already contains the next several. This means your CPU rarely has to go back to RAM, which is a direct performance benefit.

Twitter thread by @sleitnick showing this: https://twitter.com/sleitnick/status/1562067371770224640?lang=en
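Here's a rough sketch of how you could observe this yourself in Luau (the read count and the 64-byte line size are my own assumptions; treat it as a toy benchmark, not a rigorous one). Both loops do the same number of reads, but the strided loop lands on a fresh cache line every single read:

```lua
local READS = 1000000

-- 64 MB of zeroed bytes, big enough that the strided pass can't live in cache
local b = buffer.create(READS * 64)

local function timeReads(label: string, stride: number)
	local start = os.clock()
	local total = 0
	local offset = 0
	for _ = 1, READS do
		total += buffer.readf32(b, offset)
		offset += stride
	end
	print(label, os.clock() - start, total)
end

timeReads("sequential (4-byte stride):", 4) -- ~16 reads per cache line
timeReads("strided (64-byte stride):", 64) -- a new cache line every read
```

In interpreted mode the library-call overhead can drown out the cache effect, so the contrast is clearest under native codegen (covered below).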


Important things to note in Luau

  • Tables are stored on the heap: that means you don't really get to control whether tables are stored next to each other.
  • I'm fairly certain that dictionaries are stored as hashmaps, whose entries are laid out contiguously in memory
  • Most of the data you work with will very quickly end up in at least the L3 cache. Most computers have at least 10 megabytes of total cache memory (that is, L1, L2, and L3 combined), so most of the time you won't ever deal with a situation where RAM is constantly being accessed directly.
  • Code using less memory = more space in the L1 cache. That's a generalization you can stick with.

Okay cool, where do buffers come in, and how do they relate to native codegen?

A buffer is an array of bytes. That's really awesome, because it means we get to control exactly how our data is laid out in memory.
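For example, here's a minimal sketch (the names and the 12-byte layout are my own choices, not anything mandated by the API) that packs 3D points into one contiguous buffer instead of allocating a table per point:

```lua
-- 12 bytes per point: three f32s packed back-to-back in one allocation
local POINT_SIZE = 12
local COUNT = 1000
local points = buffer.create(COUNT * POINT_SIZE)

local function setPoint(i: number, x: number, y: number, z: number)
	local base = i * POINT_SIZE -- i is zero-based
	buffer.writef32(points, base, x)
	buffer.writef32(points, base + 4, y)
	buffer.writef32(points, base + 8, z)
end

local function getPoint(i: number): (number, number, number)
	local base = i * POINT_SIZE
	return buffer.readf32(points, base),
		buffer.readf32(points, base + 4),
		buffer.readf32(points, base + 8)
end

setPoint(0, 1.5, 2.5, 3.5)
print(getPoint(0)) --> 1.5 2.5 3.5
```

Point i always lives at offset i * 12, so iterating the points in order walks straight through memory: exactly the access pattern the cache loves.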

However, there's a problem: buffer reads and writes go through the standard library, so every access is a library call that carries overhead. And that's where native code generation comes in!

Native code generation turns your Luau code into machine code that the CPU can execute directly. This means that instead of being run instruction-by-instruction by the C++ interpreter, your Luau code executes as native instructions. It also means the buffer library's per-call overhead largely disappears, making buffers just as fast, if not faster, than tables. (this was posted in the OSS server) @Ruuuusty benchmarked matrix multiplication with buffers vs. with tables.
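On Roblox you opt a script into native codegen with the --!native directive at the top of the file (whether a given function actually gets compiled is still up to the engine). A minimal sketch:

```lua
--!native
-- the directive above asks the engine to compile this script to machine
-- code; tight numeric loops over buffers benefit the most

local function sumF32s(b: buffer): number
	local total = 0
	for offset = 0, buffer.len(b) - 4, 4 do
		total += buffer.readf32(b, offset)
	end
	return total
end

return sumF32s
```

Type annotations like `b: buffer` also help here, since the compiler doesn't have to guess what's in the variable before emitting direct memory accesses.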

[Benchmark screenshot: Interpreted mode]

[Benchmark screenshot: Native mode]


You can clearly see the appeal of buffers: they use memory far more compactly, on the server they're typically faster than tables (just not in this example, since nothing here needs resizing), and they let you control how your data is laid out in memory, which gives you better memory locality. That means you can take better advantage of CPU caching.

Maybe I'll write another article about using buffers some day. Thanks for reading!

Relevant links & resources so you can better understand this:
https://www.youtube.com/watch?v=N3o5yHYLviQ
https://youtu.be/247cXLkYt2M


Don’t your graphs show the buffers being slower than tables in both interpreted and native?..


babe wake up new ffrostfall performance post just dropped

they do, but buffers use less memory for the same speed, which means you'll get better memory locality & less memory fragmentation