# Why does the order of a nested for loop change the speed of execution?

### Preface

I noticed when writing code involving nested loops that it was always faster to put the smaller range on the outer loop, with the larger one inside it, rather than the other way around.

Where `x < y` in both cases

Case 1: Runs faster

```lua
for i = 1, x do
    for j = 1, y do

    end
end
```

Case 2: Runs slower

```lua
for i = 1, y do
    for j = 1, x do

    end
end
```

### Example

In this example I will nest two loops, one over [1, 100] and the other over [1, 1000000], and showcase how the time until completion differs.

```lua
do
    local start = os.clock()

    for i = 1, 100 do
        for j = 1, 1000000 do

        end
    end

    print(os.clock() - start)
end

do
    local start = os.clock()

    for i = 1, 1000000 do
        for j = 1, 100 do

        end
    end

    print(os.clock() - start)
end
```

The output yields:

```text
0.674031
0.754308
```

Showing that the first case is indeed faster than the second. And so, my question is: Why does this occur?


The extra time in the second example most likely comes from the script having to initialize the `j` variable and the inner for loop a million times, whereas in the first example it only has to do that 100 times.

Yeah, but in the first snippet I'm declaring `j` 10^6 times instead of `i`, and the other way around for the second. So I don't think that is the right answer…?

The variable `i` is only created once, for the outer for loop, in both examples.
The outer loop then creates the inner for loop, which creates another variable; the length of the outer loop determines how often that variable/loop has to be created.

Someone said I got something wrong here but I am not entirely sure, so I'll just cross it out…
For loops create a "new" temporary variable each iteration, so there would be `10^6` new declarations of `i` temporarily, and the same for `j` in the second case. And so the total number of temporary variables created would be the same.

On the note of my question, someone (@jakedies) in a Discord server I'm in suggested that it is because the "FORPREP" opcode would be called more often. In the first case it would only be called `x + 1` times (101), while in the second it would be called `y + 1` times (1000001).

So if I have a nested loop, the `FORPREP` opcode is called once to initialize the outer loop, and then `x` more times, where `x` is the number of iterations the outer loop makes, because each iteration of the outer loop initializes a new inner loop.

I might've interpreted the information wrong, but if I interpreted it correctly, then I conclude that the difference stems from the number of times an opcode is called.


You can't benchmark like this and expect reliable results; the noise will wash away any valuable information. Try running multiple iterations of this and then report back whether the gap is still present and wide.

I thought it was obvious that I ran more than one trial, my bad. I ran 10 trials.

Case 2 reliably ran slower than case 1. Only once out of 10 trials did case 2 run faster than case 1, and the difference in time was roughly `0.0001`.
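For what it's worth, a similar ordering effect can be reproduced in other interpreted languages, since most interpreters pay a per-loop setup cost. Here is a minimal repeated-trials harness sketched in Python (the function name, trial count, and iteration counts are my own choices, not from the thread):

```python
import time
import statistics

def time_nested(outer, inner, trials=10):
    """Time `trials` runs of an empty nested loop with the given iteration counts."""
    results = []
    for _ in range(trials):
        start = time.perf_counter()
        for _i in range(outer):
            for _j in range(inner):
                pass
        results.append(time.perf_counter() - start)
    return results

case1 = time_nested(100, 10_000)   # small outer loop, large inner loop
case2 = time_nested(10_000, 100)   # large outer loop, small inner loop

# Compare medians rather than single runs to reduce noise.
print(statistics.median(case1), statistics.median(case2))
```

Comparing medians (or minimums) over many trials is what makes a micro-benchmark like this meaningful; a single run is dominated by noise.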


It does indeed seem that `FORPREP` creates the internal value (the one I was referencing).

It also creates an external upvalue to support doing stuff like this.

In conclusion, it's most likely the extra `FORPREP` calls that you mention that cause the slowdown.


What makes this slow is not the "creation" of a local variable (and it is not an upvalue). What makes this slow is a few type checks, the initialization of the internal index (the counter) in `FORPREP` (see the body of `case OP_FORPREP` in lvm.c), and extra executions of the `FORLOOP` instruction, which terminates the loop. The `FORLOOP` instruction will be executed `2 + n` times (if `n > 0`, where `n` is the number of iterations) and just once otherwise. The additional 2 executions come from it being jumped to by `FORPREP` and from the final pass (when the counter passes the limit) that actually terminates the loop.

Case 1: `1 + x` `FORPREP` and `2 + x + (2 + y)*x` = `2 + 3x + xy` `FORLOOP`
Case 2: `1 + y` `FORPREP` and `2 + y + (2 + x)*y` = `2 + 3y + xy` `FORLOOP`

If we subtract the common terms (`2` and `xy`) from the number of `FORLOOP` executions, you end up with the important difference: `3x` in the first case vs `3y` in the second. So it's not only `FORPREP` being executed `y - x` more times; it's also `FORLOOP` being executed `3(y - x)` more times, due to both the additional jump to that instruction from `FORPREP` and the terminating case. So I was not entirely correct when I said it was just the `FORPREP`.
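Plugging the thread's numbers (`x = 100`, `y = 10^6`) into these formulas makes the gap concrete. A quick arithmetic sketch (the helper names are mine, not Lua's):

```python
def forprep_count(outer):
    # FORPREP runs once for the outer loop, then once per inner-loop setup.
    return 1 + outer

def forloop_count(outer, inner):
    # 2 + outer for the outer loop, plus (2 + inner) per inner loop.
    return 2 + outer + (2 + inner) * outer

x, y = 100, 1_000_000

case1 = forprep_count(x) + forloop_count(x, y)  # small outer, large inner
case2 = forprep_count(y) + forloop_count(y, x)  # large outer, small inner

# Case 2 executes (y - x) extra FORPREPs and 3(y - x) extra FORLOOPs,
# i.e. 4(y - x) extra instruction dispatches in total.
print(case2 - case1)  # → 3999600
```

The difference of roughly four million extra instruction dispatches lines up with the `y - x` extra `FORPREP` plus `3(y - x)` extra `FORLOOP` executions derived above.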


Also consider that CPU instruction caching could contribute to some of the difference in performance. Depending on how the loops are represented in machine code, the CPU might be throwing away its cached copy of the inner loop instructions every time the outer loop executes. Cache misses caused by invalidation can significantly degrade performance.

Lua's too abstracted for that to be something worth considering. The inner loop instructions aren't gonna exist in the CPU caches; the various instructions in `luaV_execute` might be.

I disagree, Lua's pretty close to the metal, which is why it's a great choice for small, embedded engines, like dynamic scripting in game engines. Judging by your analysis of Lua opcodes, you're probably correct in asserting that the main performance issue at play relates to the number of times certain opcodes are called.

However, loop optimization has been studied for decades and has resulted in some pretty intelligent, non-obvious optimizations. For a quick reference, check out the topics below from the following articles:

Locality of reference with regard to caching and the data translation look-aside buffer:

Cache-oriented loop optimizations, and loop tiling:
https://www.sciencedirect.com/topics/computer-science/cache-behavior

Loop interchange:

Loop nest optimization: (refer to the reference for "The cache performance and optimizations of blocked algorithms")

Loop fusion and fission: (Notable for its application of data locality)

Scalable locality: (specifically uses nested looping as an example)

Impact of loop optimization upon different programming languages and platforms:

No it's not; the Lua code you write is executed by the Lua virtual machine, which is just a C program. That's a big abstraction. It's virtually impossible to take advantage of most of those optimizations. Lua is too far away; your branch in Lua probably results in at least a dozen or so actual branches, as well as a ton of bookkeeping. The only safe assumption you can make is that the Lua stack is likely to be cached, but it's already common knowledge that accessing and mutating locals is faster than anything in a table (which also means globals).

Edit: I forgot to mention, it's great for embedded devices because it's very portable (the standard compiler and VM is written in C89/C99) and has very low memory overhead compared to most other scripting languages. It's also very fast compared to many of those languages, mostly because the language (and corresponding pcode) is so simple that the VM really doesn't do all that much each step (in comparison to other languages). Lua is fast and small, but it's not "close enough" to the hardware for you to usefully take advantage of optimizations such as caching, branch prediction, etc.


I guess "close to the metal" is a subjective term, and as you've pointed out, Lua's VM is much more efficiently implemented than the VMs of some other languages. In any event, even if the loops themselves perpetually face cache misses, some feel that Lua cache-hit optimization is worth considering as a potentially impactful way to improve performance. E.g. analysis of Lua cache misses on Xbox architecture:
