Why does the order of a nested for loop change the speed of execution?

DrKittyWaffles · February 21, 2020, 10:03am

Preface

I noticed when writing code involving nesting loops that it was always faster to iterate over the smaller loop with the larger one inside of that rather than the other way around.

Where x < y in both cases

Case 1: Runs faster

for i = 1, x do
    for j = 1, y do

    end
end

Case 2: Runs slower

for i = 1, y do
    for j = 1, x do

    end
end

Example

In this examples I will be iterating two loops, one from [1, 100] and the other to [1, 1000000] and I will showcase how the time until completion is different.

do
	local start = os.clock()

	for i = 1, 100 do
		for j = 1, 1000000 do

		end
	end

	print(os.clock() - start)
end

do
	local start = os.clock()

	for i = 1, 1000000 do
		for j = 1, 100 do

		end
	end

	print(os.clock() - start)
end

The output yields:

0.674031
0.754308

Showing that the first case is indeed faster than the second. And so, my question is: Why does this occur?

The_PieDude · February 21, 2020, 10:16am

The extra time in the second example most likely comes from the script having to initialize the j variable and the for loop 1 million times whilst in the first example it only has to do that 100 times

DrKittyWaffles · February 21, 2020, 10:17am

Yeah but in the first snippet I’m declaring i 10^6 times instead of j, and the other way around for the second. So I don’t think that is the right answer…?

The_PieDude · February 21, 2020, 10:21am

the variable i is only created once for the first for loop in both examples
the for loop then creates another for loop which creates another variable, the length of the first for loop determines how often it has to create that variable/loop

DrKittyWaffles · February 21, 2020, 10:29am

Someone said I got something wrong here but I am not entirely sure so I’ll just cross it out…
For loops create a “new” temporary variable each iteration so there would be 10^6 new declarations of i temporarily, and the same for j in the second case. And so the sum of the amount of temporary variables created would be the same.

On the note of my question, someone (@jakedies) in a Discord server I’m in suggested that it is because the “FORPREP” opcode would be called more often. In the first case it would be only called x + 1 times (101) while in the second it would be called y + 1 times (1000001)

http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf

So if I have a nested loop the FORPREP opcode is called once to initialize the outer loop, and then is called x more times where x represents the amount of iterations the outer loop makes because each iteration of the outer loop makes a new loop.

I might’ve interpreted the information wrong but if I interpreted it correctly then I conclude that the cause of the problem stems from the amount of times an opcode is called.

Raviates · February 21, 2020, 10:32am

You can’t benchmark like this and expect reliable results, the noise will wash away any valuable information. Try running multiple iterations of this and then report back if the gap is still present and wide.

DrKittyWaffles · February 21, 2020, 10:34am

I thought it was obvious I ran more than one trial, my bad. I ran 10 trials

Case 2 reliability ran slower than case 1. Only once out of 10 trials did case 2 run faster than case 1 and the difference in time was roughly 0.0001

The_PieDude · February 21, 2020, 10:44am

It does indeed seem that it a FORPREP creates the internal value (the one I was referencing)

It also creates a external upvalue to support doing stuff like this

In conclusion Its most likely the extra FORPREP calls that you mention that cause the slowdown.

jakedies · February 21, 2020, 7:52pm

What makes this slow is not the “creation” of a local variable (and it is not an upvalue). What makes this slow is a few type checks, initial initialization of the internal index (the counter), and an extra execution of the FORLOOP instruction (see body of case OP_FORPREP in lvm.c, and an extra execution of the FORLOOP instruction which terminates the loop. The FORLOOP instruction will be executed 2 + n times (if n > 0 where n is the number of iterations) and just once otherwise. The additional 2 times comes from it being jumped to in FORPREP and at the end (when the counter passes the limit) that actually terminates the loop.

Case 1: 1 + x FORPREP and 2 + x + (2 + y)*x FORLOOP = 2 + 3x + xy
Case 2: 1 + y FORPREP and 2 + y + (2 + x)*y FORLOOP = 2 + 3y + xy

If we subtract the number of times the FORLOOP instruction is executed by common terms (2 and xy) you end up with the important difference: 3x in the first case vs 3y in the second case. So it’s not only FORPREP being executed y - x more times, it’s also FORLOOP being executed 3(y - x) more times due to both the additional jump to that instruction by FORPREP and the terminating case. So I was not entirely correct when I said it was just the FORPREP.

BetterThanReal · April 2, 2020, 2:33am

Also consider that CPU instruction caching could possibly contribute to some of the difference in performance. Depending on how the loops are represented in machine code, the CPU might be throwing away its cached copy of the inner loop instructions, every time the outer loop executes. Invalidated cache misses can cause significant performance degradation for CPU instructions.

jakedies · April 2, 2020, 11:13pm

Lua’s too abstracted for that to be something worth considering. The inner loop instructions aren’t gonna exist in the CPU caches, the various instructions in luaV_execute might be.

BetterThanReal · April 3, 2020, 4:00am

I disagree, Lua’s pretty close to the metal, which is why it’s a great choice for small, embedded engines, like dynamic scripting in game engines. Judging by your analysis of Lua opcodes, you’re probably correct in asserting that the main performance issue at play relates to the number of times certain opcodes are called.

However; Loop optimization has been studied for decades, and has resulted in some pretty intelligent, and non-obvious, optimizations. For a quick reference, check out the topics below from the following articles:

Locality of reference with regarding to caching and data translation look-aside buffer:

Cache-oriented loop optimizations, and loop tiling:
https://www.sciencedirect.com/topics/computer-science/cache-behavior

Loop interchange:

Loop nest optimization: (Refer to the reference for The cache performance and optimizations of blocked algorithms

Loop fusion and fission: (Notable for its application of data locality)

Scalable locality: (specifically uses nested looping as an example)

Impact of loop optimization upon different programming languages and platforms:

jakedies · April 3, 2020, 7:51am

No it’s not, the Lua code you write is executed by the Lua Virtual Machine, which is just a C program. That’s a big abstraction. It’s virtually impossible to take advantage of most of those optimizations. Lua is too far away, your branch in Lua probably results in at least a dozen or so actual branches as well as a ton of book-keeping. The only safe assumption you can make is the Lua stack is likely to be cached, but it’s already common knowledge that accessing and mutating locals is faster than anything in a table (which also means globals).

Edit: I forgot to mention, it’s great for embedded devices because it’s very portable (the standard compiler and VM is written in C89/C99) and has very low memory overhead when compared to most other scripting languages. It’s also very fast compared to many of those languages, mostly because the language (and corresponding pcode) is so simple that the VM really doesn’t do all that much every step (in comparison to other languages). Lua is fast and small, but it’s not “close enough” to the hardware where you can usefully take advantage of optimizations such as caching, branch prediction, etc.

BetterThanReal · April 3, 2020, 3:17pm

I guess “close to the metal” is a subjective term, and as you’ve pointed out, Lua’s VM is much more efficiently implemented than the VMs of some other languages. In any event, even if the loops themselves are perpetually facing cache misses, some feel that Lua cache-hit optimization is worth considering as a potentially impactful way to improve performance. E.g. analysis of Lua cache misses on Xbox architecture: