Need help with understanding a Linear Regression model implementation

While searching for an example script to help me understand how to implement Linear Regression, I found this:

```
local data = { -- Age and BG Levels
    {25, 75},
    {27, 70},
    {30, 78},
    {33, 90},
    {40, 100},
    {50, 120},
    {52, 110},
    {54, 106},
    {60, 120},
    {75, 122},
    {45, 93},
    {110, 130},
    {155, 167}
}

local XSum = 0
local YSum = 0
local XYSum = 0
local XXSum = 0
local YYSum = 0

for _, v in pairs (data) do
    XSum = XSum + v[1]
    XXSum = XXSum + v[1]^2

    YSum = YSum + v[2]
    YYSum = YYSum + v[2]^2

    XYSum = XYSum + v[1]*v[2]
end

local div = (#data*XXSum)-(XSum^2)
local a = ((YSum*XXSum)-( XSum*XYSum )) / div
local b = (( #data*XYSum ) - ( XSum*YSum )) / div
print(a, b, div .."\n")

for x = 20, 100 do
    print("Age: ".. x, "BG: ".. math.floor((a + (b*x))*100 ) /100)
end
```

The problem is that I do not understand this part of the script:

```
local div = (#data*XXSum)-(XSum^2)
local a = ((YSum*XXSum)-( XSum*XYSum )) / div
local b = (( #data*XYSum ) - ( XSum*YSum )) / div
```

Can someone explain to me what those formulas are about? I was only aware of this one:

y = b0 + b1*x + e

It’s the formula to solve for the slope and intercept of the linear regression line. This page explains how to use it:

As for where it comes from: you take the equation of a line and assign a cost to every point’s distance from the line, specifically the squared vertical distance. You then find the slope and intercept that minimize the total cost, which means finding where the derivative of the cost with respect to each of them is zero. You can read the details here:
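For anyone who wants the steps spelled out, here is a sketch of that derivation using the script's names (a = intercept, b = slope, N = #data). The symbol S for the total cost is my own label, not something from the script.

$$
S(a, b) = \sum_{i=1}^{N} \left( y_i - a - b x_i \right)^2
$$

Setting both partial derivatives to zero gives two linear equations in a and b:

$$
\frac{\partial S}{\partial a} = -2 \sum \left( y_i - a - b x_i \right) = 0
\;\Rightarrow\; \sum y_i = N a + b \sum x_i
$$

$$
\frac{\partial S}{\partial b} = -2 \sum x_i \left( y_i - a - b x_i \right) = 0
\;\Rightarrow\; \sum x_i y_i = a \sum x_i + b \sum x_i^2
$$

Solving that pair for a and b:

$$
b = \frac{N \sum x_i y_i - \sum x_i \sum y_i}{N \sum x_i^2 - \left( \sum x_i \right)^2},
\qquad
a = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{N \sum x_i^2 - \left( \sum x_i \right)^2}
$$

The shared denominator is what the script calls div, and the two numerators match its a and b lines term for term.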


I don’t really like the form of the equation in the script and much prefer a different form, so I’ll attempt to explain it using that other form. (Even in this form I don’t like the b equation; on its face it is just confusing, so I will simplify it until it makes sense.)

(Note that a is the equivalent of b0 and b is the equivalent of b1; it’s just different notation.)

So the first one makes sense: predicted Y = y-intercept + slope*X. It’s just Y = mx + b (where m is the slope and that b is the intercept, unlike the b in the script, which is the slope).

The second one is definitely less clear, but you notice a few things right away.

The top and bottom of the fraction are of similar formats:

N * Σ(X*Y) - Σ(X) * Σ(Y)
and
N * Σ(X*X) - Σ(X) * Σ(X)

Keep in mind what slope is: rise over run, y/x.

If you mess with the equation a tiny bit…


… you can wrangle it down to a more sensible form.

This being the final formula:
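Going by the description that follows (each X and Y multiplied by the X distance from the mean), the rearrangement presumably runs like this. Treat it as my reconstruction of the steps; the bar notation for the mean X is my own shorthand.

$$
\bar{X} = \frac{\sum X}{N}
$$

$$
N \sum XY - \sum X \sum Y = N \left( \sum XY - \bar{X} \sum Y \right) = N \sum \left( X - \bar{X} \right) Y
$$

$$
N \sum X^2 - \left( \sum X \right)^2 = N \left( \sum X^2 - \bar{X} \sum X \right) = N \sum \left( X - \bar{X} \right) X
$$

The factor N cancels between the top and bottom, which leaves:

$$
b = \frac{\sum \left( X - \bar{X} \right) Y}{\sum \left( X - \bar{X} \right) X}
$$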

Keep in mind this fact: the LSRL (least-squares regression line) always passes through the mean point (mean X, mean Y).

So, (Σ X)/N is just the average X value.
Therefore…
X - (Σ X)/N is a point’s signed X distance from the mean.

I’m going to analyze each half of the fraction independently, then combine them later.
Think of the upper half, containing the Y, as a Y datapoint in a set (before it’s summed).
Think of the lower half, containing the X, as an X datapoint in a set (before it’s summed).
Now imagine a random dataset on a scatterplot, then imagine the upper and lower halves of the equation plotted as points on the same graph.
image

The blue dots are the random dataset
The red dots are the upper and lower half of the equation (the upper half is the y value for each given point, the lower half is the x value)

Now imagine the single point resulting from the sum of the X and Y values
image
That’s what the yellow point is: the sum of all of the red points’ positions.
You can see how a line drawn from (0,0) to the yellow point is parallel to the LSRL
image
So the slope of this line is just Y/X, and since two parallel lines have the same slope, that’s your final slope, b:
Y position of yellow point / X position of yellow point
Which is exactly what the formula tells us
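If you want to check the yellow-point idea numerically, here is a small Lua sketch using the data from the original script. The function names (slopeFromSums, slopeFromYellowPoint) are made up for this example; the first repeats the script's b formula and the second sums the red points and divides Y by X.

```lua
-- Same Age/BG data as the script in the question.
local data = {
    {25, 75}, {27, 70}, {30, 78}, {33, 90}, {40, 100}, {50, 120}, {52, 110},
    {54, 106}, {60, 120}, {75, 122}, {45, 93}, {110, 130}, {155, 167},
}

-- Slope from the summation form used in the original script.
local function slopeFromSums(points)
    local n = #points
    local xSum, ySum, xySum, xxSum = 0, 0, 0, 0
    for _, p in ipairs(points) do
        xSum = xSum + p[1]
        ySum = ySum + p[2]
        xySum = xySum + p[1] * p[2]
        xxSum = xxSum + p[1] ^ 2
    end
    return (n * xySum - xSum * ySum) / (n * xxSum - xSum ^ 2)
end

-- Slope from the "yellow point": the sum of all the red points.
local function slopeFromYellowPoint(points)
    local n = #points
    local xSum = 0
    for _, p in ipairs(points) do
        xSum = xSum + p[1]
    end
    local xMean = xSum / n

    local yellowX, yellowY = 0, 0
    for _, p in ipairs(points) do
        local dx = p[1] - xMean        -- signed X distance from the mean
        yellowX = yellowX + dx * p[1]  -- red point X (lower half of the fraction)
        yellowY = yellowY + dx * p[2]  -- red point Y (upper half of the fraction)
    end
    return yellowY / yellowX
end

-- Both lines should print the same slope, up to floating-point rounding.
print(slopeFromSums(data))
print(slopeFromYellowPoint(data))
```

The two functions do the same arithmetic in a different order, so any difference you see should only be rounding noise.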

I really don’t fully understand why this creates a slope that minimizes the distance to the line, and whoever made this formula is a genius. But when you multiply the X and Y values by the X distance to the mean, you’re weighting each point by its own X distance from the mean.
A point further from the mean in the X direction ends up with larger weighted X and Y values, so it has more influence over the final sums and therefore over the line. A point closer to the mean in the X direction has less influence over the final sums because its weighted values are smaller.

This is very similar to finding the center of mass of an object in physics for example, where you take the distance to some point and multiply it by the mass at that point, then sum it all together

This is the best intuitive description of this formula I can give at this point, but I hope it’s more of an explanation than just some massive ugly formula staring you down.

Onto the next formula:

This can be rewritten very easily to make more sense (it won’t be like 8 steps of simplification this time, only 1).
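My best reading of that one step, assuming the intercept formula being discussed is the usual textbook form with ΣY, ΣX and N, is just splitting the fraction:

$$
a = \frac{\sum Y - b \sum X}{N} = \frac{\sum Y}{N} - b \, \frac{\sum X}{N}
$$

In words: the intercept is the average Y minus the slope times the average X. With b written out in full, this works out to the same value as the script's a line, (YSum*XXSum - XSum*XYSum) / div.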

This is really much simpler than the last formula if you think about it.

I will think about this one using unit analysis

So like I said earlier, slope (b) is just y/x
So, by multiplying b by the average X value, the “x” in the unit cancels out (y/x * x = y), and you get a y value
This y value, b*(Σ X)/N, represents the Y value of the line at the mean X if the LSRL had a y-intercept of 0, as shown in this image:


So, by multiplying the x component of the data average by the predicted slope, you get the predicted y average for a line through the origin, which is obviously quite a bit off from the real y average.
To get the y-intercept such that the line passes through the REAL center point, you take the actual y average and subtract the predicted y average from it. That difference is how far the line has to be shifted along the y axis.

You could also derive this formula by rearranging equation (1), the first equation, and solving for a at the mean point (average X, average Y), which the line always passes through:

Y=a+bX
a=Y-bX
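Here is a short Lua sketch of that rearrangement on the original script's data. The function name fitLine and its return values are invented for this example. It computes a as (average Y) - b*(average X) and also the script's own a formula, so you can compare the two.

```lua
-- Same Age/BG data as the script in the question.
local data = {
    {25, 75}, {27, 70}, {30, 78}, {33, 90}, {40, 100}, {50, 120}, {52, 110},
    {54, 106}, {60, 120}, {75, 122}, {45, 93}, {110, 130}, {155, 167},
}

-- Returns the intercept computed two ways, the slope, and the mean point.
local function fitLine(points)
    local n = #points
    local xSum, ySum, xySum, xxSum = 0, 0, 0, 0
    for _, p in ipairs(points) do
        xSum = xSum + p[1]
        ySum = ySum + p[2]
        xySum = xySum + p[1] * p[2]
        xxSum = xxSum + p[1] ^ 2
    end

    local div = n * xxSum - xSum ^ 2
    local b = (n * xySum - xSum * ySum) / div

    local xMean, yMean = xSum / n, ySum / n
    local aFromMeans = yMean - b * xMean                   -- a = Y - bX at the mean point
    local aFromScript = (ySum * xxSum - xSum * xySum) / div -- form used in the script

    return aFromMeans, aFromScript, b, xMean, yMean
end

local aFromMeans, aFromScript, b, xMean, yMean = fitLine(data)
print(aFromMeans, aFromScript)      -- the two intercepts should agree
print(aFromMeans + b * xMean, yMean) -- the line passes through the mean point
```

If the two intercepts agree (up to rounding), that confirms the script's a line is just this rearrangement written with raw sums.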

That was a long post, but hopefully you understand a bit more about LSRLs now (I sure do).
Sorry I couldn’t give a full explanation of the b formula, but hopefully that gives you a decent idea of what it actually is.


I have absolutely no idea how the math works, but I know what is happening in the code.

Putting a ‘#’ in front of a table variable returns the length of the table, so #data is the number of data points. The parentheses group parts of the math: the script evaluates the expression inside each pair of parentheses first, and that expression produces a single value. Any operator outside the parentheses, like dividing by 2, is then applied to that value. (Sorry if I’m being confusing.)

ex.)

local var = ((2 + 2) - (8+1)) / 2

((4) - (9)) / 2

(-5 ) / 2

-2.5
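The same thing as runnable Lua, in case you want to paste it in and try it. The three-row table here is just a cut-down version of the original data, and the variable names are made up for this example.

```lua
-- A shortened table, just to show what # returns.
local data = { {25, 75}, {27, 70}, {30, 78} }
print(#data) -- number of rows in the table

local XSum, XXSum = 0, 0
for _, v in ipairs(data) do
    XSum = XSum + v[1]
    XXSum = XXSum + v[1] ^ 2
end

-- Same shape as the script's div line: each parenthesized part is
-- evaluated to a single number, then the subtraction is applied.
local div = (#data * XXSum) - (XSum ^ 2)
print(div)

-- Where the parentheses go changes the answer when division is involved.
local grouped = ((2 + 2) - (8 + 1)) / 2 -- the whole difference is divided by 2
local ungrouped = 2 + 2 - (8 + 1) / 2   -- only (8 + 1) is divided by 2
print(grouped, ungrouped)
```

That is why the script wraps the whole numerator in parentheses before dividing by div: without them, only the last term would be divided.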

Hope this helps give at least a basic understanding (if this is the answer you’re looking for).
