Fuzzy Matching and Levenshteins Algorithm

Yo! Recently, I’ve been diving deeper into algorithms and computational thinking when programming, and as a starting point, I started work on an autocomplete, when researching autocompletes, once again I’ve fallen behind @boatbomber 's footsteps, with soft matching and/or fuzzy matching. After researching that, I came across Levenshteins algorithm. For a little background, fuzzy matching (Directly from wikipedia, yeah) is

“The technique of finding strings that match a pattern approximately (rather than exactly)”

and Levenshteins Algorithm is

“The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other”

Example: How many changes would it take to transform “XOLT” into “BOLT”?

Jepoardy music

1, simply change X → B

And Levenshteins Algorithm is just a way of doing that calculation in code.

Use Cases: Autocomplete/Auto suggestions Spell Checking

Enough explaining, lets get to coding. (Links are at the bottom for far better explanations)

First, lets start off by creating our function

``````function LevenshteinDistance(strA: string, strB: string): number
--Input strings; Ouput distance
end
``````

Boom, now, for this function we need a 2D array, or a matrix. The lua docs has a nice section for that.

So outside of this function, lets just make a createMatrix function

``````local function newMatrix(N, M)
local mt = {}
for i = 0, N do
mt[i] = {}
for j = 0, M do
mt[i][j] = 0
end
end
return mt
end
``````

This’ll create a matrix full of 0s, N is the number of rows, M is the number of columns

Alright, why are we even using a matrix? Wikipedia has a geeky way to explain it

Computing the Levenshtein distance is based on the observation that if we reserve a matrix to hold the Levenshtein distances between all prefixes of the first string and all prefixes of the second, then we can compute the values in the matrix in a dynamic programming fashion

Yeah, yeah, in human terms, we’re holding the distance between each character in this matrix, visually, its like a table:
Each row holds each character of string A, while each column holds each character of B (or vice versa). We can easily find the number of rows and columns needed for our matrix by getting the length of each string, so lets do that.

``````function LevenshteinDistance(strA: string, strB: string): number
--Input strings; Ouput distance
local rows = string.len(strA)
local cols = string.len(strB)
if rows == 0 then
return cols
elseif cols == 0 then
return rows
elseif strA == strB then
return 0
end
end
``````

“Why that if chain?”
The logic behind that is simple, if either A or B is “” (length is 0), then the number of changes we need to do is equal to the length of the other string/ we need to do a change for each of the characters in order to change “” to for example “foo” is 3 (Add f, then add o then another o). So thats some easy logic out of the way. Then we just check if the two strings are the same, if so, then we don’t need to do any changes, meaning the distance is 0, and they’re as close as they can get.

Now we make a matrix, and initializing this matrix

``````function LevenshteinDistance(strA: string, strB: string): number
--Input strings; Ouput distance
local rows = string.len(strA)
local cols = string.len(strB)
if rows == 0 then
return cols
elseif cols == 0 then
return rows
elseif strA == strB then
return 0
end

local matrix = newMatrix(rows, cols)
for i = 1, rows do
matrix[i][0] = i
end
for j = 1, cols do
matrix[0][j] = j
end
end
``````

Visually it would look like this (Image from)

Next up, the heart of the big man’s algorithm, we loop through the tables, and find our distances

``````local function LevenshteinDistance(strA: string, strB: string): number
local rows = string.len(strA)
local cols = string.len(strB)
if rows == 0 then --Length of a is 0? The distance is just the length of b, because it would take that many substitutions to make b
return cols
elseif cols == 0 then
return rows
elseif strA == strB then --Same string? The distance is 0
return 0
end
local matrix = newMatrix(rows, cols) --Create a matrix or 2D array

--Initialize Matrix
for i = 1, rows do
matrix[i][0] = i
end
for j = 1, cols do
matrix[0][j] = j
end

for i = 1, rows do --If you initialized your matrix with 1s instead of 0s, start at 2 instead of 1 here
for j = 1, cols do
if (strA:byte(i) == strB:byte(i)) then
--If the two characters in both strings are the same
matrix[i][j] = matrix[i-1][j-1]
else
local insertion = matrix[i][j-1] + 1
local deletion = matrix[i - 1][j] + 1
local subsitution = matrix[i - 1][j - 1] + 1

matrix[i][j] = math.min(insertion, deletion, subsitution)
end

end
end

end
``````

Now what is all this? Following his algorithm, which I just realized I never showed:

That little statement at the bottom is that first if check in our code. If the two characters are the same, we set our current distance equal to the same distance as our previous check, if not we do the yellow highlighted portion. Which is our insertion, substitutions and deletions methods. (Since we’ve done the checks at the beginning, that first if statement isn’t needed, the same logic applied there is the logic in said if statement)
We then get the minimum out of the different possible changes, as in best case scenario and set the current index to it. According to Ls Algo, the distance between the two strings is the last calculated value, and lucky for us, its really easy to get and for whatever reason, we’ll also include the matrix, simply:

``````local function LevenshteinDistance(strA: string, strB: string): number
local rows = string.len(strA)
local cols = string.len(strB)
if rows == 0 then --Length of a is 0? The distance is just the length of b, because it would take that many substitutions to make b
return cols
elseif cols == 0 then
return rows
elseif strA == strB then --Same string? The distance is 0
return 0
end
local matrix = newMatrix(rows, cols) --Create a matrix or 2D array

--Initialize Matrix
for i = 1, rows do
matrix[i][0] = i
end
for j = 1, cols do
matrix[0][j] = j
end

for i = 1, rows do --If you initialized your matrix with 1s instead of 0s, start at 2 instead of 1 here
for j = 1, cols do
if (strA:byte(i) == strB:byte(i)) then
--If the two characters in both strings are the same
matrix[i][j] = matrix[i-1][j-1]
else
local insertion = matrix[i][j-1] + 1
local deletion = matrix[i - 1][j] + 1
local subsitution = matrix[i - 1][j - 1] + 1

matrix[i][j] = math.min(insertion, deletion, subsitution)
end

end
end
return matrix[rows][cols], matrix
end
``````

I’m terrible at explaining, but I thought just throwing this out with no explanation wouldn’t be best, so if I messed up anything, or explained anything wrong, please correct me. If you want a far better explanation, a few links are below. Nonetheless, the code works, and you can use it at your full discretion.

Full Code
``````local function newMatrix(N, M)
local mt = {}
for i = 0, N do
mt[i] = {}
for j = 0, M do
mt[i][j] = 0
end
end
return mt
end
local function LevenshteinDistance(strA: string, strB: string): number
local rows = string.len(strA)
local cols = string.len(strB)
if rows == 0 then --Length of a is 0? The distance is just the length of b, because it would take that many substitutions to make b
return cols
elseif cols == 0 then
return rows
elseif strA == strB then --Same string? The distance is 0
return 0
end
local matrix = newMatrix(rows, cols) --Create a matrix or 2D array

--Initialize Matrix
for i = 1, rows do
matrix[i][0] = i
end
for j = 1, cols do
matrix[0][j] = j
end

for i = 1, rows do --If you initialized your matrix with 1s instead of 0s, start at 2 instead of 1 here
for j = 1, cols do
if (strA:byte(i) == strB:byte(i)) then
--If the two characters in both strings are the same
matrix[i][j] = matrix[i-1][j-1]
else
local insertion = matrix[i][j-1] + 1
local deletion = matrix[i - 1][j] + 1
local subsitution = matrix[i - 1][j - 1] + 1

matrix[i][j] = math.min(insertion, deletion, subsitution)
end

end
end
return matrix[rows][cols], matrix
end
``````

And you can test your code using a Levenshtein Calculator online, and running a few test strings through the function.

15 Likes

oh nice, I made Levenshtein distance before

I’m happy someone else tried it out