Fuzzy Matching and Levenshteins Algorithm

Yo! Recently, I’ve been diving deeper into algorithms and computational thinking when programming, and as a starting point, I started work on an autocomplete, when researching autocompletes, once again I’ve fallen behind @boatbomber 's footsteps, with soft matching and/or fuzzy matching. After researching that, I came across Levenshteins algorithm. For a little background, fuzzy matching (Directly from wikipedia, yeah) is

“The technique of finding strings that match a pattern approximately (rather than exactly)”

and Levenshteins Algorithm is

“The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other”

Example: How many changes would it take to transform “XOLT” into “BOLT”?

Jepoardy music

1, simply change X → B

And Levenshteins Algorithm is just a way of doing that calculation in code.

Use Cases:
Autocomplete/Auto suggestions
Spell Checking


Enough explaining, lets get to coding. (Links are at the bottom for far better explanations)

First, lets start off by creating our function

function LevenshteinDistance(strA: string, strB: string): number
    --Input strings; Ouput distance
end

Boom, now, for this function we need a 2D array, or a matrix. The lua docs has a nice section for that.

So outside of this function, lets just make a createMatrix function

local function newMatrix(N, M)
	local mt = {}
	for i = 0, N do
		mt[i] = {}
		for j = 0, M do
			mt[i][j] = 0
		end
	end
	return mt
end

This’ll create a matrix full of 0s, N is the number of rows, M is the number of columns

Alright, why are we even using a matrix? Wikipedia has a geeky way to explain it

Computing the Levenshtein distance is based on the observation that if we reserve a matrix to hold the Levenshtein distances between all prefixes of the first string and all prefixes of the second, then we can compute the values in the matrix in a dynamic programming fashion

Yeah, yeah, in human terms, we’re holding the distance between each character in this matrix, visually, its like a table:
image Each row holds each character of string A, while each column holds each character of B (or vice versa). We can easily find the number of rows and columns needed for our matrix by getting the length of each string, so lets do that.

function LevenshteinDistance(strA: string, strB: string): number
    --Input strings; Ouput distance
    local rows = string.len(strA)
    local cols = string.len(strB)
    if rows == 0 then 
        return cols 
    elseif cols == 0 then 
        return rows
    elseif strA == strB then 
        return 0 
    end
end

“Why that if chain?”
The logic behind that is simple, if either A or B is “” (length is 0), then the number of changes we need to do is equal to the length of the other string/ we need to do a change for each of the characters in order to change “” to for example “foo” is 3 (Add f, then add o then another o). So thats some easy logic out of the way. Then we just check if the two strings are the same, if so, then we don’t need to do any changes, meaning the distance is 0, and they’re as close as they can get.

Now we make a matrix, and initializing this matrix

function LevenshteinDistance(strA: string, strB: string): number
    --Input strings; Ouput distance
    local rows = string.len(strA)
    local cols = string.len(strB)
    if rows == 0 then 
        return cols 
    elseif cols == 0 then 
        return rows
    elseif strA == strB then 
        return 0 
    end

    local matrix = newMatrix(rows, cols)
    for i = 1, rows do
		matrix[i][0] = i
	end
	for j = 1, cols do
		matrix[0][j] = j
	end
end

Visually it would look like this (Image from)

Next up, the heart of the big man’s algorithm, we loop through the tables, and find our distances

local function LevenshteinDistance(strA: string, strB: string): number
	local rows = string.len(strA) 
	local cols = string.len(strB) 
	if rows == 0 then --Length of a is 0? The distance is just the length of b, because it would take that many substitutions to make b
		return cols
	elseif cols == 0 then 
		return rows
	elseif strA == strB then --Same string? The distance is 0
		return 0
	end
	local matrix = newMatrix(rows, cols) --Create a matrix or 2D array
	
	--Initialize Matrix
	for i = 1, rows do
		matrix[i][0] = i
	end
	for j = 1, cols do
		matrix[0][j] = j
	end
	
	for i = 1, rows do --If you initialized your matrix with 1s instead of 0s, start at 2 instead of 1 here
		for j = 1, cols do
			if (strA:byte(i) == strB:byte(i)) then
				--If the two characters in both strings are the same
				matrix[i][j] = matrix[i-1][j-1]
			else
				local insertion = matrix[i][j-1] + 1
				local deletion = matrix[i - 1][j] + 1
				local subsitution = matrix[i - 1][j - 1] + 1

				matrix[i][j] = math.min(insertion, deletion, subsitution)
			end


		end
	end
	
end

Now what is all this? Following his algorithm, which I just realized I never showed:

That little statement at the bottom is that first if check in our code. If the two characters are the same, we set our current distance equal to the same distance as our previous check, if not we do the yellow highlighted portion. Which is our insertion, substitutions and deletions methods. (Since we’ve done the checks at the beginning, that first if statement isn’t needed, the same logic applied there is the logic in said if statement)
We then get the minimum out of the different possible changes, as in best case scenario and set the current index to it. According to Ls Algo, the distance between the two strings is the last calculated value, and lucky for us, its really easy to get and for whatever reason, we’ll also include the matrix, simply:

local function LevenshteinDistance(strA: string, strB: string): number
	local rows = string.len(strA) 
	local cols = string.len(strB) 
	if rows == 0 then --Length of a is 0? The distance is just the length of b, because it would take that many substitutions to make b
		return cols
	elseif cols == 0 then 
		return rows
	elseif strA == strB then --Same string? The distance is 0
		return 0
	end
	local matrix = newMatrix(rows, cols) --Create a matrix or 2D array
	
	--Initialize Matrix
	for i = 1, rows do
		matrix[i][0] = i
	end
	for j = 1, cols do
		matrix[0][j] = j
	end
	
	for i = 1, rows do --If you initialized your matrix with 1s instead of 0s, start at 2 instead of 1 here
		for j = 1, cols do
			if (strA:byte(i) == strB:byte(i)) then
				--If the two characters in both strings are the same
				matrix[i][j] = matrix[i-1][j-1]
			else
				local insertion = matrix[i][j-1] + 1
				local deletion = matrix[i - 1][j] + 1
				local subsitution = matrix[i - 1][j - 1] + 1

				matrix[i][j] = math.min(insertion, deletion, subsitution)
			end


		end
	end
	return matrix[rows][cols], matrix
end

I’m terrible at explaining, but I thought just throwing this out with no explanation wouldn’t be best, so if I messed up anything, or explained anything wrong, please correct me. If you want a far better explanation, a few links are below. Nonetheless, the code works, and you can use it at your full discretion.

Full Code
local function newMatrix(N, M)
	local mt = {}
	for i = 0, N do
		mt[i] = {}
		for j = 0, M do
			mt[i][j] = 0
		end
	end
	return mt
end
local function LevenshteinDistance(strA: string, strB: string): number
	local rows = string.len(strA) 
	local cols = string.len(strB) 
	if rows == 0 then --Length of a is 0? The distance is just the length of b, because it would take that many substitutions to make b
		return cols
	elseif cols == 0 then 
		return rows
	elseif strA == strB then --Same string? The distance is 0
		return 0
	end
	local matrix = newMatrix(rows, cols) --Create a matrix or 2D array
	
	--Initialize Matrix
	for i = 1, rows do
		matrix[i][0] = i
	end
	for j = 1, cols do
		matrix[0][j] = j
	end
	
	for i = 1, rows do --If you initialized your matrix with 1s instead of 0s, start at 2 instead of 1 here
		for j = 1, cols do
			if (strA:byte(i) == strB:byte(i)) then
				--If the two characters in both strings are the same
				matrix[i][j] = matrix[i-1][j-1]
			else
				local insertion = matrix[i][j-1] + 1
				local deletion = matrix[i - 1][j] + 1
				local subsitution = matrix[i - 1][j - 1] + 1

				matrix[i][j] = math.min(insertion, deletion, subsitution)
			end


		end
	end
	return matrix[rows][cols], matrix
end

And you can test your code using a Levenshtein Calculator online, and running a few test strings through the function.

Other Resources:
Levenshtein Distance for beginners
The Levenshtein distance algorithm
Wikipedia | Levenshtein Distance

15 Likes

oh nice, I made Levenshtein distance before

I’m happy someone else tried it out
I made mine without making a matrix, I instead made it using recursion alone
great job tho!

Nice! I’ll have to look into that method, thanks for that.

1 Like

np, if you need any help just tell me

1 Like