Faster way of counting each character in a string?

I am trying to very efficiently generate a list that has the count of each character in a string, essentially looking like this for the string “Hello world!”

{
[" “] = 1,
[”!"] = 1,
[“H”] = 1,
[“d”] = 1,
[“e”] = 1,
[“l”] = 3,
[“o”] = 2,
[“r”] = 1,
[“w”] = 1
}

My current code looks like this;

local frequency = {}

for i, c in input:split("") do
	frequency[c] = (frequency[c] or 0) + 1
end

(also very interesting that the world input is blue)
It works at about 2 million character per second, which is not good enough for me.

Another solution I have tried to speed up the process is:

local frequency = {}

for c = 0, 255 do
	local char = string.char(c)
	local count = #input:split(char)-1

	if count > 0 then
		frequency[char] = count
	end
end

But that is much much slower, presumably because of how many times split is going through the string.

Another thing I have tried is instead of iterating through input:split() I iterate through 1, #len and use :sub, but it’s just a tiny bit slower on average.

Any suggestions would be appreciated.

I have bad news then. Lua might not be the language for you.

True, but I don’t see how I would get much better using c or something. Since the string:split() function is c api and its only working at 2 million characters per second, that’s the bottleneck (my code inside the loop is able to work about about 7 million characters per second).

Did you ever try string.gmatch function?

local Str = "Hello World!"
local Frequency = {}

for Character in string.gmatch(Str, ".") do
	if Frequency[Character] then
		Frequency[Character] += 1
	else
		Frequency[Character] = 1
	end
end

print(Frequency)

I haven’t touched C much yet so I couldn’t say for certain, but I’d imagine that you could do this a lot faster by just iterating over the bytes of the string, basically the same as substring but without all of the messy Lua function calls. Just a loop and a sequential memory read.

1 Like

Just tested it again, runs about 0.5 million slower then splitting the string (likely because of pattern matching)

I will test it out, along with utf8.graphemes

utf8.graphemes wont work for what I’m doing. The string is going to be a byte sequence, not a string of readable text.

Just for curiosity’s sake, since gmatch was only slightly slower, is there any benefit to using gsub instead? I wouldn’t think so but you might as well try it since we’re looking.

local Str = "Hello World!"
local Frequency = {}

Str:gsub(".", function(c) Frequency[c] = (Frequency[c] or 0) + 1 end)

My goodness, it’s running at 16 million characters per second (above my theoretical maximum of 7 million) I would of thought that since it’s calling the function very often it would be slower but apparently not.

1 Like

Glad to have helped! Here’s another idea I found while digging around if you want to give it a go as well. It’s supposed to be faster than substring.

local input = "Hello World!"
local frequency = {}
for i = 1, #input do
    local byte = string.byte(input, i)
    frequency[byte] = (frequency[byte] or 0) + 1
end

I have tried byte before, its about the same speed as sub.

1 Like

You guys sure?
Because I just benchmarked a randomized string of 16 million characters and here is my benchmark results:
image

Can you send your benchmark code?

Sure.

local Cache = {["."] = {}}

for Charcter = 0, 255 do
	Cache["."][Charcter+1] = string.char(Charcter)
end

function GetCharactersFromSet(CharSet: string)
	local Characters = {}
	for _, C in ipairs(Cache["."]) do
		if string.match(C, CharSet) then
			Characters[#Characters + 1] = C
		end
	end
	Cache[CharSet] = Characters
	return Characters
end

function Random(Length: number?, CharSet: string?)
	local Length = ((Length and Length > 0 and Length) or 20)
	local CharSet = ((CharSet and CharSet ~= "" and CharSet) or "[%w]")
	local CharPattern = (Cache[CharSet] or GetCharactersFromSet(CharSet))
	local MaxRange = #CharPattern
	local Randomized = ("")
	for _ = 1, Length do
		Randomized ..= CharPattern[math.random(1, MaxRange)]
	end
	return Randomized
end

------------------------------------------------------------------------------------------------|

function Benchmark(func, times, ...)
	local Start = os.clock()
	for _ = 1, (times or 1) do
		func(...)
	end
	local End = os.clock()
	return End - Start
end

local Str = Random(10000):rep(1600)

function GMatch()
	local Frequency = {}
	for Character in string.gmatch(Str, ".") do
		if Frequency[Character] then
			Frequency[Character] += 1
		else
			Frequency[Character] = 1
		end
	end
end

function Split()
	local Frequency = {}
	for Character in string.split(Str, "") do
		if Frequency[Character] then
			Frequency[Character] += 1
		else
			Frequency[Character] = 1
		end
	end
end

function Sub()
	local Frequency = {}
	for i = 1, #Str do
		local Char = string.sub(Str, i, i)
		Frequency[Char] = (Frequency[Char] or 0) + 1
	end
end

function GSub()
	local Frequency = {}
	Str:gsub(".", function(c) Frequency[c] = (Frequency[c] or 0) + 1 end)
end

function Byte()
	local Frequency = {}
	for i = 1, #Str do
		local byte = string.byte(Str, i, i)
		Frequency[byte] = (Frequency[byte] or 0) + 1
	end
end

warn(string.format("Split - Elapsed time: %.4f seconds", Benchmark(Split)))
task.wait(1)
warn(string.format("GMatch - Elapsed time: %.4f seconds", Benchmark(GMatch)))
task.wait(1)
warn(string.format("Sub - Elapsed time: %.4f seconds", Benchmark(Sub)))
task.wait(1)
warn(string.format("GSub - Elapsed time: %.4f seconds", Benchmark(GSub)))
task.wait(1)
warn(string.format("Byte - Elapsed time: %.4f seconds", Benchmark(Byte)))

Very interesting, I tried a few things and I still get the same results as you. I am still going to use gsub for what I am doing though because it’s working faster.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.