Create replacements for string.find and string.sub to overcome the UTF-8 bug

rogeriodec_games · September 23, 2022, 9:44pm

As a Roblox developer, it is currently impossible to get correct results from string.find and string.sub when the string contains multi-byte UTF-8 characters, as already reported here:

Since the string library itself can't be changed, please add equivalent functions to the utf8 library to fix these bugs:

utf8.sub (to fix string.sub)
utf8.find (to fix string.find)

Halalaluyafail3 · October 1, 2022, 2:14am

What results do you want? The string library assumes a one byte character encoding (as the manual states, this isn’t really a bug), so it counts in byte offsets.
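To make the byte-offset behaviour concrete, here is a small sketch. It assumes the script is saved as UTF-8, so that each accented letter in the literal takes two bytes, and the variable names are only for illustration:

```lua
local s = "áé" -- two characters, each encoded as two bytes in UTF-8

print(#s)          -- 4: the length operator counts bytes
print(utf8.len(s)) -- 2: utf8.len counts codepoints

-- A plain search (fourth argument true) reports byte offsets.
local findStart, findEnd = string.find(s, "é", 1, true)
print(findStart, findEnd) -- 3 4

-- string.sub(s, 1, 1) returns only the first byte of "á",
-- which is no longer valid UTF-8 on its own.
local firstByte = string.sub(s, 1, 1)
print(#firstByte) -- 1
print((utf8.len(firstByte))) -- nil
```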

If you want to get a substring of codepoints, then that can be done with utf8.offset to get the byte offsets of the range you want. Here is an example of that:

local function codepoint_sub(s,i,j)
	return string.sub(
		s,
		utf8.offset(s,i) or if i > 0 then #s+1 else 0,
		if j == nil or j == -1 then #s else (utf8.offset(s,j+1) or if j > 0 then #s+1 else 0)-1
	)
end

However, a character may be encoded with multiple codepoints (such as z̀) in which case this function might split the character (e.g. codepoint_sub("z̀",1,1) would get z). If getting a substring with grapheme offsets is desired, that could be done by iterating through the string with utf8.graphemes grapheme by grapheme and concatenating all of the graphemes together to get the result.

What would you use a utf8.sub function for and what do you expect it to do?

As for utf8.find, it isn’t very clear what you want from such a function. It would be helpful if you could provide examples of why these functions would be useful.
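One possible reading, purely as a guess at what is being asked for, is a plain-text search that takes and returns codepoint indices instead of byte offsets. The sketch below is not an existing API: codepoint_find is a hypothetical name, and it assumes valid UTF-8 input, a non-empty needle, and a positive init.

```lua
-- Hypothetical helper: plain-text search that reports codepoint indices.
-- init is a 1-based codepoint index (assumed positive), defaulting to 1.
local function codepoint_find(s, needle, init)
	if needle == "" then
		return nil -- empty needles are out of scope for this sketch
	end
	-- Translate the starting codepoint index into a byte offset.
	local byteInit = utf8.offset(s, init or 1)
	if not byteInit then
		return nil -- init lies past the end of the string
	end
	-- Plain search (fourth argument true), so needle is not a pattern.
	local byteStart, byteEnd = string.find(s, needle, byteInit, true)
	if not byteStart then
		return nil
	end
	-- utf8.len(s, i, j) counts the codepoints that start in bytes i..j,
	-- which converts a byte offset back into a codepoint index.
	return utf8.len(s, 1, byteStart), utf8.len(s, 1, byteEnd)
end

print(codepoint_find("áéíó", "í"))    -- 3 3 (string.find would give 5 6)
print(codepoint_find("abcó", "có"))   -- 3 4
print(codepoint_find("áéáé", "á", 2)) -- 3 3
```

Like codepoint_sub, this counts codepoints rather than graphemes, so a character such as z̀ still occupies two index positions.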

CarefreeCarrot · July 18, 2024, 10:51pm

function SubStrUTF8(str :string, IPos :number, FPos :number?) :string
	local ct, i, j = 0
	for StartByte, EndByte in utf8.graphemes(str) do
		ct += 1
		if ct == IPos then i = StartByte end
		if ct == FPos then j = EndByte break end
	end
	if not i then return "" end
	return string.sub(str, i, j)
end

not the best implementation but it gets the job done
SubStrUTF8("áéíó", 1, 1) → "á"
SubStrUTF8("áéíó", 1) → "áéíó"
SubStrUTF8("áéíó", 1, 3) → "áéí"
SubStrUTF8("abcó", 3, 4) → "có"

I’m using it to truncate user-provided strings at some grapheme count

edit: this is probably a better implementation but i haven't tested it much

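function SubStrUTF8(str :string, IPos :number, FPos :number?) :string
	-- utf8.offset counts codepoints, not graphemes, so a character built from
	-- several codepoints (such as z̀) can still be split.
	local i = utf8.offset(str, IPos) -- byte where codepoint IPos starts
	if not i then return "" end -- IPos is past the end of the string
	if not FPos then return string.sub(str, i) end
	local j = utf8.offset(str, FPos + 1) -- byte where the next codepoint starts
	if j then j = j - 1 end -- step back to the last byte of codepoint FPos
	return string.sub(str, i, j) -- a nil j means "to the end of the string"
end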