Create replacements for string.find and string.sub to overcome the UTF-8 bug

As a Roblox developer, it is currently impossible to get correct results from string.find and string.sub when the string contains multi-byte UTF-8 characters, as already reported here:

Since you won’t be able to change the string library, please add equivalent functions to the utf8 library to fix these bugs:

  • utf8.sub (to fix string.sub)
  • utf8.find (to fix string.find)

What results do you want? The string library assumes a one-byte character encoding, so it counts in byte offsets. The manual documents this, so it isn’t really a bug.
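
To illustrate the byte counting, here is a minimal sketch. The sample string is just an assumption for illustration; "é" is written as an escape so the encoding is unambiguous.

```lua
-- "é" is U+00E9, which UTF-8 encodes as two bytes (0xC3 0xA9).
local s = "h\u{E9}llo"

local byteLength = #s                -- the string library counts bytes
local codepointLength = utf8.len(s)  -- the utf8 library counts codepoints

-- Byte 2 alone is only the first half of "é", so it is not valid UTF-8 by itself.
local firstByteOfE = string.sub(s, 2, 2)

-- Bytes 2 through 3 together make up the whole character.
local wholeE = string.sub(s, 2, 3)
```

So string.sub is doing what it documents; it is just indexing bytes rather than characters.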

If you want a substring measured in codepoints, you can use utf8.offset to convert the codepoint range you want into byte offsets and pass those to string.sub. Here is an example of that:

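local function codepoint_sub(s,i,j)
	return string.sub(
		s,
		-- Byte offset where codepoint i starts; clamp when i is out of range.
		utf8.offset(s,i) or if i > 0 then #s+1 else 0,
		-- Byte offset where codepoint j ends (one byte before codepoint j+1 starts).
		-- An out-of-range negative j falls back to 1 so that the -1 gives 0 (empty result), not -1.
		if j == nil or j == -1 then #s else (utf8.offset(s,j+1) or if j > 0 then #s+1 else 1)-1
	)
end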

However, a single character may be encoded with multiple codepoints, in which case this function might split the character (e.g. codepoint_sub("z̀",1,1) would return z without its combining accent). If you want a substring measured in grapheme offsets instead, you could iterate through the string grapheme by grapheme with utf8.graphemes and concatenate the graphemes that fall inside the requested range.
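
A rough sketch of that grapheme approach might look like the following. It relies on Roblox's utf8.graphemes, which yields the first and last byte position of each grapheme cluster. The name grapheme_sub is made up for this example, and it only handles positive i and j, unlike string.sub.

```lua
-- Hypothetical helper: substring by grapheme index (1-based, inclusive).
-- Assumes i and j are positive; j may be nil to mean "until the end".
local function grapheme_sub(s, i, j)
	local parts = {}
	local index = 0
	-- utf8.graphemes (Roblox only) yields the byte range of each grapheme cluster.
	for first, last in utf8.graphemes(s) do
		index = index + 1
		if j ~= nil and index > j then
			break
		end
		if index >= i then
			table.insert(parts, string.sub(s, first, last))
		end
	end
	return table.concat(parts)
end

-- "z" followed by U+0300 (combining grave accent) is one grapheme made of two codepoints.
local accented = grapheme_sub("z\u{300}a", 1, 1)
```

Here accented keeps the combining accent attached to the z, which the codepoint version would drop.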

What would you use a utf8.sub function for and what do you expect it to do?

As for utf8.find, it isn’t very clear what you want from such a function. It would be helpful if you could provide examples of why these functions would be useful.
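
If all you want is for the returned positions to be codepoint indices rather than byte indices, one possible reading is sketched below. This is only a guess at the intended behaviour: codepoint_find is a hypothetical name, it does a plain-text search (no patterns), and it assumes both strings are valid UTF-8.

```lua
-- Hypothetical helper: plain-text find that takes and returns codepoint indices.
-- Assumes s and needle are valid UTF-8; init is an optional codepoint index.
local function codepoint_find(s, needle, init)
	-- Convert the starting codepoint index into a byte offset.
	local byteInit = utf8.offset(s, init or 1)
	if byteInit == nil then
		return nil
	end
	-- Fourth argument true disables pattern matching.
	local startByte, endByte = string.find(s, needle, byteInit, true)
	if startByte == nil then
		return nil
	end
	-- utf8.len(s, 1, n) counts the codepoints that start within bytes 1..n.
	local startIndex = utf8.len(s, 1, startByte - 1) + 1
	local endIndex = utf8.len(s, 1, endByte)
	return startIndex, endIndex
end

-- "w" is the 7th codepoint here, but string.find would report byte 8.
local startIndex, endIndex = codepoint_find("h\u{E9}llo w\u{F6}rld", "w\u{F6}")
```

Pattern matching is left out on purpose, since single-byte classes like "." could match half of a multi-byte character.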
