As a Roblox developer, it is currently impossible to get correct results from string.find and string.sub when the string contains multi-byte UTF-8 characters, as already reported here:
Since you won’t be able to change the string library, create the same functions in the utf8 library, to fix these bugs:
What results do you want? The string library assumes a one-byte character encoding, so it counts in byte offsets (as the manual states, this isn’t really a bug).
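To make the byte counting concrete, here is a small sketch. It writes the characters as `\u{...}` escapes so that each one is guaranteed to be a single precomposed code point encoded as two bytes; the variable name `s` is only for illustration.

```lua
-- "áéíó" written with escapes: four code points, two bytes each in UTF-8
local s = "\u{E1}\u{E9}\u{ED}\u{F3}"

print(#s)                                -- 8: the length operator counts bytes
print(utf8.len(s))                       -- 4: utf8.len counts code points
print(#string.sub(s, 1, 1))              -- 1: only the first byte of "á"
print(string.find(s, "\u{E9}", 1, true)) -- 3 4: byte offsets of "é"
```

The single byte returned by `string.sub(s, 1, 1)` is not valid UTF-8 on its own, which is why the results look wrong when they are read as character positions.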
If you want to get a substring of codepoints, then that can be done with utf8.offset to get the byte offsets of the range you want. Here is an example of that:
local function codepoint_sub(s, i, j)
    return string.sub(
        s,
        utf8.offset(s, i) or if i > 0 then #s + 1 else 0,
        if j == nil or j == -1 then #s else (utf8.offset(s, j + 1) or if j > 0 then #s + 1 else 0) - 1
    )
end
However, a character may be encoded with multiple codepoints (such as z̀) in which case this function might split the character (e.g. codepoint_sub("z̀",1,1) would get z). If getting a substring with grapheme offsets is desired, that could be done by iterating through the string with utf8.graphemes grapheme by grapheme and concatenating all of the graphemes together to get the result.
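A short sketch of the difference between code points and graphemes. It assumes the Roblox `utf8.graphemes` extension, which is not part of standard Lua, and `count_graphemes` is a made-up helper name used only for illustration.

```lua
-- "z" followed by U+0300 COMBINING GRAVE ACCENT: one visible character, two code points
local s = "z\u{300}"

-- Hypothetical helper: counts grapheme clusters by walking utf8.graphemes.
local function count_graphemes(str)
    local n = 0
    for _first, _last in utf8.graphemes(str) do
        n = n + 1
    end
    return n
end

print(#s)                 -- 3 bytes
print(utf8.len(s))        -- 2 code points
print(count_graphemes(s)) -- 1 grapheme

-- Cutting just before the second code point splits the cluster and leaves a bare "z"
local second = utf8.offset(s, 2)
print(string.sub(s, 1, second - 1)) -- z
```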
What would you use a utf8.sub function for and what do you expect it to do?
As for utf8.find, it isn’t very clear what you want from such a function. It would be helpful if you could provide examples of why these functions would be useful.
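One possible reading of utf8.find, purely as an assumption since the request does not spell it out, is a plain-text search that reports code point positions instead of byte offsets. A minimal sketch under that assumption follows; `codepoint_find` is a hypothetical name rather than an existing API, and it expects valid UTF-8 and a non-empty search string.

```lua
-- Hypothetical helper: plain-text find that returns code point indices.
local function codepoint_find(s, needle, init)
    -- Convert the starting code point index to a byte offset (default: start of string).
    local startByte = utf8.offset(s, init or 1)
    if not startByte then
        return nil
    end
    -- Plain search (no patterns); string.find returns byte offsets.
    local i, j = string.find(s, needle, startByte, true)
    if not i then
        return nil
    end
    -- utf8.len(s, 1, n) counts the code points that start at or before byte n.
    return utf8.len(s, 1, i), utf8.len(s, 1, j)
end

print(codepoint_find("\u{E1}\u{E9}\u{ED}\u{F3}", "\u{ED}")) -- 3 3
print(codepoint_find("abc\u{F3}", "c\u{F3}"))               -- 3 4
```

Like codepoint_sub, this counts code points, so a match can still begin or end in the middle of a grapheme.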
function SubStrUTF8(str: string, IPos: number, FPos: number?): string
    local ct, i, j = 0
    for StartByte, EndByte in utf8.graphemes(str) do
        ct += 1
        if ct == IPos then i = StartByte end
        if ct == FPos then j = EndByte break end
    end
    if not i then return "" end
    return string.sub(str, i, j)
end
not the best implementation but it gets the job done
SubStrUTF8("áéíó", 1, 1) → "á"
SubStrUTF8("áéíó", 1) → "áéíó"
SubStrUTF8("áéíó", 1, 3) → "áéí"
SubStrUTF8("abcó", 3, 4) → "có"
I’m using it to truncate user provided strings at some grapheme count
edit: this is probably a better implementation but i haven't tested it too much
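function SubStrUTF8(str: string, IPos: number, FPos: number?): string
    -- Note: utf8.offset counts code points, not graphemes,
    -- so a combining sequence such as z̀ can still be split.
    local i = utf8.offset(str, IPos)
    if not i then return "" end
    if not FPos then return string.sub(str, i) end
    -- Byte just before the start of code point FPos + 1 is the last byte of code point FPos
    local j = utf8.offset(str, FPos + 1)
    if j then j = j - 1 end
    return string.sub(str, i, j)
end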