"string" library incorrect behavior with UTF-8 characters

rogeriodec_games · September 23, 2022, 5:59pm

Reproduction Steps

Run this:

print(string.find('abc', 'bc')) -- 2 3
print(string.find('ábc', 'bc')) -- 3 4

Also:

print(string.sub('ábc', 2)) -- �bc
print(string.sub('ábc', 3)) -- bc

Expected Behavior

The results should be the same, regardless of whether the string contains UTF-8 characters.

Actual Behavior

The results are NOT the same when the string contains UTF-8 characters.

Issue Area: Engine
Issue Type: Other
Impact: High
Frequency: Constantly

Tiffblocks · September 23, 2022, 6:16pm

string.find returns a byte offset into the string, and the character á is encoded with multiple bytes. This is consistent with the other string APIs like string.sub, etc. The whole string library actually doesn’t know anything about Unicode, it treats strings like they’re an array of bytes.

To be more specific, the letter a is equivalent to string.char(0x61), while á is equivalent to string.char(0xc3, 0xa1).
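The byte-level view can be checked directly. A minimal sketch, which builds the string from explicit bytes so it does not depend on how the script file is encoded (the variable name s is only for illustration):

```lua
-- 'ábc' built from its raw bytes: á is 0xc3 0xa1, then plain ASCII "bc".
local s = string.char(0xc3, 0xa1) .. "bc"

print(#s)                             -- 4: the length is counted in bytes
print(string.byte(s, 1, -1))          -- 195 161 98 99
print(string.find(s, "bc", 1, true))  -- 3 4: byte offsets, not character positions
```

This is also why string.sub('ábc', 2) starts with a broken character: it begins at the second byte of á.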

There is a utf8 library that provides functions that work with Unicode in mind.

For example, you can use utf8.len(), which will return the number of codepoints in the string instead of the number of bytes.
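As a rough sketch of how codepoint-based indexing could be layered on top of utf8.len and utf8.offset: utf8_sub and utf8_find below are hypothetical helper names, not part of any library. Both assume the input is valid UTF-8, and both count codepoints, which is not always the same as visible characters.

```lua
-- Hypothetical helper: substring by codepoint positions (1-based, inclusive).
local function utf8_sub(s, i, j)
    local len = utf8.len(s)
    if not len then error("invalid UTF-8") end
    j = j or len
    if j > len then j = len end
    if i < 1 then i = 1 end
    if i > j then return "" end
    local firstByte = utf8.offset(s, i)         -- byte where codepoint i starts
    local lastByte = utf8.offset(s, j + 1) - 1  -- byte just before codepoint j + 1
    return string.sub(s, firstByte, lastByte)
end

-- Hypothetical helper: like string.find, but returns codepoint positions.
-- utf8.len(s, 1, n) counts the codepoints that start at or before byte n.
local function utf8_find(s, pattern, plain)
    local firstByte, lastByte = string.find(s, pattern, 1, plain)
    if not firstByte then return nil end
    return utf8.len(s, 1, firstByte), utf8.len(s, 1, lastByte)
end

local s = string.char(0xc3, 0xa1) .. "bc"  -- 'ábc'
print(utf8.len(s), #s)            -- 3 codepoints, 4 bytes
print(utf8_sub(s, 2))             -- bc
print(utf8_find(s, "bc", true))   -- 2 3
```

Note that a byte-oriented pattern such as "." can still match part of a multi-byte character, so utf8_find is safest with plain searches.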

rogeriodec_games · September 23, 2022, 6:23pm

In short, the string library is buggy for UTF-8.
And there is no equivalent of string.find or string.sub in the utf8 library.
So, I’ll have to create functions to overcome these bugs…

By the way: for me, à is 1 character, just like a. It doesn’t matter how many technical explanations are given.

rogeriodec_games · September 23, 2022, 6:35pm

I don’t know how to create a similar function that overcomes this UTF-8 limitation for string.sub.
Could you help me?

rogeriodec_games · September 23, 2022, 6:38pm

print(utf8.len('à')) -- 1
print(utf8.len('a')) -- 1

Now what?

tnavarts · September 23, 2022, 7:39pm

It depends what you want to do.

utf8.graphemes can be used to break the string down on visual grouping boundaries so that you can perform string.sub and other operations (basically tells you what index each visual character in the string begins and ends at).
utf8.nfcnormalize can make sure combining characters are represented in combined form so that string.find("a") will not find à (but string.find("b") would still find b̀ because there is no combined form of b-with-accent-grave since that’s not a standard character).
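A minimal sketch of the utf8.graphemes approach, assuming the Roblox behavior where the iterator yields the first and last byte index of each grapheme cluster. grapheme_sub is a hypothetical helper name, and utf8.graphemes is not available in stock Lua:

```lua
-- Hypothetical helper: substring by grapheme positions (1-based, inclusive).
-- Relies on utf8.graphemes, which Roblox's Luau provides but stock Lua does not.
local function grapheme_sub(s, i, j)
    local firstByte, lastByte
    local n = 0
    for first, last in utf8.graphemes(s) do
        n = n + 1
        if n == i then firstByte = first end   -- grapheme i starts here
        if n >= i then lastByte = last end     -- extend the end while in range
        if j and n == j then break end         -- stop once grapheme j is included
    end
    if not firstByte then return "" end
    return string.sub(s, firstByte, lastByte)
end
```

On Roblox, grapheme_sub('ábc', 2) should then give bc, whether á is stored as one codepoint or as a plus a combining accent.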

rogeriodec_games · September 23, 2022, 9:11pm

I managed to make a breakthrough with utf8.graphemes, but I still don’t understand utf8.nfcnormalize;
Could you give an example?

tnavarts · September 23, 2022, 9:17pm

The wikipedia page explains it well:

Basically, there are multiple ways to form the character: a basic “a” character + a combining accent character, or a single “a+accent” character. NFC (combined) / NFD (decomposed) normalization converts as many parts of the string as possible to combined / uncombined form, respectively.

Combined form is more useful if you want find operations and what not to work naturally, because there is no basic “a” character anywhere in the string anymore.
Split form is more useful if you want to do something like strip out all of the accents to get plain ASCII text.
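To make the two forms concrete, a small sketch that builds both by hand with utf8.char. The utf8.nfcnormalize call is guarded because it exists on Roblox but not in stock Lua, and strip_combining_marks is a hypothetical helper that only handles the U+0300 to U+036F combining block and expects input that is already in split form:

```lua
local composed = utf8.char(0xE0)             -- "à" as a single codepoint (U+00E0)
local decomposed = "a" .. utf8.char(0x0300)  -- "a" + combining grave accent (U+0300)

print(composed == decomposed)                    -- false: the bytes differ
print(utf8.len(composed), utf8.len(decomposed))  -- 1 2
print(string.find(composed, "a", 1, true))       -- nil: no plain "a" byte present
print(string.find(decomposed, "a", 1, true))     -- 1 1

-- Roblox only: NFC should turn the split form into the combined one.
if utf8.nfcnormalize then
    print(utf8.nfcnormalize(decomposed) == composed)
end

-- Hypothetical helper: drop combining marks in U+0300..U+036F.
-- On Roblox, the input could first be passed through utf8.nfdnormalize.
local function strip_combining_marks(s)
    local out = {}
    for _, cp in utf8.codes(s) do
        if cp < 0x0300 or cp > 0x036F then
            out[#out + 1] = utf8.char(cp)
        end
    end
    return table.concat(out)
end

print(strip_combining_marks(decomposed .. "bc"))  -- abc
```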

rogeriodec_games · September 23, 2022, 9:22pm

OK.
As for this bug, it is incomprehensible that this exists, since Roblox is multilingual and writing in languages that use accents is common.
I will be spending some hours of my work to overcome this bug.

tnavarts · September 23, 2022, 9:34pm

There’s no other reasonable path unfortunately.

Making the behavior of the string operations significantly different from standard Lua would cause a ton of issues for code portability. If you have suggestions for extending the utf8 library with additional functions, that is possible though.

rogeriodec_games · September 23, 2022, 9:44pm
