"string" library incorrect behavior with UTF-8 characters

Reproduction Steps

Run this:

print(string.find('abc', 'bc')) -- 2 3
print(string.find('ábc', 'bc')) -- 3 4

Also:

print(string.sub('ábc', 2)) -- �bc
print(string.sub('ábc', 3)) -- bc

Expected Behavior

The results should be the same, regardless of whether the string contains UTF-8 characters or not.

Actual Behavior

The results are NOT the same when there are UTF-8 characters.

Issue Area: Engine
Issue Type: Other
Impact: High
Frequency: Constantly


string.find returns a byte offset into the string, and the character á is encoded with multiple bytes. This is consistent with the other string APIs, such as string.sub. The whole string library actually doesn’t know anything about Unicode; it treats strings as arrays of bytes.

To be more specific, the letter a is equivalent to string.char(0x61), while á is equivalent to string.char(0xc3, 0xa1).
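A quick way to check this yourself is to dump the bytes. This is only a minimal sketch, assuming Lua 5.3+ or Luau (for the \u{...} escape); the variable names are just for illustration.

```lua
-- "\u{E1}" is the precomposed character á (U+00E1), written as an escape
-- so the example does not depend on how this text itself is encoded.
local accented = "\u{E1}bc" -- "ábc"
local plain = "abc"

-- The length operator counts bytes, not characters.
print(#plain)    -- 3
print(#accented) -- 4

-- á occupies bytes 1 and 2, so "b" starts at byte 3.
print(string.byte(accented, 1, -1)) -- 195 161 98 99
print(accented == string.char(0xc3, 0xa1) .. "bc") -- true
```

That is why string.find reports 3 4 instead of 2 3 for the accented string.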

There is a utf8 library that provides functions that work with Unicode in mind.

For example, you can use utf8.len(), which will return the number of codepoints in the string instead of the number of bytes.
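As a rough illustration (a sketch assuming Lua 5.3+ or Luau, not an official example), utf8.len and utf8.codes show how codepoints map onto byte positions:

```lua
local s = "\u{E1}bc" -- "ábc"

print(#s)          -- 4 (bytes)
print(utf8.len(s)) -- 3 (codepoints)

-- utf8.codes yields the byte position and codepoint of each character,
-- which shows where the byte-based string functions would have to cut.
for bytePos, codepoint in utf8.codes(s) do
    print(bytePos, utf8.char(codepoint))
end
-- 1  á
-- 3  b
-- 4  c
```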

In short, the string library is buggy for UTF-8.
And there is no equivalent of string.find or string.sub in the utf8 library.
So, I’ll have to write functions to work around these bugs…

By the way: for me, à is 1 character, just like a, no matter how many technical explanations are given.

I don’t know how to create a similar function that overcomes this UTF-8 limitation for string.sub.
Could you help me?

print(utf8.len('à')) -- 1
print(utf8.len('a')) -- 1

Now what?

It depends on what you want to do.

  • utf8.graphemes can be used to break the string down on visual grouping boundaries so that you can perform string.sub and other operations (basically, it tells you the byte index at which each visual character in the string begins and ends).

  • utf8.nfcnormalize can make sure combining characters are represented in combined form so that string.find("a") will not find à (but string.find("b") would still find b̀, because there is no combined form of b-with-grave-accent, since that’s not a standard character).
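For the string.sub and string.find cases from the original post, something along these lines might work. It is only a sketch, assuming Lua 5.3+ or Luau: utf8sub and utf8find are hypothetical helper names, not library functions, and they are built on utf8.offset and utf8.len. They count codepoints rather than visual characters, so a base letter followed by a combining accent still counts as two; utf8.graphemes would be the tool for that case.

```lua
-- Hypothetical helper: like string.sub, but i and j count codepoints.
-- Only positive indices are handled, to keep the sketch short.
local function utf8sub(s, i, j)
    local len = utf8.len(s)
    if not len then
        error("invalid UTF-8 string")
    end
    i = math.max(i or 1, 1)
    j = math.min(j or len, len)
    if i > j then
        return ""
    end
    local first = utf8.offset(s, i)        -- byte where codepoint i starts
    local last = utf8.offset(s, j + 1) - 1 -- byte just before codepoint j + 1
    return string.sub(s, first, last)
end

-- Hypothetical helper: plain-text search (no patterns) that returns
-- codepoint indices instead of byte offsets.
local function utf8find(s, needle)
    local first, last = string.find(s, needle, 1, true)
    if not first then
        return nil
    end
    -- utf8.len(s, 1, k) counts the codepoints that start in bytes 1..k.
    return utf8.len(s, 1, first), utf8.len(s, 1, last)
end

local accented = "\u{E1}bc" -- "ábc"
print(utf8find("abc", "bc"))    -- 2 3
print(utf8find(accented, "bc")) -- 2 3
print(utf8sub(accented, 2))     -- bc
```

The search is deliberately plain: pattern items such as "." match single bytes, so a pattern match could start or end in the middle of a character.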


I managed to make a breakthrough with utf8.graphemes, but I still don’t understand utf8.nfcnormalize;
Could you give an example?

The Wikipedia page explains it well:

Basically, there are multiple ways to form the character: a basic “a” character + a combining accent character, or a single “a+accent” character. NFC (composed) / NFD (decomposed) normalization converts as many parts of the string as possible to combined / uncombined form, respectively.

  • Combined form is more useful if you want find operations and the like to work naturally, because there is no basic “a” character anywhere in the string anymore.

  • Split form is more useful if you want to do something like strip out all of the accents to get plain ASCII text.
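To make the two forms concrete, here is a small sketch, assuming Lua 5.3+ or Luau. Both forms are written out by hand with escapes rather than produced by utf8.nfcnormalize / utf8.nfdnormalize, so it also runs outside Roblox. stripCombiningMarks is a hypothetical helper and only covers the basic Combining Diacritical Marks block (U+0300 to U+036F).

```lua
local composed = "\u{E0}"     -- à as one codepoint (U+00E0)
local decomposed = "a\u{300}" -- a + combining grave accent (U+0300)

print(composed == decomposed) -- false: same look, different bytes
print(utf8.len(composed))     -- 1
print(utf8.len(decomposed))   -- 2

-- A plain search for "a" only hits the decomposed form.
print(string.find(composed, "a", 1, true))   -- nil
print(string.find(decomposed, "a", 1, true)) -- 1 1

-- Hypothetical helper: drop combining marks from decomposed text.
local function stripCombiningMarks(s)
    local out = {}
    for _, codepoint in utf8.codes(s) do
        if codepoint < 0x300 or codepoint > 0x36F then
            out[#out + 1] = utf8.char(codepoint)
        end
    end
    return table.concat(out)
end

print(stripCombiningMarks(decomposed)) -- a
```

Stripping only works on the split form; in the combined string there is no separate accent codepoint to remove.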

OK.
As for this bug, it is incomprehensible that it exists, since Roblox is multilingual and writing in languages that use accents is common.
I will be spending some hours of my work to get around this bug.

There’s no other reasonable path unfortunately.

Making the behavior of the string operations significantly different from standard Lua would cause a ton of issues for code portability. If you have suggestions for extending the utf8 library with additional functions, that is possible, though.
