UTF-8 Lua libraries now available!

The Lua UTF-8 library ported from Lua 5.3 is now enabled and ready for use!

You can check out the documentation for this library from Lua 5.3 here:
https://www.lua.org/manual/5.3/manual.html#6.5

This library helps you deal with strings at a Unicode codepoint level. In many cases you will want to use these utf8.X functions in place of string.X functions.

Keep in mind: where the documentation says “character”, what it really means is “codepoint”. When Unicode is involved there are many possible definitions of a “character”:

  • A byte: An 8-bit number value. Strings in Lua are just arbitrary sequences of bytes, so for example, string.len returns the length of the string in bytes.
  • A code unit: The smallest unit that a text encoding uses. In UTF-8, which Roblox uses, this is an 8-bit byte. For comparison, JavaScript and C# use UTF-16, where the code unit is 16 bits (two bytes).
  • A codepoint: A single Unicode value, such as U+00E0 (“à”). In UTF-8 one codepoint is encoded as between 1 and 4 bytes.
  • A grapheme cluster: What most people think of as a “character”, a fully composed visual unit, consisting of one or more (unlimited) codepoints. Roblox does not support parsing strings as grapheme clusters just yet, but we plan to in the future. This is a complex issue, and the correct answer to “what is a grapheme cluster” can vary depending on locale.

For example: The family emoji “👨‍👩‍👧‍👦” is 1 grapheme cluster, composed of 7 codepoints (utf8.len), and takes 25 bytes to encode in UTF-8 (string.len).
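To make those numbers concrete, here is a small sketch that builds the family emoji from escape sequences and counts it both ways. It assumes the emoji is the usual man, woman, girl, boy sequence joined by zero-width joiners (U+200D); the variable name is just for illustration.

```lua
-- Four emoji codepoints (4 bytes each) joined by three zero-width joiners (3 bytes each).
local family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

print(#family, utf8.len(family)) -- prints "25 7"

-- List the individual codepoints that make up the single visible glyph.
for position, codepoint in utf8.codes(family) do
    print(position, string.format("U+%04X", codepoint))
end
```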

Some code examples:

local hi = "hello"
print(#hi, string.len(hi), utf8.len(hi)) -- prints "5 5 5"
local hiRussia = "Привет, это русский текст."
print(#hiRussia, string.len(hiRussia), utf8.len(hiRussia)) -- prints "47 47 26"
local hiDad = string.sub("Ciao papà!", 10)
print(#hiDad, string.len(hiDad), utf8.len(hiDad)) -- prints "2 2 nil 1" (utf8.len for invalid UTF-8 returns "nil" and index of the first invalid byte)

Notice that for plain ASCII English text the grapheme, codepoint, and byte counts are all the same, but this is not true for any text that contains characters outside the ASCII range!

If you accidentally slice a string incorrectly and cut a codepoint in the middle, you will end up with an invalid UTF-8 string. Anything you append after that point may be truncated and not displayed. It’s best to avoid truncating or slicing strings in general, but if you must, use utf8.offset to get a safe index to use for string.sub.
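As a rough sketch of that advice, the helper below keeps at most a given number of codepoints by asking utf8.offset where the next codepoint would start. The name truncateCodepoints is made up for this example, and it assumes the input is already valid UTF-8.

```lua
-- Hypothetical helper: keep at most maxCodepoints codepoints of text.
local function truncateCodepoints(text, maxCodepoints)
    -- Byte index where codepoint number (maxCodepoints + 1) begins,
    -- or nil if the string is shorter than that.
    local nextStart = utf8.offset(text, maxCodepoints + 1)
    if nextStart == nil then
        return text
    end
    return string.sub(text, 1, nextStart - 1)
end

local greeting = "Ciao papà!"
local safe = truncateCodepoints(greeting, 9)
print(safe, #safe, utf8.len(safe)) -- prints "Ciao papà 10 9"
```

Compare this with string.sub(greeting, 1, 9), which stops after the first byte of “à” and leaves an invalid string.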

Hopefully these utilities will be of some small help in dealing with international text going forward! More to come!

When should I use these?

You should use these functions anywhere you manipulate text that you didn’t write yourself or that may contain non-ASCII characters. If you truncate a string at a byte index that falls inside a codepoint, you will end up with an invalid UTF-8 string that may render incorrectly or fail to be stored in a DataStore.

If you are truncating a string at an index you should use string.sub with a byte index given by utf8.offset.

If you are implementing a typewriter effect you should use utf8.codes to iterate over the codepoints in the string as opposed to just the raw bytes, otherwise you will end up with odd and irregular behavior when multi-byte characters are being appended byte by byte.
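A minimal sketch of the typewriter idea: build the list of strings to show, growing by one whole codepoint per step. The function name typewriterFrames is invented for this example, and utf8.codes will raise an error if the text is not valid UTF-8.

```lua
-- Hypothetical helper: returns every intermediate string of a typewriter
-- effect, adding one codepoint (not one byte) per frame.
local function typewriterFrames(text)
    local frames = {}
    local shown = ""
    for _, codepoint in utf8.codes(text) do
        shown = shown .. utf8.char(codepoint)
        frames[#frames + 1] = shown
    end
    return frames
end

local frames = typewriterFrames("papà!")
for index, frame in ipairs(frames) do
    print(index, frame)
end
```

In a Roblox game you would presumably assign each frame to a TextLabel's Text with a short wait in between; that part is left out so the sketch also runs in plain Lua 5.3. Note that this still steps by codepoint, so an emoji made of several codepoints will appear in stages.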

Length limits

utf8.len isn’t a particularly meaningful length check, unless you want to implement a Twitter-style arbitrary length limit, in which case it’s exactly what you want!

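If you care about storage you should use a byte length check like the # operator or string.len. If you care about DataStore encoded length, consider using the following, where text is the string you want to measure (the -4 removes the [""] wrapper that the JSON array adds around it):

#game:GetService("HttpService"):JSONEncode({text})-4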
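If a string turns out to be over a byte budget, one possible way to trim it without splitting a codepoint is to ask utf8.offset (with n = 0) for the start of the codepoint that contains the first byte you need to drop. This is only a sketch: truncateBytes is a made-up name, it measures raw bytes rather than JSON-encoded length, and it assumes valid UTF-8 input.

```lua
-- Hypothetical helper: keep at most maxBytes bytes of text, cutting only
-- on a codepoint boundary.
local function truncateBytes(text, maxBytes)
    if #text <= maxBytes then
        return text
    end
    -- Start of the codepoint containing the first byte that does not fit.
    local boundary = utf8.offset(text, 0, maxBytes + 1)
    return string.sub(text, 1, boundary - 1)
end

print(truncateBytes("Ciao papà!", 9))  -- prints "Ciao pap" (the 2-byte "à" does not fit)
print(truncateBytes("Ciao papà!", 10)) -- prints "Ciao papà"
```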

If your concern is a visual space constraint, consider using a text extents check like TextService:GetTextSize if possible. Neither codepoint nor character counts are very meaningful for this use case.


Does anyone know if gsub/gmatch/etc work with UTF-8 and patterns? If not, is there a UTF-8 enabled version?

gsub/gmatch work fine with UTF-8 strings. Character classes like %a won’t recognize letters outside of the ASCII range though. There’s not a Unicode-smart version yet, unfortunately.
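One thing that does help a little: the Lua 5.3 utf8 library includes utf8.charpattern, a pattern that matches exactly one UTF-8 byte sequence in a valid string. It does not make %a Unicode-aware, but it lets gmatch walk a string one codepoint at a time. A small sketch, with splitCodepoints as a made-up helper name:

```lua
-- Hypothetical helper: split valid UTF-8 text into one string per codepoint.
local function splitCodepoints(text)
    local pieces = {}
    for piece in string.gmatch(text, utf8.charpattern) do
        pieces[#pieces + 1] = piece
    end
    return pieces
end

local pieces = splitCodepoints("Привет")
print(#pieces, pieces[1], pieces[6]) -- prints "6 П т"
```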


Dang, that was the part I was hoping would work :frowning:
