Multilingual Text Manipulation

For those who’d prefer, a version of this article is available on Medium

A few weeks ago, I released luau-character: an open-source Luau library on GitHub. If you work with text written in different languages, you may have encountered the same kind of issues that this library solves: case conversion and text classification.

Case Conversion

Recently I was working on a project and I used the string.lower and string.upper functions. I realized that those functions were very limiting.

As some of you might have guessed: I speak French (it’s my first language). When I’m working on something that changes some potentially-translated text into upper case with string.upper, I see some issues:

English word | French word | string.upper | Actual uppercase
corn | maïs | MAïS | MAÏS
shallot | échalote | éCHALOTE | ÉCHALOTE
blackberry | mûre | MûRE | MÛRE

As you can see, letters with accents aren’t properly updated with string.upper (and string.lower). This is because those functions only convert ASCII letters and leave every other byte untouched. Accented letters are not part of ASCII, and neither are the characters of many other languages.
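
A quick way to see this for yourself is to compare byte and character counts. The snippet below is only an illustration and assumes the file is saved as UTF-8 (the variable names are mine):

```lua
local word = "maïs"

-- string.upper only touches the ASCII letters; the two bytes of "ï" pass through unchanged
local shouted = string.upper(word)
print(shouted) --> MAïS

-- "ï" takes two bytes in UTF-8, so the byte length and the character count differ
print(#word) --> 5
print(utf8.len(word)) --> 4
```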

To support users from around the world, you need to work with Unicode, a much more universal standard that allows computers to represent text from almost every writing system.
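
To give a rough idea of what "working with Unicode" means here, this is a minimal sketch that walks a string one code point at a time instead of one byte at a time. It is not how luau-character is implemented: sketchUpper and the UPPER table are hypothetical names, the table covers only the three letters from the examples above, and the input is assumed to be valid UTF-8 with precomposed accented letters.

```lua
-- Hypothetical, deliberately tiny mapping: lowercase code point -> uppercase code point.
-- A real implementation needs the full Unicode case mapping data.
local UPPER = {
    [0xE9] = 0xC9, -- é -> É
    [0xEF] = 0xCF, -- ï -> Ï
    [0xFB] = 0xDB, -- û -> Û
}

local function sketchUpper(text)
    local out = {}
    -- utf8.codes yields each code point, however many bytes it takes
    for _, codepoint in utf8.codes(text) do
        if codepoint >= 0x61 and codepoint <= 0x7A then
            -- ASCII a-z: the uppercase letter is 32 code points lower
            codepoint = codepoint - 32
        else
            -- anything else: use the table, or keep the code point as-is
            codepoint = UPPER[codepoint] or codepoint
        end
        table.insert(out, utf8.char(codepoint))
    end
    return table.concat(out)
end

print(sketchUpper("maïs")) --> MAÏS
print(sketchUpper("échalote")) --> ÉCHALOTE
print(sketchUpper("mûre")) --> MÛRE
```

The point of the sketch is the shape of the loop. The hard part is the mapping data, which is exactly what a library is for.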

I created luau-character, which is a Luau library that includes two functions to replace string.lower and string.upper:

local character = require('@pkg/luau-character')

local baseText = "Maïs"

local up = character.toUppercase(baseText) --> "MAÏS"

local low = character.toLowercase(up) --> "maïs"

Text Classification

The luau-character library also offers several classification functions. Pass a string value to these functions and they will return true if all characters match a certain property.

Name | Description
isAscii | all characters are within the ASCII range
isAlphabetic | all characters are alphabetic
isNumeric | all characters are digits
isAlphaNumeric | all characters are either alphabetic or numeric
isLowercase | all characters are lowercase
isUppercase | all characters are uppercase
isControl | all characters are control characters (non-printing characters like tab or newline)
isWhitespace | all characters are spacing characters (spaces, tabs, etc.)
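
To illustrate the "all characters match" rule, here is a small self-contained sketch of an ASCII check. It is not the library's code: allCodepoints and sketchIsAscii are hypothetical names, and how the real functions treat edge cases such as an empty string is not something this sketch tries to reproduce.

```lua
-- Hypothetical helper: true only if the predicate holds for every code point.
-- utf8.codes raises an error if the text is not valid UTF-8.
local function allCodepoints(text, predicate)
    for _, codepoint in utf8.codes(text) do
        if not predicate(codepoint) then
            return false
        end
    end
    return true
end

-- ASCII covers code points 0 through 127 (0x7F)
local function sketchIsAscii(text)
    return allCodepoints(text, function(codepoint)
        return codepoint <= 0x7F
    end)
end

print(sketchIsAscii("corn")) --> true
print(sketchIsAscii("maïs")) --> false
```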

Installation

The project is hosted on GitHub. You can grab a Roblox model file attached to the latest release:

  1. Navigate to the GitHub releases page
  2. Scroll to the Assets section and download character.rbxm (a Roblox model file)
  3. Drag the file into Roblox Studio

The project is also published on the npm registry. You can add luau-character to your project dependencies with:

npm install luau-character
# or
yarn add luau-character

If you find any issues with the library, please open an issue on GitHub.

End Notes

This library is based on the Rust implementation of the unicode core library, ensuring accurate Unicode character classification and conversion. It’s particularly useful for projects that need to handle text from different languages properly, whether you’re building plugins or games.

I like to contribute to the Luau open-source ecosystem. If you appreciate this library or other things I have built, please consider leaving a tip :sparkling_heart: I have a page on ko-fi where you can contribute, and GitHub Sponsors is available on any of my projects. This has a direct impact on my ability to update existing projects and create new ones.

I also redid my website, which is live at seaofvoices.ca. You will find all my projects and articles there.


Great resource you’ve got there, but you can simplify the case conversion methods to:

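-- Decompose accented letters (NFD) so the base letter is plain ASCII,
-- convert the case with the string library, then recompose (NFC).
-- utf8.nfdnormalize and utf8.nfcnormalize are Roblox extensions to the utf8 library.
local function toUppercase(str: string): string
	return utf8.nfcnormalize(string.upper(utf8.nfdnormalize(str)))
end

local function toLowercase(str: string): string
	return utf8.nfcnormalize(string.lower(utf8.nfdnormalize(str)))
end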

I’ve always used this method for case conversion in my language too. I hope it works as well as yours.
