Multilingual Text Manipulation
For those who’d prefer, a version of this article is available on Medium
A few weeks ago, I released luau-character: an open-source Luau library on GitHub. If you work with text written in different languages, you may have encountered the same kind of issues that this library solves: case conversion and text classification.
Case Conversion
Recently I was working on a project and I used the string.lower and string.upper functions. I realized that those functions were very limiting.
As some of you might have guessed: I speak French (it’s my first language). When I’m working on something that changes some potentially-translated text into upper case with string.upper, I see some issues:
| English word | French word | string.upper | Actual uppercase |
|---|---|---|---|
| corn | maïs | MAïS | MAÏS |
| shallot | échalote | éCHALOTE | ÉCHALOTE |
| blackberry | mûre | MûRE | MÛRE |
As you can see, letters with accents aren’t converted by string.upper (or string.lower). These functions work on individual bytes and only convert the letters of the ASCII character set, while accented letters fall outside ASCII. Many languages have characters that aren’t in ASCII.
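Here is a small sketch of what happens under the hood, using only the built-in string and utf8 libraries. The "ï" is written with the `\u{EF}` escape so the result does not depend on how the source file is saved, and the variable names are only for illustration. The printed values assume the default C locale.

```lua
-- "maïs", with ï written as the code point U+00EF
local word = "ma\u{EF}s"

-- string.upper only converts the ASCII letters; the bytes of "ï" are left as-is
local shouted = string.upper(word)
print(shouted) --> MAïS

-- the string holds 4 characters, but takes 5 bytes once encoded in UTF-8
print(utf8.len(word)) --> 4
print(#word) --> 5
```

The difference between the last two numbers is the reason byte-oriented functions cannot handle "ï" as a single letter.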
To support users from around the world, you need to work with the Unicode encoding, a much more universal standard that allows computers to represent text from almost every writing system.
I created luau-character, which is a Luau library that includes two functions to replace string.lower and string.upper:
```lua
local character = require('@pkg/luau-character')

local baseText = "Maïs"
local up = character.toUppercase(baseText) --> "MAÏS"
local low = character.toLowercase(up) --> "maïs"
```
Text Classification
The luau-character library also offers several classification functions. Pass a string value to these functions and they will return true if all characters match a certain property.
| Name | Description |
|---|---|
| isAscii | all characters are within the ASCII range |
| isAlphabetic | all characters are alphabetic |
| isNumeric | all characters are digits |
| isAlphaNumeric | all characters are either alphabetic or numeric |
| isLowercase | all characters are lowercase |
| isUppercase | all characters are uppercase |
| isControl | all characters are control characters (non-printing characters like tab or newline) |
| isWhitespace | all characters are spacing characters (spaces, tabs, etc.) |
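To illustrate what "all characters match" means, here is a rough sketch of an ASCII check written with the built-in utf8 library. This is not how luau-character is implemented, and `isAsciiSketch` is a hypothetical name used only for this example.

```lua
-- Hypothetical helper, for illustration only (not part of luau-character)
local function isAsciiSketch(text)
    -- utf8.codes iterates over each code point of a valid UTF-8 string
    for _, codepoint in utf8.codes(text) do
        if codepoint > 127 then
            -- a single character outside the ASCII range is enough to fail
            return false
        end
    end
    return true
end

print(isAsciiSketch("corn")) --> true
print(isAsciiSketch("ma\u{EF}s")) --> false ("maïs" contains ï, U+00EF)
```

The other classification functions presumably follow the same pattern with a different per-character test, but the Unicode property lookups they need are the part the library provides.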
Installation
The project is hosted on GitHub. You can grab a Roblox model file attached to the latest release:
- Navigate to the GitHub releases
- Scroll to the Assets section and download the character.rbxm (Roblox model file)
- Drag the file into Roblox Studio
The project is also published on the npm registry. You can add luau-character to your project dependencies with:
```shell
npm install luau-character
# or
yarn add luau-character
```
If you find any issues with the library, please open an issue on GitHub.
End Notes
This library is based on the Unicode implementation in the Rust core library, ensuring accurate Unicode character classification and conversion. It’s particularly useful for projects that need to handle text from different languages properly, whether you’re building plugins or games.
I like to contribute to the Luau open-source ecosystem. If you appreciate this library or other things I have built, please consider leaving a tip.
I have a page on Ko-fi where you can contribute, and GitHub Sponsors is available on any of my projects. This has a direct impact on my ability to update existing projects and create new ones.
I also redid my website, which is live at seaofvoices.ca. You will find all my projects and articles there.
Useful Links
- github.com/seaofvoices/luau-character
- All my projects and articles on seaofvoices.ca
- ko-fi.com/seaofvoices
- Sea of Voices GitHub
A few other articles I wrote: