UTF-8 is a variable-width character encoding that represents each Unicode code point as a sequence of one to four bytes. A UTF-8 file may optionally begin with a byte order mark (BOM), the byte sequence 0xEF 0xBB 0xBF. Unlike UTF-16, UTF-8 has no byte-order ambiguity, so the BOM serves only as a signature, and the Unicode standard neither requires nor recommends it.
The Lua utf8 library (standard since Lua 5.3) never requires a BOM and has no BOM-related functions; if you need to strip or emit one, handle the three-byte string "\xEF\xBB\xBF" yourself.
You can use the utf8.char() function to build a UTF-8 string from code points. It takes one or more integer code points and returns a string in which each is encoded in UTF-8; the function works out the number of bytes itself. For example, to create the character ‘a’, you would use the following code:
local a = utf8.char(0x61)
The code point for ‘a’ is 0x61, and it encodes as a single byte.
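Since utf8.char() accepts several code points at once, a whole string can be built in one call; a small sketch (assuming Lua 5.3 or later, where utf8 is a standard global):

```lua
-- Build a string from several code points in one call (Lua 5.3+).
local s = utf8.char(0x61, 0x62, 0xE9)  -- 'a', 'b', 'é' (U+00E9)
print(s)   --> abé
print(#s)  --> 4: one byte each for 'a' and 'b', two bytes for 'é'
```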
You can also use the utf8.codes() function to traverse the code points of a UTF-8 string. It takes a string and returns an iterator for use in a generic for loop; each iteration produces the byte position of a character and its code point (it does not return a table). For example, to print the code points of ‘abc’, you would use the following code:
local abc = "abc"
for pos, cp in utf8.codes(abc) do
  print(pos, cp)
end
The code points for ‘abc’ are 0x61, 0x62, and 0x63, at byte positions 1, 2, and 3.
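Because the iterator also reports byte positions, utf8.codes() copes with multi-byte characters; a minimal sketch:

```lua
-- Collect positions and code points of a string with a multi-byte character.
local s = "a\u{E9}"                  -- 'a' (1 byte) followed by 'é' (2 bytes)
local positions, points = {}, {}
for pos, cp in utf8.codes(s) do
  positions[#positions + 1] = pos    -- byte index where the character starts
  points[#points + 1] = cp           -- the character's code point
end
-- positions is {1, 2}: 'é' starts at byte 2
-- points is {0x61, 0xE9}
```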
If you want to know the number of characters in a UTF-8 string, you can use the utf8.len() function. It takes a string (plus an optional byte range) and returns the number of code points, not bytes; for the byte count, use the # operator. On invalid input it returns nil and the position of the first invalid byte. For example, to count the characters in ‘abc’, you would use the following code:
local abc = "abc"
local numChars = utf8.len(abc)
The number of characters in ‘abc’ is 3, which here coincides with the byte count because each character occupies a single byte.
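The distinction between characters and bytes shows up as soon as a multi-byte character appears; a short sketch:

```lua
local s = "h\u{E9}llo"          -- "héllo": five characters, six bytes
print(utf8.len(s))              --> 5 characters
print(#s)                       --> 6 bytes ('é' takes two)
-- On invalid UTF-8, utf8.len reports the failure instead of raising:
local n, errpos = utf8.len("\xFF")
-- n is nil; errpos is 1, the byte where decoding failed
```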
You can also use the utf8.offset() function to find the byte offset at which a given character starts in a UTF-8 string. It takes two arguments: the string and a character index n (negative values count from the end of the string); it returns the byte position where the n-th character begins. For example, to get the byte offset of the first character of ‘abc’, you would use the following code:
local abc = "abc"
local offset = utf8.offset(abc, 1)
The first character of ‘abc’ starts at byte offset 1.
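utf8.offset() earns its keep on multi-byte strings, where character indices and byte indices diverge; combined with string.sub it lets you slice by characters rather than bytes:

```lua
local s = "h\u{E9}llo"     -- "héllo"
print(utf8.offset(s, 1))   --> 1: the first character starts at byte 1
print(utf8.offset(s, 3))   --> 4: the first 'l' starts after the 2-byte 'é'
print(utf8.offset(s, -1))  --> 6: negative indices count from the end
-- Slice out the second character by byte range:
local second = s:sub(utf8.offset(s, 2), utf8.offset(s, 3) - 1)
print(second)              --> é
```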
One thing the utf8 library does not provide is normalization. Normalization is the process of making sure all canonically equivalent strings are represented by the same sequence of code points. For example, “é” can be encoded either as the single code point U+00E9 or as “e” followed by the combining acute accent U+0301; normalization picks one of these representations consistently.
Unicode defines four normalization forms:
- NFC: Normalization Form Canonical Composition. This is the most commonly used form.
- NFD: Normalization Form Canonical Decomposition.
- NFKC: Normalization Form Compatibility Composition.
- NFKD: Normalization Form Compatibility Decomposition.
The standard utf8 library has no normalize function, so to normalize a string in Lua you need an external library, such as a binding to the utf8proc C library.
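To illustrate what composition does, the sketch below hand-rolls NFC for two specific combining pairs. A real normalizer needs the full Unicode composition tables (for instance via a utf8proc binding), so treat this strictly as a toy:

```lua
-- Toy NFC: compose a few known base + combining-mark pairs.
-- Only the pairs listed here are handled; this is NOT a real normalizer.
local compose = {
  ["e\u{0301}"] = "\u{E9}",  -- e + combining acute -> é
  ["a\u{0300}"] = "\u{E0}",  -- a + combining grave -> à
}

local function toy_nfc(s)
  for decomposed, composed in pairs(compose) do
    s = s:gsub(decomposed, composed)
  end
  return s
end

print(toy_nfc("cafe\u{0301}") == "caf\u{E9}")  --> true
```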
There are other functions in the utf8 library, such as utf8.codepoint() and the pattern utf8.charpattern. For more information, see the Lua Reference Manual; note that the utf8 library first appeared in Lua 5.3, so it is not documented in the Lua 5.1 manual.
Normalization is the process of making sure all canonically equivalent strings are represented in the same way. This matters because systems that compare strings byte by byte will not recognize different encodings of the same text as equal. For example, “é” can be encoded as the single precomposed code point U+00E9 or as “e” followed by the combining acute accent U+0301; normalizing to NFC turns both into U+00E9.
There are four normalization forms: NFC, NFD, NFKC, and NFKD. NFC is the most widely used.
NFC: Normalization Form Canonical Composition. Combining sequences are composed into precomposed characters where possible. For example, the string “a” followed by U+0300 (the combining grave accent) is composed into the single code point U+00E0 (“à”).
NFD: Normalization Form Canonical Decomposition. Precomposed characters are decomposed into a base character plus combining marks. For example, U+00E0 (“à”) is decomposed into “a” followed by U+0300.
NFKC: Normalization Form Compatibility Composition. Like NFC, but compatibility characters are first replaced by their compatibility equivalents. For example, “a” followed by U+0300 is composed into U+00E0, and the ligature U+FB01 (“ﬁ”) is replaced by the two letters “fi”.
NFKD: Normalization Form Compatibility Decomposition. Like NFD, but compatibility characters are replaced as well. For example, U+00E0 is decomposed into “a” followed by U+0300, and U+FB01 (“ﬁ”) becomes “f” followed by “i”.
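The difference between the composed and decomposed forms is directly visible in Lua, using the \u{} escape of Lua 5.3+:

```lua
local nfc = "\u{E0}"     -- 'à' as one precomposed code point (NFC)
local nfd = "a\u{0300}"  -- 'à' as base letter + combining grave (NFD)
print(nfc == nfd)                     --> false: different byte sequences
print(#nfc, #nfd)                     --> 2   3
print(utf8.len(nfc), utf8.len(nfd))   --> 1   2
```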
There are several reasons why you might want to normalize strings. For example, normalizing them before storing them in a database ensures that later lookups match no matter which form the input used.
Another reason is sorting. Unnormalized strings may not sort consistently: the precomposed “é” (U+00E9) and the decomposed “e” + U+0301 have different byte sequences, so a byte-wise sort can place them in different positions even though they represent the same text.
Yet another reason is comparison. Without normalization, the precomposed “é” (U+00E9) compares unequal to the decomposed “e” + U+0301, even though the two strings are canonically equivalent.
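The comparison pitfall is easy to reproduce: the two spellings of ‘é’ compare unequal until both are brought to the same form (here by hand-composing the one pair involved):

```lua
local precomposed = "\u{E9}"      -- é as a single code point
local decomposed  = "e\u{0301}"   -- e + combining acute accent
print(precomposed == decomposed)  --> false: comparison is byte-wise
-- Map the decomposed spelling onto the precomposed one, then compare again:
local composed = decomposed:gsub("e\u{0301}", "\u{E9}")
print(precomposed == composed)    --> true
```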
Normalization is important for many reasons. It is often used to ensure that strings are stored, sorted, and compared correctly.