How to get Cyrillic Bytedata? utf8 does not support it

Whilst messing around with string.byte and string.char, I noticed that to upper-case a letter of the English alphabet, you just have to do this:

string.char(string.byte('a')-32)

I tried it with Cyrillic just to see what happened, so I did:

string.char(string.byte('щ')-32)

But it returned an unsupported character.

So I dug in deeper and realised this:
The bytedata of Щ and щ are apart by 1.
But other cyrillic characters like: а have the same bytedata, which is strange…

Is it because utf8 doesn’t support cyrillic? If so then how can we write in it?
The bytedata of almost all of the characters is: 208. If I do string.char(208), it returns an unsupported character.


utf8 supports cyrillic: Unicode/UTF-8-character table - starting from code position 0400

But “supporting” a language is not the same as guaranteeing that you can convert between upper/lower case by adding/subtracting 32 to the byte code of a letter. That’s not a good assumption to use in general, it’s more of a fun hacky way of doing it if you know you’ll only be working with a-zA-Z.

Oh, and Cyrillic characters aren’t encoded using a single byte. Compare the UTF-8 (hex) column to that of the “basic latin” block: Unicode/UTF-8-character table
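To make the multi-byte point concrete, here is a minimal sketch that dumps the raw bytes of a few letters. It assumes the script file itself is saved as UTF-8, so the string literals contain the UTF-8 bytes.

```lua
-- Assumption: this source file is saved as UTF-8.
local small = "щ"
local capital = "Щ"

-- # counts bytes, not characters: each of these letters takes 2 bytes.
print(#small, #capital)            -- 2   2

-- string.byte with a range returns every byte in the string.
print(string.byte(small, 1, -1))   -- 209 137  (hex d1 89)
print(string.byte(capital, 1, -1)) -- 208 169  (hex d0 a9)
print(string.byte("а", 1, -1))     -- 208 176  (Cyrillic а, hex d0 b0)

-- string.byte(s) without a range only returns the first byte, which is
-- why many Cyrillic letters appear to share the value 208.
print(string.byte(capital))        -- 208
```

A lone byte such as string.char(208) is only the first half of a two-byte sequence, so it is not valid UTF-8 on its own and shows up as an unsupported character.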

Oh and another thing, the string library works on individual bytes, so it isn’t UTF-8 aware! You should use the utf8 library instead: utf8. You’ll notice that there’s no utf8.byte function, which makes sense because a character is generally not represented by a single byte but by a code point, which UTF-8 encodes as a sequence of one to four bytes.
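As a rough illustration of the utf8 library, the sketch below converts between characters and code points. It assumes an environment that ships the utf8 library (Lua 5.3 or later, or Luau) and a UTF-8 encoded source file.

```lua
-- Assumption: utf8 library available (Lua 5.3+ or Luau), source saved as UTF-8.
local small = "щ"

-- utf8.codepoint returns the code point (a number), not the raw bytes.
local cp = utf8.codepoint(small)
print(cp)                  -- 1097 (U+0449)

-- utf8.char goes the other way: code point -> UTF-8 encoded string.
print(utf8.char(cp))       -- щ

-- utf8.len counts characters, whereas # counts bytes.
print(utf8.len(small), #small)  -- 1   2
```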

AND ANOTHER ONE:

About Щ and щ being apart by 1: doesn’t seem so? Here’s the data I got from the table I linked:

щ d1 89 CYRILLIC SMALL LETTER SHCHA
Щ d0 a9 CYRILLIC CAPITAL LETTER SHCHA

Although I don’t speak or read any languages that use Cyrillic, apologies if I’m mistaken :sweat_smile:

Curiously, the code points of those two letters (U+0429 and U+0449) are offset from each other by 32, so that trick might actually work if you apply it to code points rather than bytes. How many Cyrillic letters are there? If it doesn’t work for all of them, you can always build look-up tables by hand for the conversion, which might actually be the best/most robust approach. It seems there isn’t a general utf8 case conversion, at least not like string.upper/string.lower.
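Here is one way the offset idea and a look-up table could be combined, as a sketch rather than a complete solution. cyrillicUpper and SPECIAL_UPPER are hypothetical names made up for this example; it assumes the utf8 library (Lua 5.3+ or Luau) and only covers the basic block а..я plus ё, so letters specific to other Cyrillic alphabets would need their own table entries.

```lua
-- Hypothetical helper, not part of any standard library.
-- Letters whose upper-case form is NOT 32 code points away go in this table.
local SPECIAL_UPPER = {
    [0x0451] = 0x0401, -- ё -> Ё
}

local function cyrillicUpper(s)
    local out = {}
    -- utf8.codes iterates over (byte position, code point) pairs.
    for _, cp in utf8.codes(s) do
        local upper = cp
        if cp >= 0x0430 and cp <= 0x044F then
            upper = cp - 32            -- а..я -> А..Я
        elseif SPECIAL_UPPER[cp] then
            upper = SPECIAL_UPPER[cp]  -- exceptions from the look-up table
        end
        out[#out + 1] = utf8.char(upper)
    end
    return table.concat(out)
end

print(cyrillicUpper("щ"))   -- Щ
print(cyrillicUpper("ёж"))  -- ЁЖ
```

Anything outside the handled ranges (Latin letters, digits, other scripts) is passed through unchanged.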


We have to take into account the many languages that use Cyrillic.

Russian has 33 letters, and so does Ukrainian, but Ukrainian has some letters replaced.
The Montenegrin alphabet is quite complex, whilst Serbian is quite similar. Kazakh and Mongolian also have their own variations, and Bulgarian pretty much just merges 2 letters from Russian, ‘Ы’ and ‘И’, into И.

So there are probably at most around 55-75 different Cyrillic letters across all the different systems. Just like Latin, not all alphabets are the same.

(But in Cyrillic the changes are usually larger)

The letter ‘Щ’ is more of a schs, similar to ‘Ш’ (sh), just to let you know.

(sorry for the 4-day late response :sweat_smile:, I forgot to check my notifications)
