I have been struggling to understand what utf8 is actually used for, and every attempt so far has failed. I have found almost no devforum topics or replies that mention it, let alone explain it in detail. There are no community tutorials about it, the Roblox Developer API Reference doesn’t help me, and there are no tutorials about it on YouTube, so I need serious help understanding utf8.
I wanted to make a thread about this because, as far as I can tell, nobody has asked about it on the devforum before. It seems extremely rare for anyone to mention utf8 here, and I haven’t seen anyone use it in the scripts they post either.
What exactly is utf8? Is it important? I would like a detailed explanation of what it is mainly used for and whether it matters, and it would be very nice to get some examples of where utf8 would be used. I have also heard that it’s a way of “encoding”. Does that mean it can be used to obfuscate scripts for better security (against people trying to steal scripts through an exploit or something), or at least to obfuscate strings and “decode” them back when you want them?
Thanks for your time. I hope you can give me any information about this. It seems very hard to understand, at least for people who have only recently started looking into it, so hopefully this thread will help me and anyone else who is struggling get a better understanding of utf8.
It’s not something specific to Roblox Lua. It’s a universal standard for encoding text, anywhere. You can’t choose to use it in Roblox, since they don’t expose anything that low level (it’s probably what the script editor uses, though). Basically, computers store everything as a series of “on” and “off” values, and those values can represent binary numbers. But how do they represent text? That’s where UTF-8 comes in: Unicode assigns each character a number, and UTF-8 specifies how that number is stored as one to four bytes. For example, “H” is U+0048, or the number 72, which UTF-8 stores as the single byte 72.
Edit: Whoops, I’m dumb. There is a utf8 library in Roblox. You can use it to convert characters to their codepoints (the name for the numbers they use) and vice versa. It also has a couple of other utility functions, like counting the codepoints in a string and finding the byte position where a given codepoint starts. It’s still not relevant for obfuscation, though.
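To make the conversions above concrete, here is a minimal sketch. It only uses utf8.codepoint, utf8.char, utf8.len and utf8.offset, which exist both in Roblox’s utf8 library and in standard Lua 5.3+. The variable names are just for illustration.

```lua
-- Character <-> codepoint for a plain ASCII letter.
local hCode = utf8.codepoint("H")   -- 72, i.e. U+0048
local hChar = utf8.char(72)         -- "H"

-- "é" is U+00E9. UTF-8 stores it as two bytes (195, 169).
local eAcute = utf8.char(0xE9)
local byteCount = #eAcute           -- 2: the # operator counts bytes
local charCount = utf8.len(eAcute)  -- 1: utf8.len counts codepoints

-- utf8.offset gives the byte position where the n-th codepoint starts.
local word = "a" .. eAcute .. "b"
local thirdStart = utf8.offset(word, 3)  -- 4, because "é" occupies bytes 2 and 3

print(hCode, hChar, byteCount, charCount, thirdStart)
```

The gap between byteCount and charCount is the whole reason the library exists: once a string contains anything outside ASCII, bytes and characters stop lining up one to one.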
As you may know, you can convert characters to integers via string.byte (and vice versa with string.char). This works well for ASCII characters, which each fit in a single byte. However, emojis and characters from languages other than English take up several bytes each in UTF-8, so string.byte and string.char, which work one byte at a time, do not suffice. That’s where the utf8 library comes into play. One use is that it allows you to break down all characters into numbers (codepoints).
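A small sketch of that difference, using an emoji written as a \u{...} escape so the example doesn’t depend on how the script file is saved. It assumes only string.byte, utf8.codepoint and utf8.codes, and the variable names are made up for the example.

```lua
-- Grinning face emoji, U+1F600.
local smile = "\u{1F600}"

-- string.byte sees four separate bytes, none of which is the character itself.
local b1, b2, b3, b4 = string.byte(smile, 1, -1)  -- 240, 159, 152, 128

-- utf8.codepoint sees one codepoint.
local smileCode = utf8.codepoint(smile)           -- 128512 (0x1F600)

-- utf8.codes walks a mixed string one codepoint at a time.
local points = {}
for _, codepoint in utf8.codes("hi" .. smile) do
    table.insert(points, codepoint)
end
-- points is now {104, 105, 128512}

print(b1, b2, b3, b4, smileCode, #points)
```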
The post I linked gives you a good rule of thumb for when you should use the utf8 library:
You should use these functions anywhere you need to manipulate text that you didn’t write yourself or may contain non-ASCII or non-English characters. If you truncate a string at a byte index that is not between whole codepoints you will end up with an invalid UTF-8 string that may render incorrectly or cannot be stored in a DataStore.
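Here is a rough sketch of the truncation problem that quote describes. truncateCodepoints is a hypothetical helper written for this example, not something the utf8 library provides. It relies on utf8.len returning nil for invalid UTF-8 and on utf8.offset returning the byte position where a codepoint starts.

```lua
-- Hypothetical helper: keep at most maxChars codepoints of s,
-- always cutting between whole codepoints.
local function truncateCodepoints(s, maxChars)
    local total = utf8.len(s)
    if total == nil then
        return nil  -- s was not valid UTF-8 to begin with
    end
    if total <= maxChars then
        return s
    end
    -- Byte position where the first codepoint we want to drop begins.
    local nextStart = utf8.offset(s, maxChars + 1)
    return string.sub(s, 1, nextStart - 1)
end

local cafe = "caf\u{E9}!"           -- 5 codepoints, 6 bytes
local naive = string.sub(cafe, 1, 4) -- cuts "é" in half
local safe = truncateCodepoints(cafe, 4)

print(utf8.len(naive))  -- nil (plus the position of the bad byte)
print(utf8.len(safe))   -- 4
```

Note that this cuts between codepoints, which is what the quoted advice asks for. Some things that look like a single character on screen are made of several codepoints, and this sketch does not try to keep those together.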
I use the utf8 library in my string compression module, which I use for serializing objects (serialized objects take up a lot of data, so compressing them helps a lot with minimizing datastore requests). While I could just use the regular string.char and string.byte functions, it’s quite possible that certain objects (or rather their properties) contain non-ASCII characters such as emojis. As such, I need the utf8 library to break the non-ASCII characters down into ASCII characters that are compatible with my string compression module.
It’s possible that you could incorporate the utf8 library into string obfuscation, but the library doesn’t give you a magic function that obfuscates strings. Technically, getting the codepoints could be considered ‘obfuscating’, but that would be trivial to de-obfuscate, so you’d have to write your own string obfuscation method. The utf8 library would simply let that method handle non-ASCII characters (like emojis or non-English characters).
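To show why codepoints alone are not real obfuscation, here is a minimal sketch of the round trip. toCodepoints and fromCodepoints are hypothetical helper names for this example. They only wrap utf8.codes and utf8.char, so anyone who sees the numbers can reverse them in one line.

```lua
-- Hypothetical helper: string -> list of codepoints.
local function toCodepoints(s)
    local points = {}
    for _, codepoint in utf8.codes(s) do
        points[#points + 1] = codepoint
    end
    return points
end

-- Hypothetical helper: list of codepoints -> string.
-- Fine for short strings; very long lists would need to be converted in chunks.
local function fromCodepoints(points)
    return utf8.char(table.unpack(points))
end

local secret = "key\u{1F511}"          -- "key" followed by a key emoji
local encoded = toCodepoints(secret)   -- {107, 101, 121, 128273}
local decoded = fromCodepoints(encoded)

print(table.concat(encoded, ","))
print(decoded == secret)
```

A custom scheme would have to do something extra with those numbers (shift them, shuffle them, and so on), and even then it would only slow someone down rather than provide real security.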