Formatting a string to remove emojis, while keeping accented letters / foreign symbols

While scripting an item for my game that allows the player to give a custom name to their equipment, I realized that I wanted to disallow spammy whitespace characters like tabs and newlines. I quickly wrote a function to filter out those whitespace characters.

However, I then realized that I didn’t want my users naming their items with eggplant and fire emojis.

I was going to write an expression to disallow anything that isn’t an alphanumeric character or a space, but then I realized I would be totally screwing over users of other languages, such as Spanish users who want to use accents or Korean users who want to use… well, their language.

How can I write a string.find expression that only catches emoji, while ignoring all of these linguistic symbols?


This is a rather tedious method, but you could go through all of the emojis, use string.byte to grab their numerical codes, and store them in a table. Then you could iterate through all the characters of a string and check whether the byte of a character is that of an emoji.

Edit: By “going through all of the emojis” I mean inserting them by hand using an emoji keyboard and printing out their string.byte in the output, then storing that number in a table.


It appears most emoji return 240 for this, because string.byte works on bytes rather than characters and only returns the first byte of the emoji’s UTF-8 encoding. At least two non-Latin alphabets consistently do not return 240. (This is true in vanilla Lua; someone please test this in Studio.) This could be useful, but it might not be reliable behavior.

Edit: This is the case because emoji are encoded as multiple bytes, and a single emoji can be built from several code points, so it’s hard to tell exactly where one starts and ends. You’ll have to use a manual checking approach, since one emoji can actually be 13 bytes long (try printing the length of :man_guard:).
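To check those byte counts yourself, here is a small sketch. It assumes Lua 5.3+ or Luau, where the `\u{...}` string escape and the `utf8` library are available; the variable name is only for illustration.

```lua
-- Man guard emoji written as escapes:
-- U+1F482, zero-width joiner U+200D, U+2642, variation selector U+FE0F.
local manGuard = "\u{1F482}\u{200D}\u{2642}\u{FE0F}"

print(#manGuard)               -- length in bytes: 13
print(utf8.len(manGuard))      -- number of code points: 4
print(string.byte(manGuard))   -- first byte only: 240 (0xF0)

-- A 3-byte character such as the Hangul syllable U+D55C starts with a different byte.
print(string.byte("\u{D55C}")) -- 237 (0xED)
```

The `#` operator and string.byte both count bytes, which is why a single visible emoji can report a length of 13.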

Hmm, yeah, that probably isn’t too reliable. However, you could also store these emojis directly as strings in a table and then search the string entered by the player for each of them. This seems to work.
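A minimal sketch of that table-of-strings idea, assuming Lua 5.3+ or Luau. The `blockedEmoji` list and the `containsBlockedEmoji` function are hypothetical names, and a real list would need every emoji you want to reject.

```lua
-- Hypothetical blocklist; only two entries are shown for illustration.
local blockedEmoji = {
    "\u{1F346}", -- eggplant
    "\u{1F525}", -- fire
}

-- Returns true if any blocked emoji appears anywhere in the text.
local function containsBlockedEmoji(text)
    for _, emoji in ipairs(blockedEmoji) do
        -- The fourth argument (plain = true) disables pattern matching,
        -- so the emoji's bytes are searched for literally.
        if string.find(text, emoji, 1, true) then
            return true
        end
    end
    return false
end

print(containsBlockedEmoji("Sword of \u{1F525}")) -- true
print(containsBlockedEmoji("Espada del Señor"))   -- false
```

Searching for whole emoji strings avoids comparing single bytes, but the list has to be maintained by hand as new emoji are added.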


Maybe this would be helpful:
http://unicode.org/reports/tr51/tr51-12.html#Identification
It has related links to code point lists, etc.
Roblox also has a utf8 library and documents a UTF-8 match pattern (utf8.charpattern), in case you wish to do gsub checking or iterate over each UTF-8 character.
There are also specific character ranges for emojis, like those for other special characters, which can be found in charts or tables.
Example:

Generally, they have the same or a similar second-to-last byte and go in order, starting from around 0x80 for the last byte. I don’t know if there is anything between those empty spaces or before 0x80, but there might be. Similar approaches using Ruby:

Most emojis are 4 bytes long in UTF-8, and as far as I know few commonly used non-emoji characters are. Worst case, you can use the utf8 library and filter all characters with 4 or more bytes.
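The byte-length filter could be sketched like this, assuming Lua 5.3+ or Luau and valid UTF-8 input; `stripFourByteCharacters` is a hypothetical name. Note two limits of the approach: some emoji, such as the check mark U+2714, take only 3 bytes and pass through, and some non-emoji characters, such as rare CJK ideographs, take 4 bytes and get removed.

```lua
-- Removes every UTF-8 character that is encoded in 4 or more bytes.
local function stripFourByteCharacters(text)
    -- utf8.charpattern matches exactly one UTF-8 byte sequence,
    -- so the callback receives one character at a time.
    local result = string.gsub(text, utf8.charpattern, function(character)
        if #character >= 4 then
            return "" -- drop the character
        end
        return nil -- returning nil keeps the original match
    end)
    return result
end

print(stripFourByteCharacters("café 한국어 \u{1F346}!")) -- café 한국어 !
```

Accented Latin letters take 2 bytes and Hangul syllables take 3, so both survive the filter.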

A better way to do what you’re aiming for would be to restrict characters to the Basic Multilingual Plane of Unicode. This covers characters with their code point in the range of 0x0000 (0) to 0xFFFF (65535) and contains all ASCII characters and the scripts for most Western and Asian languages. Most emoji sit above this range, in the supplementary planes.

This runs into an issue for some African scripts and whatnot but it is a quick and easy solution. A better one would be to look at the specific blocks you’re after and whitelist characters in their ranges.

For more information about the planes and the blocks inside them, this article will be helpful: Plane (Unicode) - Wikipedia
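The whitelist idea above could be sketched as follows, assuming Lua 5.3+ or Luau. The `allowedRanges` table and both function names are hypothetical, and the ranges are only a small illustrative selection of blocks. A plain Basic Multilingual Plane check would replace `isAllowed` with `codepoint <= 0xFFFF`.

```lua
-- Illustrative whitelist of Unicode ranges as {first, last} code points.
local allowedRanges = {
    { 0x0020, 0x007E }, -- Basic Latin (printable ASCII)
    { 0x00A1, 0x00FF }, -- Latin-1 Supplement minus the no-break space
    { 0x0100, 0x017F }, -- Latin Extended-A
    { 0x1100, 0x11FF }, -- Hangul Jamo
    { 0xAC00, 0xD7A3 }, -- Hangul Syllables
}

local function isAllowed(codepoint)
    for _, range in ipairs(allowedRanges) do
        if codepoint >= range[1] and codepoint <= range[2] then
            return true
        end
    end
    return false
end

-- Keeps only whitelisted characters. Assumes text is valid UTF-8,
-- because utf8.codes raises an error on invalid input.
local function keepAllowed(text)
    local kept = {}
    for _, codepoint in utf8.codes(text) do
        if isAllowed(codepoint) then
            table.insert(kept, utf8.char(codepoint))
        end
    end
    return table.concat(kept)
end

print(keepAllowed("Señor 한\u{1F525}\u{2714}!")) -- Señor 한!
```

Because the check works on code points rather than bytes, it also drops 3-byte emoji such as U+2714, as well as tabs and newlines, since none of them fall in a listed range.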


I made this a long time ago when they first enabled emoji in chat/etc, I’m pretty sure this will work. It filters out all the emoji codepoints and should not filter out anything else. (all of the other things you mention would be separate codepoints or combinations of codepoints that are not in these emoji ranges)

Demojify.rbxm (1.7 KB) (or on GitHub)

(EDIT: apparently the file download didn’t work, it should work now)


local demojify = require((...).Demojify)

print(demojify("😂i love emojis ✔ they are #⃣1⃣ wait I gotta go 🚽🚀. EMOJI HYPE ✊"))
--> i love emojis  they are #1 wait I gotta go . EMOJI HYPE