Substring behaving weirdly with non-ASCII characters?

vanilla_wizard · August 25, 2022, 8:53pm

Context:

I’m creating a simple loop that iterates over each character of a string, checks if it’s in a table of accepted characters, and prints if there are any unaccepted characters.

Here’s the relevant code snippet (note that there are no undefined variables and that this code worked fine for years until I started testing it against new characters):

-- Checks the character/substring against each acceptable character --
local iterator1 = 1;
local iterator2 = iterator1;
local substring
local message2 = "";
local i = 0;
	
while i < length do
	substring = string.sub(msg,iterator1,iterator2);
	iterator1 = iterator1+1;
	iterator2 = iterator1;
	local isAnAcceptedChar = false;
	-- Check the character/substring against each acceptable character --
	for i,v in pairs(Accepted) do
		if v == substring then
			isAnAcceptedChar = true;
		end
	end
	if not isAnAcceptedChar then
		numNonAcceptedChars = numNonAcceptedChars + 1;
		print("Rejected character: " .. substring);
	end
	i = i+1;
end

Additional context: The accepted characters table contains all uppercase and lowercase letters in the standard 26-letter English alphabet and various punctuation characters, but it also contains new characters from the Unicode extended Latin alphabet (such as è for example).

The issue:

When I test with a sentence containing only ASCII characters, everything works normally. But when I try with one of the Unicode characters that I added to the accepted characters table, the issue arises. Here are some examples:

Test #1 - nothing weird happening here
User input: “Can I say this?”
Roblox output:

Original message: Can I say this?
-- No rejected characters

Test #2 - had an issue
User input: “Can I say è?”
Roblox output:

Original message: Can I say è?
Rejected character: �

I’m wondering how è became �. è is in the Accepted table, and Roblox was able to print it normally the first time, so the issue probably has something to do with the part where I try to get a substring? I’m not really sure why this is happening.

dutycall11 · August 25, 2022, 8:56pm

Because Roblox is detecting è as an emoji, which it displays as question marks.

Gooncreeper · August 25, 2022, 8:59pm

It's because these are UTF-8 characters, which are made up of more than one byte. If you do print(#"è") you will see it is two bytes long. Here is some code so you can see the byte values:

local str = "è"
for i = 1, #str do
	print(str:sub(i, i):byte())
end

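That should print 195 and 168. Those are the two bytes that make up the UTF-8 encoding of "è" (they are not ASCII characters, since ASCII only covers 0 to 127). string.sub works on bytes, so your loop cuts the character in half and compares each half on its own. I hope I explained this well.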
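To make the byte/character difference concrete, here is a minimal sketch. It writes "è" with byte escapes so the result does not depend on how the script file is saved; the variable name s is just for illustration.

```lua
-- "è" (U+00E8) spelled out as its two UTF-8 bytes
local s = "\195\168"

local byteCount = #s            -- length in bytes
local charCount = utf8.len(s)   -- length in code points
local firstByte = string.sub(s, 1, 1)

print(byteCount, charCount)
-- The one-byte slice is only half of the character, so it never equals "è"
print(firstByte == s)
```

A lone byte such as 195 is not valid UTF-8 on its own, which is most likely why the output window shows it as �.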

vanilla_wizard · August 25, 2022, 9:09pm

Thank you, you explained it very well.

Is there a way I could modify my loop to detect whether a character is a multi-byte UTF-8 character? If I could identify one, I could make the substring span all of the bytes that make up that character and it should work normally, but I'm not sure how I'd achieve that.

Gooncreeper · August 25, 2022, 9:12pm

Yes, you could use the utf8.codes iterator.

Here is how you could use it:

for i, c in utf8.codes(str) do 
    print(c)
end

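Note: c will be the character's Unicode code point (a number), not the character itself, and i is the byte position where that character starts. Use utf8.char(c) to turn it back into a string.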
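As a rough sketch of how the original check could be rewritten around utf8.codes: the Accepted, msg and numNonAcceptedChars names come from the question, but the values given to them here are made-up stand-ins, and acceptedSet is a helper I added so each lookup is a single table index instead of an inner loop.

```lua
-- Hypothetical stand-ins for the variables in the original snippet.
-- "\195\168" is "è" written as bytes so file encoding does not matter.
local Accepted = { "C", "a", "n", " ", "I", "s", "y", "?", "\195\168" }
local msg = "Can I say \195\168?"

-- Turn the list into a set keyed by character
local acceptedSet = {}
for _, ch in ipairs(Accepted) do
	acceptedSet[ch] = true
end

local numNonAcceptedChars = 0
for _, codepoint in utf8.codes(msg) do
	-- Convert the code point back into a (possibly multi-byte) string
	local ch = utf8.char(codepoint)
	if not acceptedSet[ch] then
		numNonAcceptedChars = numNonAcceptedChars + 1
		print("Rejected character: " .. ch)
	end
end
print("Rejected count: " .. numNonAcceptedChars)
```

Two caveats: utf8.codes raises an error if the string is not valid UTF-8, so for untrusted input it may be worth checking utf8.len(msg) first (it returns nil plus the position of the first bad byte). Also, an accented letter can be typed as a base letter plus a combining mark, which is two code points, and this loop would treat those as two separate characters.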

Additionally: a UTF-8 character can be up to 4 bytes long.
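A quick way to see the different lengths is to encode a few code points and measure the results. The sample code points below are my own picks (A, è, € and an emoji), not anything from the original script.

```lua
-- Code points chosen to need 1, 2, 3 and 4 bytes in UTF-8
local samples = { 0x41, 0xE8, 0x20AC, 0x1F600 }

local lengths = {}
for index, codepoint in ipairs(samples) do
	local encoded = utf8.char(codepoint)
	lengths[index] = #encoded  -- # counts bytes, not characters
	print(string.format("U+%04X takes %d byte(s)", codepoint, #encoded))
end
```

So a fixed two-byte substring would fix è but still break on characters outside the two-byte range, which is another reason to iterate by code point.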