How can I catch letters with diacritics using string.gsub

goreacraft · January 3, 2023, 3:16am

What do you want to achieve?: I want to detect letters with diacritics, like “șțpăîâ”, so I can uppercase them.
What is the issue?: It simply does not work with the available character classes (%a, %l, %w, %u…), and using the “any character” class “.” is not useful here.
What solutions have you tried so far?: I tried all of the character classes above.


local firstUpperExpression = "%^(%a)"
function firstToUpper(str)	
	return (string.gsub(str, firstUpperExpression,string.upper))
end
print(firstToUpper("^învață să ^ăsd"))

will output: ^învață să ^ăsd
Desired output: Învață să Ăsd

If I do

print(firstToUpper("^invață să ^asd"))

it works and outputs: Invață să Asd, so my guess is that %a does not match letters with diacritics.

7z99 · January 3, 2023, 11:06pm

This happens because the character “ă” takes more than one byte in UTF-8, while the string library generally works one byte at a time. Only ASCII characters (codepoints 0 to 127) fit in a single byte, so a character class such as %a never sees “ă” as a single letter. There are some exceptions depending on which function you’re using (for example, gsub with a literal multi-byte sequence still works here).

Every byte of a multi-byte UTF-8 character has a value of 128 or above, which is outside the ASCII range. Since %a tests one byte at a time and, in the default C locale, only treats ASCII letters as alphabetic, Lua doesn’t see “ă” as a character that has an uppercase form.
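
To make the byte-level view concrete, here is a minimal check. It is only a sketch: it assumes the default C locale, and it writes “ă” with decimal byte escapes so the result doesn't depend on how the source file is encoded.

```lua
-- "ă" is U+0103, which UTF-8 encodes as the two bytes 196 and 131.
local aBreve = "\196\131"

print(#aBreve)             --> 2 (the length is counted in bytes, not characters)
print(aBreve:byte(1, -1))  --> 196 131

-- %a is tested against one byte at a time
print(("a"):match("%a"))   --> a
print(aBreve:match("%a"))  --> nil (assuming the default C locale)
```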

I looked around for quite a while and was unable to find a straightforward way to switch case for these characters; most answers said it depends on C’s locale, which AFAIK is not something the developer can control. Byte arithmetic isn’t a reliable shortcut either: for ă (U+0103) the uppercase Ă (U+0102) is exactly one codepoint lower, but for î (U+00EE) the uppercase Î (U+00CE) is 32 lower, and subtracting 1 from the first byte of î gives the registered sign ®. Since that’s not reliable, we need another method. This file already writes out the lowercase-to-uppercase mappings and vice versa. While it’s the manual approach, it’s really the only way to do it in pure Lua.
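
As a rough illustration of why a fixed offset can't work, the sketch below compares the codepoints of two lowercase/uppercase pairs. `decodeTwoByte` is a hypothetical helper written only for this example; it handles two-byte UTF-8 sequences and nothing else.

```lua
-- Decode a two-byte UTF-8 sequence into its codepoint (example-only helper).
local function decodeTwoByte(s)
	local b1, b2 = s:byte(1, 2)
	return (b1 - 192) * 64 + (b2 - 128)
end

local aLower, aUpper = "\196\131", "\196\130" -- ă (U+0103), Ă (U+0102)
local iLower, iUpper = "\195\174", "\195\142" -- î (U+00EE), Î (U+00CE)

local aOffset = decodeTwoByte(aLower) - decodeTwoByte(aUpper)
local iOffset = decodeTwoByte(iLower) - decodeTwoByte(iUpper)

print(aOffset) --> 1
print(iOffset) --> 32
```

The distance between the lowercase and uppercase forms differs from letter to letter, which is why a lookup table is the safer route.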

So, we can convert this to a module and write our own function to do this string conversion.

local utf8Data = require(script:WaitForChild('UTF8Data'))

-- note this isn't the most efficient implementation I'm sure but it should work fine as long as you aren't using
-- a string that's like a million to-be-gsubbed characters

local function firstToUpper(str: string)
	local hasReachedEnd = false
	while str:find('%^') and not hasReachedEnd do -- while '^' exists in str and the loop hasn't reached the end do
		local previousGraphemeBegin, previousGrapheme
		for startOfGrapheme, endOfGrapheme in utf8.graphemes(str) do
			if endOfGrapheme >= str:len() then
				-- if we reached the end we want to tell the topmost loop to break since we don't want to have an infinite loop
				-- see [1] below for example
				hasReachedEnd = true
			end
			-- thisGrapheme is equal to the grapheme in 'str' that starts at the position 'startOfGrapheme' and 
			-- ends at 'endOfGrapheme'. Let's define it here for readability since we need it in either case
			local thisGrapheme = str:sub(startOfGrapheme, endOfGrapheme)
			
			-- if the previous grapheme is equal to ^ and this grapheme is found in the UTF-8
			-- character data module, we know this grapheme should be converted to uppercase
			if previousGrapheme == '^' and utf8Data.lower_upper[thisGrapheme] then
				-- now, we need to replace the previous grapheme as well as the current grapheme
				-- also note that we need to add % to the beginning of the character because
				-- we want the escape form, we don't want the ^ modifier character to be used
				str = str:gsub('%' .. str:sub(startOfGrapheme - 1, endOfGrapheme), utf8Data.lower_upper[thisGrapheme])
				break
			end
			-- finally, we mark the beginning of this grapheme and the grapheme itself
			previousGraphemeBegin, previousGrapheme = startOfGrapheme, thisGrapheme
		end
	end
	return str
end

print(firstToUpper("^învață să ^ăsd")) --> Învață să Ăsd

-- [1] Since we only gsub when a character has a valid uppercase form,
--     we must break once the end of the string is reached, because '^'
--     may still exist in the string.
--     For example, the space character " " doesn't have an uppercase form.
--     Without the endOfGrapheme check above, this would cause an
--     infinite loop.
print(firstToUpper('^ hello')) --> ^ hello
print(firstToUpper("^ ^învață să ^ăsd")) --> ^ Învață să Ăsd
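
For comparison, a single-pass variant can lean on gsub's function replacement, which keeps the original match whenever the function returns nil. This is only a sketch under stated assumptions: `lowerUpper` is a tiny hypothetical stand-in for the real mapping module, `UTF8_CHAR` is a hand-written pattern for one UTF-8 encoded character, and the source file is assumed to be saved as UTF-8.

```lua
-- Hypothetical stand-in for the full lowercase -> uppercase mapping module.
local lowerUpper = {
	["ă"] = "Ă", ["â"] = "Â", ["î"] = "Î", ["ș"] = "Ș", ["ț"] = "Ț",
}

-- One UTF-8 character: a lead byte followed by any continuation bytes.
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

local function firstToUpperSketch(str)
	-- Match '^' followed by one whole UTF-8 character and capture that character.
	local result = str:gsub("%^(" .. UTF8_CHAR .. ")", function(ch)
		if lowerUpper[ch] then
			return lowerUpper[ch]
		end
		if #ch == 1 and ch:match("%l") then
			return ch:upper() -- plain ASCII letter
		end
		return nil -- no uppercase form: gsub keeps the original "^x" text
	end)
	return result
end

print(firstToUpperSketch("^învață să ^ăsd"))   --> Învață să Ăsd
print(firstToUpperSketch("^ hello"))           --> ^ hello
print(firstToUpperSketch("^ ^învață să ^ăsd")) --> ^ Învață să Ăsd
```

One difference from the loop above: the pattern consumes the character after each '^', so an input like "^^a" is left unchanged by this sketch.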

I published it as a free model, since I don’t want to paste the giant thousand-line module here:

goreacraft · January 4, 2023, 9:08pm

Thank you for the time and effort you spent writing this comprehensive response.
