Issues with string pattern matching decimal numbers

Hi! I’ve been working on a custom markup system to add extra richtext tags; one feature of this is being able to pass properties e.g.,

-- Built in <font> tag
<font size='50' face='Michroma'>

-- My custom <blink> tag
<blink color1='rgb(255, 0, 0)' color2='rgb(0, 255, 0)' threshold='0.5' speed='3'>

Ripping property values is easy to do using string patterns (see: String Patterns)

-======================================

Example:

local tag = "<font size='50' face='Michroma'>"
local sizePattern = "size='(%d+)'"
local facePattern = "face='(%a+)'"

local size = tag:match(sizePattern)
local face = tag:match(facePattern)

print(size, face)
-- Output: 50 Michroma

Nice!

-======================================

Let’s try with my more complex tag…

local tag = "<blink color1='rgb(255, 0, 0)' color2='rgb(0, 255, 0)' threshold='0.5' speed='3'>"
local color1Pattern = "color1='rgb((%d+),%s?(%d+),%s?(%d+)'" -- %s? allows 0 or 1 whitespace
local color2Pattern = "color2='rgb((%d+),%s?(%d+),%s?(%d+)'"
local thresholdPattern = "threshold='(%d+)'"
local speedPattern = "speed='(%d+)'"

local color1 = {tag:match(color1Pattern)}
local color2 = {tag:match(color2Pattern)}
local threshold = tag:match(thresholdPattern)
local speed = tag:match(speedPattern)

print(color1, color2, threshold, speed)
-- Output:{} {} nil 3

That output is not what we expected…
Warning: in tag, the color and threshold values contain ( ) and . which are magic characters in string patterns
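
Side note for anyone following along: as far as I know, magic characters only matter inside the pattern, not in the string being searched, and prefixing one with % makes it literal. A minimal sketch of that for the color property, reusing the tag from above (the escaped pattern is my own illustration):

```lua
local tag = "<blink color1='rgb(255, 0, 0)' color2='rgb(0, 255, 0)' threshold='0.5' speed='3'>"

-- %( and %) match literal parentheses; the unescaped ( ) pairs are captures
local color1Pattern = "color1='rgb%((%d+),%s?(%d+),%s?(%d+)%)'"

local r, g, b = tag:match(color1Pattern)
print(r, g, b)
-- Output: 255 0 0
```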

-======================================

We can create a fix for the color/rgb issue by using curly brackets instead. I am, however, a bit stumped as to how to handle the decimal representation in threshold='0.5'

  • How about using scientific notation? Unfortunately, 0.5 == 5e-1, and - is a magic character
  • We can escape the . by using %. instead, but then the value is in the format %d%p%d
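
To make the second bullet concrete, here is a rough sketch of the escaped version. The pattern is only an illustration: it captures 0.5, but because the . is now required, the same pattern no longer accepts a whole number like 3.

```lua
local tag = "<blink threshold='0.5' speed='3'>"

-- %. is a literal decimal point, so digits are required on both sides of it
local decimalPattern = "='(%d+%.%d+)'"

print(tag:match("threshold" .. decimalPattern))
-- Output: 0.5
print(tag:match("speed" .. decimalPattern))
-- Output: nil
```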

Would greatly appreciate some thoughts/ideas here!

You’ll have to match against two patterns per number. There are two distinct cases that can’t both be encoded in a single pattern, because Lua string patterns have no alternation and aren’t especially powerful compared to regex.


local scientific_suffix_pattern = "^[eE][-+]?%d+" --optional exponent; tried separately because [ ] is a character set, not an optional group
local decimal_number_pattern_1 = "[-]?%d+[%.]?%d*"  --matches 1, 1. or 1.1 but not .
local decimal_number_pattern_2 = "[-]?%d*[%.]?%d+"  --matches .1 or 1.1 but not .

function match_decimal_number(s, init)
    local find1, end1 = s:find(decimal_number_pattern_1, init)
    local find2, end2 = s:find(decimal_number_pattern_2, init)
    --keep whichever pattern matched earliest (pattern 1 wins ties)
    if find2 and (not find1 or find2 < find1) then
        find1, end1 = find2, end2
    end
    if not find1 then
        return nil
    end
    --extend the match over an exponent suffix if one directly follows
    local _, suffix_end = s:find(scientific_suffix_pattern, end1 + 1)
    return s:sub(find1, suffix_end or end1)
end

local n = 0
function test(s)
    n = n + 1
    local m = match_decimal_number(s)
    --the match must cover the whole string, and Lua must agree it is a number
    if (m == s and tonumber(s)) then print(n, true, m) else print(n, false) end
end

--Failing cases
test(".")
test("e-1")
test("1e")
test("-")
test("-e1")
test("-e-1")

--Passing cases
test("123")
test("-123") --don't wanna type twice as many tests, just assuming negative numbers work in general xD
test("123.123")
test(".123")
test("123.")
test("123.123e1")
test(".123e1")
test("123.e1")
test("123.123e-1")
test(".123e-1")
test("123.e-1")
test("123.123e+1")
test(".123e+1")
test("123.e+1")



Thanks @ThanksRoBama!

Thought I might have to delve into more complicated pattern matching… I’m not a fan of having to use multiple patterns when I shouldn’t have to. So I made my own matching function, and thought I’d share it here for anyone else who encounters this issue in future.

---
---Behaves very similarly to string.match(), but is simpler!
---E.g., with string.match(), there is no easy way to rip out a decimal value like "0.3", as it is in format %d%p%d.
---Position `#` in places that you want to read values from.
---This function will not like `#` at the very start of the string, or `#` one after the other, or `#` as an innocent character.
---TODO: Be able to escape `#` characters.
---
---Encountering issues e.g., "unfinished capture"? You may have magic characters in the string you need to replace for something else.
---https://developer.roblox.com/en-us/articles/string-patterns-reference
---
---EXAMPLE:
---
---  local str = "<blink color1='rgb{255, 0, 0}' color2='rgb{0, 255, 0}' threshold='0.5' speed='3'>"
---
---  local thresholdPattern = "threshold='#'"
---  print(StringUtil:Match(str, thresholdPattern)) -- Output: {"0.5"}
---
---  local color1Pattern = "color1='rgb{#,#,#}'"
---  print(StringUtil:Match(str, color1Pattern)) -- Output: {"255", " 0", " 0"} (! Both 0s keep a leading space due to `str` formatting, but tonumber() ignores it)
---  
---@param str string
---@param pattern string
---@return string[]
---
function StringUtil:Match(str, pattern)
    local patternSplits = pattern:split("#") -- gets string portions between each #

    -- EDGE CASE: No # found!
    local totalSplits = #patternSplits
    if totalSplits <= 1 then
        return {}
    end

    -- Get total expected matches from pattern
    local totalMatches = 0
    for i = 1, pattern:len() do
        local char = pattern:sub(i, i)
        if char == "#" then
            totalMatches = totalMatches + 1
        end
    end

    -- Loops through string until we find a match of our full pattern (e.g., we may find a partial match that isn't actually the true match)
    local matches = {}
    local atPos = 0
    while true do
        matches = {}

        -- Find start of the pattern
        local posStart, posEnd = str:find(patternSplits[1], atPos)

        -- ERROR: No start found!
        if not posStart then
            return {}
        end

        atPos = posEnd + 1
        for i = 2, totalSplits do
            local thisStart, thisEnd = str:find(patternSplits[i], atPos)
            if thisStart then
                local bit = str:sub(posEnd + 1, thisStart - 1)
                table.insert(matches, bit)
            else
                -- Didn't find this bit, so not a full match. Loop back round again.
                break
            end

            posStart = thisStart
            posEnd = thisEnd

            atPos = posEnd + 1
        end

        if #matches == totalMatches then
            return matches
        end
    end

    return {}
end
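
A possible tweak, which is not part of the function above: string.find takes an optional fourth argument, and passing true for it turns pattern matching off so the search text is taken literally. A small sketch of the difference, using a made-up piece of pattern text that contains a magic character:

```lua
local str = "color1='rgb(255, 0, 0)'"
local piece = "color1='rgb(" -- the text before the first # in a pattern like "color1='rgb(#,#,#)'"

-- Treated as a pattern, the lone ( starts a capture that never closes, so find raises an error
local ok = pcall(string.find, str, piece)
print(ok)
-- Output: false

-- Treated as plain text, the same piece is found without any escaping
print(str:find(piece, 1, true))
-- Output: 1 12
```

With plain find, the pieces between # markers could contain ( ) . and other magic characters, at the cost of no longer being able to use pattern items such as %s? in them.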

I haven’t extensively tested it, but here are a couple of tests:

local str = "<blink color1='rgb{255, 0, 0}' color2='rgb{0, 255, 0}' threshold='0.5' speed='3'>"
local speedPattern = "speed='#'"
local thresholdPattern = "threshold='#'"
local color2Pattern = "color2='rgb{#,#,#}'"
local badPattern = "sausage='#'"
local evilPattern = "speed='3^"

print(StringUtil:Match(str, speedPattern), "SHOULD BE", "{3}")
print(StringUtil:Match(str, thresholdPattern), "SHOULD BE", "{0.5}")
print(StringUtil:Match(str, color2Pattern), "SHOULD BE", "{0, 255, 0}")
print(StringUtil:Match(str, badPattern), "SHOULD BE", "{}")
print(StringUtil:Match(str, evilPattern), "SHOULD BE", "{}")

That’s really handy! Consider posting in #resources:community-resources if you feel like it, I’m sure others have run into the same issue and it could be really helpful ^.^

You can merge dec_1 and dec_2 into a single pattern.

"[-]?%d*[%.]?%d*"

Thing is, that accepts the string “.” which Lua does not. I.e. you’ll get an error if you try `local a = .`. Two patterns are needed to ensure that if there are 0 digits on one side of the decimal point, there has to be 1 or more on the other.

In what scenario would a number just be .?

It wouldn’t, and that’s the point. The pattern you posted would match “.” in the situation where the first %d* matches 0 characters and the second %d* also matches 0 characters, which means it doesn’t match only valid numbers.

%d* matches zero or more, you may be getting * confused with -.

Yes, that’s exactly the point (no pun intended). Both the left and right side of the decimal point might match 0 characters, meaning the pattern would match “.”. That’s not a valid number literal in Lua, so the pattern shouldn’t match it, but it does. E.g.:

local p = "[-]?%d*[%.]?%d*"
local s = "According to all 10 test 0.10 10.0 known laws of aviation, there is no way a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because bees don't care what humans think is impossible."

for num in s:gmatch(p) do
    print("ayy", num)
end

… would match the numbers in the text, but also every single period which it shouldn’t. Another issue is that it matches the empty string.
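
If a single permissive pattern is still preferred, one way around both problems might be to let tonumber reject the bad candidates afterwards. A sketch under that assumption (the shortened sample string is just for illustration):

```lua
local p = "[-]?%d*[%.]?%d*"
local s = "all 10 test 0.10 10.0 known laws. The end."

local numbers = {}
for candidate in s:gmatch(p) do
    -- tonumber returns nil for "" and ".", so only real numbers are kept
    if tonumber(candidate) then
        table.insert(numbers, candidate)
    end
end

print(table.concat(numbers, ", "))
-- Output: 10, 0.10, 10.0
```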

During testing I wrote some code that might come in handy. Specifically it allows finding all numbers in a string:

local s = "According to all 10 test .10 10. known laws of aviation, there is no way a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because bees don't care what humans think is impossible."

local scientific_suffix_pattern = "^[eE][-+]?%d+" --optional exponent; tried separately because [ ] is a character set, not an optional group
local decimal_number_pattern_1 = "[-]?%d+[%.]?%d*"  --matches 1, 1. or 1.1 but not .
local decimal_number_pattern_2 = "[-]?%d*[%.]?%d+"  --matches .1 or 1.1 but not .

function find_decimal_number(s, init)
    local find1, end1 = s:find(decimal_number_pattern_1, init)
    local find2, end2 = s:find(decimal_number_pattern_2, init)
    --keep whichever pattern matched earliest (pattern 1 wins ties)
    if find2 and (not find1 or find2 < find1) then
        find1, end1 = find2, end2
    end
    if not find1 then
        return nil
    end
    --extend the match over an exponent suffix if one directly follows
    local _, suffix_end = s:find(scientific_suffix_pattern, end1 + 1)
    return find1, suffix_end or end1
end

--Print all numbers in the string
local i = 1
while true do
    local next_i, next_j = find_decimal_number(s, i)
    if not next_i then break end
    print(s:sub(next_i, next_j))
    i = next_j + 1 --continue searching after this match
end
-- prints 10, .10 and 10. and never prints . on its own or the empty string

Oh, and both - and * are zero or more repetitions. - matches as few as possible, * as many as possible.
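
A quick sketch of that difference, using a made-up string with two tags in it:

```lua
local s = "<a><b>"

-- * is greedy: the capture runs to the last > it can reach
print(s:match("<(.*)>"))
-- Output: a><b

-- - is lazy: the capture stops at the first >
print(s:match("<(.-)>"))
-- Output: a
```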

Right, but it matching . is a non-issue because you’ll never be attempting to match a single period.

It’s akin to worrying about satisfying a condition which will never arise.