String Split Function with A Table of Separators?

Hey! I’ve found myself in a situation where I need an alternative to string.split(). Not only do I need multiple separators, I also want each separator to be of arbitrary length and made up of arbitrary symbols.

I’m not asking for a script. I’m pretty sure I could write one that just brute-forces string.find and collects each occurrence, but I was wondering whether this has already been done in a cleaner, more efficient way.

EDIT: Another detail that drastically complicates this issue: I’d like the separator to be kept in the string, not deleted the way string.split does.

Any help is appreciated!

Module:

local Splitter = {}

function Splitter.universalSplit(texto, ...)
	local separadores = {...}
	local resultados = {texto}

	for _, sep in ipairs(separadores) do
		local temp = {}

		for _, fragmento in ipairs(resultados) do
			local inicio = 1
			local s_inicio, s_fin = string.find(fragmento, sep, inicio, true)

			while s_inicio do
				local parte = string.sub(fragmento, inicio, s_inicio - 1)
				if parte ~= "" then
					table.insert(temp, parte)
				end
				inicio = s_fin + 1
				s_inicio, s_fin = string.find(fragmento, sep, inicio, true)
			end

			local resto = string.sub(fragmento, inicio)
			if resto ~= "" then
				table.insert(temp, resto)
			end
		end
		resultados = temp
	end

	return resultados
end

return Splitter

Example usage:

local Splitter = require(game.ReplicatedStorage:WaitForChild("Splitter"))

local data = "Level1>>Player_Pro---Cash:500>>LOL"

local List = Splitter.universalSplit(data, ">>", "_", "---", ":")

for _, v in ipairs(List) do
	print(v)
end

If that’s what you’re looking for, then this is perfect for you.

In English:

local StringUtil = {}

--[[
    Splits a string using an unlimited number of custom separators 
    of any length or symbol combination.
]]
function StringUtil.multiSplit(inputString: string, ...: string)
    local separators = {...}
    local results = {inputString}

    -- If no separators are provided, return the original string in a table
    if #separators == 0 then
        return results
    end

    for _, sep in ipairs(separators) do
        local tempResults = {}
        
        for _, fragment in ipairs(results) do
            local searchIndex = 1
            -- 'true' enables plain text search (skips Lua patterns for speed)
            local startPos, endPos = string.find(fragment, sep, searchIndex, true)

            while startPos do
                local segment = string.sub(fragment, searchIndex, startPos - 1)
                
                -- Only add non-empty segments to the list
                if segment ~= "" then
                    table.insert(tempResults, segment)
                end
                
                searchIndex = endPos + 1
                startPos, endPos = string.find(fragment, sep, searchIndex, true)
            end
            
            -- Add the remaining part of the string
            local remaining = string.sub(fragment, searchIndex)
            if remaining ~= "" then
                table.insert(tempResults, remaining)
            end
        end
        results = tempResults
    end

    return results
end

return StringUtil

Simplified form

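local text = "Manzana---Pera,Uva[CORTE]Piña"
local items = {}

-- The character class treats "-", ",", "[" and "]" as single-character
-- separators, so this only works when no separator is a multi-character word.
-- Here "CORTE" comes back as an item instead of being part of a "[CORTE]" separator.
for word in string.gmatch(text, "([^%-,%[%]]+)") do
    table.insert(items, word)
end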

My first thought is to substitute every delimiter (separator) with a single, universal delimiter, then you can split the string normally. You can generalize this process by keeping each delimiter in an array. When you want to split a string containing a combination of these delimiters, you can loop through each delimiter in the array and replace them with a universal delimiter using gsub before splitting the string.

It’s not the most optimized process, since strings are immutable and every gsub call builds a new intermediate string (Lua also uses string interning, search it up), but as long as your strings aren’t too large, I think your memory usage should be fine.

Something like this:

local UNIVERSAL_DELIM = "@" -- this delimiter can also exist in the delims table for this specific task, but I made it unique for the example
local delims = {".", "?", ":/:"}
local function splitButWithMultipleDelims(str) -- you should probably change this function name to something more concise
    local newStr = str
    for _, delim in delims do
        newStr = string.gsub(newStr, delim, UNIVERSAL_DELIM)
    end
    return string.split(newStr, UNIVERSAL_DELIM)
end

Quite an interesting problem. I decided to take my own crack at it:

type Characters = {string}



local function get_characters(text: string): Characters
    return string.split(text, "")
end

local function get_delimiter_characters(delimiters: {string}): {Characters}
    local results = table.create(#delimiters)
    
    for index, delimiter in delimiters do
        results[index] = get_characters(delimiter)
    end

    return results :: any
end

local function split_multi_delimiter(text: string, delimiters: {string}): {string}
    local characters = get_characters(text)
    local delimiters = get_delimiter_characters(delimiters)

    local match_index   = 1
    local current_index = 1
    
    local results = {}


    local function is_terminal(delimiter: Characters): boolean
        for offset, symbol in delimiter do
            local is_match = symbol == characters[current_index + offset - 1]
            if not is_match then
                return false
            end
        end

        return true
    end

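    -- Returns the length of the delimiter found at current_index, or nil
    local function match_delimiter(): number?
        for _, delimiter in delimiters do
            -- Skip empty delimiters so the scan always makes progress
            if #delimiter > 0 and is_terminal(delimiter) then
                return #delimiter
            end
        end

        return nil
    end

    local function record_slice(delimiter_length: number)
        local slice = string.sub(text, match_index, current_index - 1)

        -- Resume after the whole delimiter, not just its first character
        match_index = current_index + delimiter_length

        table.insert(results, slice)
    end


    while current_index <= #text do
        local delimiter_length = match_delimiter()
        if delimiter_length then
            record_slice(delimiter_length)
            current_index += delimiter_length
        else
            current_index += 1
        end
    end

    -- Record whatever follows the last delimiter
    record_slice(0)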

    return results
end


print(split_multi_delimiter("Hello, world! My name is Ziffix.", {",", "!"}))
--[[
{
  "Hello",
  " world",
  " My name is Ziffix.",
}
]]

Getting to write this in a more C-like language would have been easier, and I could have made more micro-optimizations.

That is a very elegant solution. Its only downside is that the universal delimiter cannot exist naturally within the given string.

I mean, that goes for any delimiter, right? Of course the universal delimiter doesn’t have to be the @ symbol, just anything you know won’t naturally appear in the string. I believe it could be multiple characters as well to further prevent natural occurrence.

Instead of an arbitrary universal delimiter based on “probably won’t be an issue”, it could be, let’s say, the first real delimiter. I am a bit tired, so hopefully this is correct reasoning haha

So for example, if the input delimiters are A, B, and C, then choose A as the universal delimiter.

EDIT: Also, there is a problem with using gsub, since there is no way to make the pattern plain (unless you escape it manually). Your example would not work if . (period) is a delimiter, because as a pattern it matches any character.
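
A rough sketch of the "first real delimiter as the universal delimiter" idea, sidestepping gsub entirely by doing the replacement with plain string.find. The names replacePlain and splitOnFirstDelim are made up for this example, and it assumes none of the delimiters is an empty string. It is written in plain Lua that should also run as Luau; on Roblox the final loop could be swapped for string.split(normalized, universal).

```lua
-- Hypothetical helper: replaces every literal occurrence of `target`
-- in `str` with `replacement` (plain search, no pattern matching).
local function replacePlain(str, target, replacement)
    assert(target ~= "", "target must not be empty")
    local pieces = {}
    local searchStart = 1
    local matchStart, matchEnd = string.find(str, target, searchStart, true)
    while matchStart do
        table.insert(pieces, string.sub(str, searchStart, matchStart - 1))
        table.insert(pieces, replacement)
        searchStart = matchEnd + 1
        matchStart, matchEnd = string.find(str, target, searchStart, true)
    end
    table.insert(pieces, string.sub(str, searchStart))
    return table.concat(pieces)
end

-- Hypothetical function: rewrites every delimiter to delims[1], then
-- splits on delims[1] only.
local function splitOnFirstDelim(str, delims)
    if #delims == 0 then
        return {str}
    end

    local universal = delims[1]
    local normalized = str
    for index = 2, #delims do
        normalized = replacePlain(normalized, delims[index], universal)
    end

    local results = {}
    local searchStart = 1
    local matchStart, matchEnd = string.find(normalized, universal, searchStart, true)
    while matchStart do
        table.insert(results, string.sub(normalized, searchStart, matchStart - 1))
        searchStart = matchEnd + 1
        matchStart, matchEnd = string.find(normalized, universal, searchStart, true)
    end
    table.insert(results, string.sub(normalized, searchStart))
    return results
end

local parts = splitOnFirstDelim("a.b?c:/:d", {".", "?", ":/:"})
print(table.concat(parts, " | ")) -- a | b | c | d
```

One caveat: if the delimiters overlap (one contains another, or a replacement creates a new match), the order of the delims array changes the result.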

I like that. Only introduces an extra string.find which is easy to implement.

You’re right. I knew I was forgetting something about Lua patterns. I believe %. should work. If a delimiter consists of multiple characters, then I think my function could insert a % before each punctuation character so that the pattern matches the literal characters (letters and digits have to be left alone, since something like %a is a character class).
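
For what it's worth, a minimal sketch of that escaping step. escapePattern and normalizeDelims are names invented here, and the delimiter list is the one from the earlier example. It relies on two documented pattern rules: %p matches punctuation, and % followed by a non-alphanumeric character stands for that character literally.

```lua
-- Hypothetical helper (not from the thread): escapes pattern magic
-- characters so gsub matches the delimiter literally.
local function escapePattern(text)
    -- "%p" matches each punctuation character; "%%%0" rewrites it as
    -- a literal "%" followed by the matched character.
    -- The outer parentheses drop gsub's second return value (the count).
    return (string.gsub(text, "%p", "%%%0"))
end

local UNIVERSAL_DELIM = "@"
local delims = {".", "?", ":/:"}

local function normalizeDelims(str)
    local newStr = str
    for _, delim in ipairs(delims) do
        newStr = string.gsub(newStr, escapePattern(delim), UNIVERSAL_DELIM)
    end
    return newStr
end

print(escapePattern(":/:"))           -- %:%/%:
print(normalizeDelims("a.b?c:/:d"))   -- a@b@c@d
```

Letters and digits are left untouched on purpose, because prefixing them with % would turn them into character classes.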

Here is my attempt at a general-purpose solution. I haven’t found a case that could break this yet. Let me know!

local function split(
	input: string,
	sep: { string }
)
	if #sep == 0 then
		return { input }
	elseif #sep == 1 then
		return string.split(input, sep[1])
	end
	
	local result: { string } = {}
	local len = #input
	local i = 1
	
	repeat
		
		local k = 0
		local w = 0
		local j = len
		
		for _, v in sep do
			local a, b = string.find(input, v, i, true)
			if a and a <= j then
				j = a
				w = b - a
				k = -1
			end
		end
		
		local sub = string.sub(input, i, j+k)
		table.insert(result, sub)
		
		i = j + w + 1
		
	until k == 0
	
	return result
end

This is what I ended up with after first implementing it using recursion and then converting it to a loop:

local function split(s: string, ...: string): {string}
	local parts = {s}
	for i = 1, select("#", ...) do
		local separator = select(i, ...)
		local newParts = {}
		for _, part in ipairs(parts) do
			local subparts = string.split(part, separator)
			for _, subpart in ipairs(subparts) do
				table.insert(newParts, subpart)
			end
		end
		parts = newParts
	end
	return parts
end

This is the best solution in my opinion. Additionally, you could even make it generate a regex from the array of separators.
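
On generating a pattern from the separator array: Lua patterns have no alternation operator, so as far as I know this only works when every separator is a single character that can go into one character class. A small sketch under that assumption (buildSplitPattern is a made-up name):

```lua
-- Hypothetical helper: builds a "([^...]+)" pattern from single-character
-- separators. Multi-character separators cannot be expressed this way.
local function buildSplitPattern(separators)
    local escaped = {}
    for _, sep in ipairs(separators) do
        assert(#sep == 1, "only single-character separators fit in a character class")
        -- Prefix non-alphanumeric characters with "%" so they are literal.
        -- The parentheses keep only gsub's first return value.
        table.insert(escaped, (string.gsub(sep, "%W", "%%%0")))
    end
    return "([^" .. table.concat(escaped) .. "]+)"
end

local pattern = buildSplitPattern({"-", ",", "[", "]"})
local items = {}
for word in string.gmatch("a-b,c[d]e", pattern) do
    table.insert(items, word)
end
print(table.concat(items, " ")) -- a b c d e
```

For word-length separators, a string.find loop is still needed.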

Please use more descriptive variable names in the future. Single-letter variables are harmful to readability in several ways, with the most damaging consequence being that readers cannot efficiently discern how the algorithm works.

I actually already used this method for another reason, funnily enough, but the issue I forgot to specify is that I need to preserve the separator as part of the array item (for context, I’m trying to extract every term of a polynomial given as a string).

This is a very nice function. Unfortunately, my separators are longer than one letter (essentially words as patterns).

I love how concise this one is. The problem arises once more that I’m looking to preserve the separator in the string (split automatically strips it). This detail wasn’t communicated in my post, and I do apologize for that.

My algorithm supports separator strings of any length. Try it out!

You’re right haha! I’m a little slow today.
I could see this function working great for some other use cases of mine, but I just edited the post because I forgot to mention that I’d like the separators to be kept, not deleted like the split method does. Any ideas how that could be implemented? I was thinking of keeping the previous find index and plugging it into the sub’s start index.
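
A minimal sketch of that "keep the find index" idea, assuming the separator should stay attached to the front of the piece that follows it (which suits signed polynomial terms). splitKeep is a made-up name, not an existing API, and empty separators are simply ignored. It is plain Lua that should also run as Luau.

```lua
-- Hypothetical function: splits `input` at any of `separators`, keeping
-- each separator at the start of the piece that follows it.
local function splitKeep(input, separators)
    local results = {}
    local pieceStart = 1   -- where the current piece begins (may be a separator)
    local searchStart = 1  -- where to look for the next separator

    while true do
        -- Find the earliest match; on a tie, prefer the longest separator.
        local bestStart, bestEnd
        for _, sep in ipairs(separators) do
            if sep ~= "" then
                local matchStart, matchEnd = string.find(input, sep, searchStart, true)
                if matchStart and (not bestStart
                    or matchStart < bestStart
                    or (matchStart == bestStart and matchEnd > bestEnd)) then
                    bestStart, bestEnd = matchStart, matchEnd
                end
            end
        end

        if not bestStart then
            break
        end

        -- Close the current piece just before the separator...
        if bestStart > pieceStart then
            table.insert(results, string.sub(input, pieceStart, bestStart - 1))
        end
        -- ...and start the next piece at the separator itself.
        pieceStart = bestStart
        searchStart = bestEnd + 1
    end

    if pieceStart <= #input then
        table.insert(results, string.sub(input, pieceStart))
    end
    return results
end

local terms = splitKeep("3x^2+2x-5", {"+", "-"})
print(table.concat(terms, " | ")) -- 3x^2 | +2x | -5
```

If the separators should instead be their own array elements, the same loop works with two inserts per match: the text before the separator, then string.sub(input, bestStart, bestEnd), with pieceStart set to bestEnd + 1.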

Do you mean the separators would be in the result array also?
