[1.0.0a2] PCRE2-based RegEx Implementation for Luau - A better string pattern library

Blockzez · November 15, 2020, 11:29pm

2023 update: I will likely not maintain the 2020 implementation. I might re-implement this in the future but unfortunately, I have no plans to do so now. Apologies for any inconvenience.

NOTE: This isn’t bug-free, and many features aren’t available yet, as this is still a work in progress and in alpha.

Find Lua’s string pattern library lacking? No problem: with this RegEx implementation you get to use more of the RegEx features that “other” programming languages have, in Luau. This is PCRE2-based, so don’t be surprised if your RegEx pattern from ECMAScript, Python, .NET, etc. doesn’t work the way you expect.

Motivation to create this

I’ve wanted a better string pattern library module, one that includes features like alternation. I’ve looked over the community resources and found nothing on RegEx.

Why should I use this over Lua’s string library?

Lua’s built-in string pattern library is quite lacking in terms of features. While that’s not necessarily a bad thing, it can be “inconvenient” at times. This RegEx implementation has many more useful features, like alternation, non-capturing groups and group repetition.

Features

Alternation

Want to match either a or bc but not ac or b? You’re looking for alternation. For example, a|bc only matches a or bc. It can be inside a group, so a(?:bc|d) only matches either abc or ad.
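
For comparison, Lua’s built-in patterns have no alternation operator, so a|bc has to be emulated by trying each alternative in turn. A minimal sketch in plain Lua (matchAny is a hypothetical helper for illustration, not part of this module):

```lua
-- Hypothetical helper: emulate a|bc with Lua patterns by trying each alternative.
local function matchAny(s, alternatives)
    for _, pattern in ipairs(alternatives) do
        local found = string.match(s, pattern)
        if found then
            return found
        end
    end
    return nil
end

-- Anchored with ^ and $ so the whole string must be "a" or "bc".
local alternatives = { "^a$", "^bc$" }
print(matchAny("a", alternatives))  -- a
print(matchAny("bc", alternatives)) -- bc
print(matchAny("ac", alternatives)) -- nil
```

With real alternation, the same check is a single pattern, which is the convenience this feature is about.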

Group repetition

Want to repeat a whole group? /(abc)?/ matches either abc or nothing at all, never just part of it.
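
For contrast, quantifiers in Lua’s built-in patterns only apply to a single character class, never to a group. The short check below, in plain Lua, shows that a ? written after a capture is simply treated as a literal question mark:

```lua
-- In a Lua pattern, "?" after ")" is not a quantifier on the group;
-- it is matched as a literal "?" character.
local withoutMark = string.match("abc", "(abc)?")
local withMark = string.match("abc?", "(abc)?")

print(withoutMark) -- nil, because no literal "?" follows "abc"
print(withMark)    -- abc
```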

Non-capturing groups

Want to create a group without capturing it? You can use non-capturing groups. For example, (?:ab)* matches repeated ab without capturing it.

Lookarounds

Want to require something to be matched without actually matching it? Lookarounds got you.

Flags

Like other flavours, this RegEx has flags:

Flag	Name	Description
i	case insensitive	Ignore ASCII cases
m	multiline	`^` matches at the beginning of each line
s	single line (dotAll)	`.` matches new line
u	unicode	Match where patterns like `\d`, `\w`, posix classes, etc are in Unicode instead of ASCII.
x	extended	Ignore whitespaces (unless in character set), and treat from `#` to the end of the line as a comment.

Arbitrary amount of repetition

Lua patterns only offer the equivalents of {0,1} (?), {0,} (*), {1,} (+) and {0,}? (-). With this, you can choose any number of times you want it to match.
For example, a{2,} matches if there are at least 2 consecutive a characters, and a{3,5} matches at least 3 and at most 5 of them.
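
To see what this saves you, here is a sketch of how a{3,5} has to be spelled out with Lua’s built-in patterns: the mandatory copies are written out, followed by optional ones. repeatRange is a hypothetical helper for illustration and only works for single character classes, since Lua’s ? cannot apply to a group.

```lua
-- Hypothetical helper: build a Lua pattern equivalent to class{min,max}.
local function repeatRange(class, min, max)
    return string.rep(class, min) .. string.rep(class .. "?", max - min)
end

local pattern = repeatRange("a", 3, 5)
print(pattern)                                 -- aaaa?a?
print(string.match("aaaaaaa", "^" .. pattern)) -- aaaaa
print(string.match("aa", "^" .. pattern))      -- nil
```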

Strings interpreted as UTF-8

So characters like 字 won’t be counted twice. You might thank me for later.
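
Why this matters: Lua’s own string functions count bytes, not characters, and 字 takes three bytes in UTF-8. A quick check in plain Lua (the utf8 library only exists in Lua 5.3+ and Luau, hence the guard):

```lua
-- The UTF-8 encoding of 字 (U+5B57), written as byte escapes.
local s = "\229\173\151"

print(#s) -- 3, the length is counted in bytes

-- "." in a Lua pattern matches a single byte, not the whole character.
local firstByte = string.match(s, ".")
print(#firstByte) -- 1

if utf8 then
    print(utf8.len(s)) -- 1, counted in code points
end
```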

Lazy and possessive quantifiers

Want to make it match as few times as possible? Or match without backtracking? Use ? (lazy) or + (possessive) after a quantifier!

Other features

There are some features that I’d like to mention; you probably don’t need them.

Comments

Want to comment something, when needed or needlessly? No problem, you can use the (?#...) syntax. Alternatively, you can use # as a comment with the extended flag.

Atomic group

Disables backtracking into the group once it has matched, e.g. after bc has been matched in a(?>bc|b)c, the group won’t be retried with b.

POSIX classes

In the syntax of [:name:] inside a character set. Maybe you prefer this style over escape codes as a class?

POSIX class	Description	Equivalency	Equivalency with Unicode flag
[:alnum:]	Alphanumerical characters	[a-zA-Z0-9]	[\p{L}\p{Nl}\p{Nd}]
[:alpha:]	Alphabetical characters	[a-zA-Z]	[\p{L}\p{Nl}]
[:ascii:]	ASCII characters	[\x00-\x7F]
[:blank:]	Space and tab	[ \t]	[\p{Zs}\t]
[:cntrl:]	Control characters	[\x00-\x1F\x7F]	\p{Cc}
[:digit:]	Digits	[0-9]	\p{Nd}
[:graph:]	Visible characters (anything aside from spaces and control characters)	[\x21-\x7E]	[^\p{Z}\p{C}]
[:lower:]	Lowercase letters	[a-z]	\p{Ll}
[:print:]	Visible characters and spaces (anything aside from control characters)	[\x20-\x7E]	\P{C}
[:punct:]	Symbols	[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]	\p{P}
[:space:]	Whitespaces, incl. line breaks	[ \t\r\n\v\f]	[\p{Z}\t\r\n\v\f]
[:upper:]	Uppercase letters	[A-Z]	\p{Lu}
[:word:]	Word characters (letters, numbers and underbars)	[A-Za-z0-9_]	[\p{L}\p{Nl}\p{Nd}\p{Pc}]
[:xdigit:]	Hex digits	[A-Fa-f0-9]

API

(The API is inspired by Python’s re module)

RegEx RegEx.new(string pattern, string flags)
Creates a RegEx pattern without delimiters.

RegEx RegEx.fromstring(string string)
Parses a RegEx literal such as /example/i and compiles it. The delimiter is your choice (although I recommend /) and can be any character except a backslash or an alphanumerical ASCII character. So /example/i, ~example~i, %example%i, etc. are valid, but 1example1i, \example\i, aexamplea1, etc. aren’t. To get the literal character of the delimiter without closing the pattern, escape it with a backslash, e.g. /\//i is interpreted as the pattern \/, which matches a literal /, with a flag of i.
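
The delimiter rule can be sketched in plain Lua as below. This is a hypothetical illustration of the rule as described, not the module’s actual parser; splitLiteral and its error strings are made up for the example, and it only looks at single-byte delimiters.

```lua
-- Hypothetical sketch: split a RegEx literal into its pattern and flags.
local function splitLiteral(literal)
    local delimiter = string.sub(literal, 1, 1)
    -- The delimiter may not be a backslash or an alphanumerical character.
    if delimiter == "" or delimiter == "\\" or string.find(delimiter, "^%w$") then
        return nil, "invalid delimiter"
    end
    local i = 2
    while i <= #literal do
        local c = string.sub(literal, i, i)
        if c == "\\" then
            i = i + 2 -- skip the backslash and the character it escapes
        elseif c == delimiter then
            -- Everything after the closing delimiter is treated as flags.
            return string.sub(literal, 2, i - 1), string.sub(literal, i + 1)
        else
            i = i + 1
        end
    end
    return nil, "missing closing delimiter"
end

print(splitLiteral("/example/i")) -- example, then i
print(splitLiteral("~example~i")) -- example, then i
print(splitLiteral("1example1i")) -- nil, then "invalid delimiter"
print(splitLiteral("/\\//i"))     -- \/, then i
```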

string RegEx.escape(string string)
Escapes the string so it’ll be treated as plain text in this RegEx flavour.

string/nil RegEx.type(any value)
Returns "RegEx" if it’s a RegEx pattern created using RegEx.new or RegEx.fromstring.
Returns "Match" if it’s a RegEx match created by RegEx.match, or one passed by RegEx.sub to the repl argument when repl is a function.
Otherwise it returns nil.

RegEx.Match class
The RegEx.Match class can only be created by RegEx.match and RegEx.matchall, or as the argument passed by RegEx.sub when the repl argument is a function.

number, number RegEx.Match.span(Match match, number/string group)
Returns the span of the matched RegEx

string RegEx.Match.group(Match match, number/string group)
Returns the match of the group as a string

string… RegEx.Match.groups(Match match)
Returns each capturing group’s match as a separate string.
If there are no capturing groups, returns the entire match as one string.

table RegEx.Match.grouparr(Match match)
Returns the match as a table in array form, along with the n key for the length (Why? Because some groups might be nil; for example, RegEx.match("/(abc)(def)?/", "abc"):grouparr() has a length of 2 but item 2 is nil).

table RegEx.Match.groupdict(Match match)
Returns the match as a table, with the key as the named capturing groups.
If there are no named capturing groups, it’ll return an empty table.

Methods

You can use method calls for these, e.g. RegEx.new(pattern, flag):match(string) for RegEx patterns, or call them directly. Both RegEx.match("/pattern/", string) and RegEx.match(RegEx.new(pattern, flag), string) are accepted.

boolean RegEx.test(RegEx pattern, string string, number init)
Returns a boolean indicating whether the RegEx pattern matches the string.

Match/nil RegEx.match(RegEx pattern, string string, number init)
Returns the match. If the match cannot be found, returns nil.

function RegEx.matchall(RegEx pattern, string string, number init)
Returns a function that acts like an iterator. Whenever it’s called it’ll return the next match.

string, number RegEx.sub(RegEx pattern, string/function[/table] repl, string string, number n, boolean match_class, string flags)
Returns a string where all occurrences of the pattern have been replaced by repl, which can be a string, a function, or (with the o flag) a table, followed by the number of occurrences it replaced.
If repl is a string:
It’ll substitute the following sequences if the string contains $ or \:

Substitute characters	Description
$number	Substitute the character by the capturing group identified by the number
${name/number}	Substitute the character by the capturing group identified by the name or a number if it’s a number
\number	Same as $number

If repl is a function/table:
It’ll pass a Match object as the argument and call the function, or index the table with the full match string, then take the first returned value (if nothing is returned, it’s treated as nil). If the returned value is a string or a number, the replacement is that value taken literally (no substitutions), converted to a string if it’s a number. A returned Match object that was passed as the argument is also accepted. Otherwise the replacement is treated as an empty string (or as the entire match if the value is false or nil and the o flag is set).

In version 1.0.0a2+ there’s a flag:

Flag	Name	Description
o	Lua 5.1 (one) mode	This module defaults to Lua 5.0 behaviour. Turn on this flag so that a table is accepted as the argument, `false` and `nil` return the full match, and any other value errors
l	literal	The `repl` argument is now a literal
u	unknown unset	Unknown groups are treated as an unset

table RegEx.split(RegEx pattern, string string, number n)
Splits the string into a table, with the pattern argument as the separator pattern. For example, RegEx.split("/[[:space:]]/", "hello world") returns {"hello", "world"}

Differences from PCRE2

This flavour of RegEx has many differences from PCRE2:

The POSIX syntax for word boundaries ([[:<:]] and [[:>:]]) isn’t available, as it’s a legacy feature that I presume only exists for compatibility. I have no plans to include it.
This flavour of RegEx has no defined limits, while PCRE2 has specified limits.
This flavour doesn’t have version conditions, so something like (?(VERSION>=version)before|after) isn’t valid. I might include it, but why do you need it?
The only verbs supported are the newline conventions (*CR), (*LF), (*CRLF), (*ANYCRLF), (*ANY) and (*NUL); accept, fail, prune and skip, i.e. (*ACCEPT), (*FAIL), (*PRUNE) and (*SKIP); and the lookarounds (*pla:), (*plb:), (*nla:) and (*nlb:).

And features that this RegEx implementation currently lacks but that are planned:

Internal option settings
\Q and \E literals
Unicode extended grapheme cluster \X
Conditional groups
Control escape characters e.g. Ctrl + X is \cX
Probably many more.

Why is it PCRE2-based? Why isn’t it ${insert any other RegEx flavour}-based?

Because it’s my decision ;). Things like this get arbitrary; I don’t think there’s a reason behind it. I could have decided to base it on the ECMAScript, ICU, .NET, Python or POSIX (BRE or ERE) flavour of RegEx instead if I wanted to.

Where to get it?

You can get it here (2020 implementation)
1.0.0a2 (expat licence version as requested by someone I won’t name for privacy reasons): RegEx (expat licence version).rbxm (161.5 KB)
1.0.0a2 (BSD 2-Clause licence): RegEx.rbxm (161.5 KB)
1.0.0a1: RegEx.rbxm (156.9 KB)

Cyafu · November 16, 2020, 12:45am

Thanks so much. As a JavaScript programmer, spending the time to learn regex was hard enough, but then having to do it for a different pattern-matching system in lua was even more annoying. Thanks for porting regex to lua lol

Autterfly · November 16, 2020, 1:13am

Does this module take into account weird regex cases that might scale complexity exponentially? A lot of regex implementations (including PCRE) still struggle with this issue.
https://swtch.com/~rsc/regexp/regexp1.html

PysephDEV · November 16, 2020, 10:51am

AHH YESS!!
I’ve been struggling with Lua’s regex, trying to make a pattern capturing some stuff I needed.
Turns out Lua’s regex library didn’t support ternary operators, so I had to give up…
Look at this horrible monstrosity I made :'([_%a]%w*)%s*%(?(["\']?)([%b()%b""%b\'\'%w]*)%2%)?'

mr_iamthefuryrbx · January 21, 2021, 9:40pm

This is great work @Blockzez, and thank you for sharing it with the community! I would like to discuss an open source implementation with tests available on GitHub. My team and I have been adding tests internally, and we’d like to give back to your efforts. Please message me if you have a moment to talk.

Thank you again!

Jeff Hampton

XOLT1268 · April 12, 2021, 11:01pm

Could you help me out a bit? I have the pattern:
“ (\w+)(\ ?=\ ?)(.*)” and want to match something like var=anything, it works inside a pcre2 tester, just not the library.

Any help on changing it around maybe so that it’ll work?

Blockzez · April 13, 2021, 10:44am

It’s currently in alpha, therefore I expect bugs like this to persist, but unfortunately I cannot reproduce this.

RegEx.fromstring("/(\\w+)(\\ ?=\\ ?)(.*)/"):match("var=anything")

Make sure the string is actually \w. \ is a special character in string literals, so to get the string \w, escape \ as \\ or use raw string literals ([[]] and [=[]=], where the number of = is arbitrary but must match on both sides).
I have fallen for that trap before.
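
The two spellings can be checked against each other in plain Lua, without the module. The pattern text is the one from the question; the variable names are just for illustration.

```lua
-- Escaped form: every backslash in the pattern is written as \\.
local escaped = "(\\w+)(\\ ?=\\ ?)(.*)"
-- Long-bracket form: backslashes are kept as written.
local raw = [[(\w+)(\ ?=\ ?)(.*)]]
-- A level-1 long bracket, handy when the pattern itself contains ]].
local leveled = [=[(\w+)(\ ?=\ ?)(.*)]=]

print(escaped == raw)     -- true
print(escaped == leveled) -- true
print(#"\\w")             -- 2, a backslash followed by w
```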

XOLT1268 · April 13, 2021, 4:31pm

I see, thank you I will try it out later today. Had no idea about the \ thing.

majdTRM · August 27, 2021, 9:53pm

This is really great! However, it would be nice if you could upload it as a Roblox package that way every time I have to use it I wouldn’t have to copy and paste.

User9684 · August 16, 2023, 2:25am

Wonderful library, but I do have a slight issue.

I am attempting to use the regex /(?:[0-9]*(?:[/*\-+])[0-9]+)+/ to detect mathematical equations, and it seems like it’s detecting the / inside the group as the ending of the regex, when it isn’t.

Additionally, attempting to put a \ behind the / does not make it see it as a normal character either.

RickAstll3y · August 16, 2023, 11:18am

Just use normal matching

local pattern = "(%d+)([+*/%-])(%d+)" -- "-" is escaped as %- so it is not read as a range
print(string.match("1+1", pattern))
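
In a Lua character class, a - placed between two characters forms a range, which is easy to trip over when listing arithmetic operators. A small check in plain Lua (the variable names are just for illustration):

```lua
-- "*-/" is read as the range from "*" (byte 42) to "/" (byte 47),
-- which also covers "," and ".".
local asRange = "(%d+)([+*-/])(%d+)"
-- Escaping the hyphen as "%-" keeps it literal.
local asLiteral = "(%d+)([+*/%-])(%d+)"

print(string.match("1.1", asRange))   -- matches: captures "1", ".", "1"
print(string.match("1.1", asLiteral)) -- nil
print(string.match("1-1", asLiteral)) -- matches: captures "1", "-", "1"
```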

User9684 · August 16, 2023, 12:21pm

That isn’t a solution to the actual problem.

RickAstll3y · August 16, 2023, 4:33pm

Try this regex

(?:[0-9]*(?:[*+\-\/])[0-9]+)+