2023 update: I will likely not maintain the 2020 implementation. I might re-implement this in the future but unfortunately, I have no plans to do so now. Apologies for any inconvenience.
NOTE: This isn’t bug-free, and many features aren’t available yet, as this is still an alpha work in progress.
Find Lua’s string pattern library lacking? No problem: with this RegEx implementation you get to use more of the RegEx features that “other” programming languages have, in Luau. This is PCRE2-based, so don’t be surprised if your RegEx pattern from ECMAScript, Python, .NET, etc. doesn’t work the way you expect.
Motivation to create this
I’ve wanted a better string pattern library module, one that includes features like alternation. I’ve looked over the community resources and found nothing on RegEx.
Why should I use this over Lua’s string library?
Lua’s built-in string pattern library is quite lacking in terms of features. While that’s not necessarily a bad thing, it can be “inconvenient” at times. This RegEx implementation has far more useful features, like alternation, non-capturing groups and group repetition.
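To make the gap concrete, here is a small sketch in plain Lua (no module needed) showing that the built-in pattern library treats `|` as an ordinary character rather than as alternation:

```lua
-- In Lua's built-in patterns, "|" is an ordinary character, not alternation.
local pattern = "a|bc"

-- "bc" alone does not match, because the pattern means the literal text "a|bc".
print(string.find("bc", pattern))    -- nil

-- Only a subject containing the literal text "a|bc" matches.
print(string.find("a|bc", pattern))  -- start 1, end 4
```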
Features
Alternation
Want to match either `a` or `bc` but not `ac` or `b`? You’re looking for alternation. For example, `a|bc` only matches `a` or `bc`. It can be inside a group, so `a(?:bc|d)` only matches either `abc` or `ad`.
Group repetition
Want to repeat a whole group, or make it optional? `/(abc)?/` only matches `abc` or nothing at all, nothing less.
Non-capturing groups
Hey, don’t want to capture the group but still want to create a group? You can use non-capturing groups. For example, `(?:ab)*` matches repeated `ab` without capturing it.
Lookarounds
Want to require something to be present without including it in the match? Lookarounds have got you covered.
Flags
Like other flavours, this RegEx implementation has flags:
Flag | Name | Description |
---|---|---|
i | case insensitive | Ignore ASCII cases |
m | multiline | `^` matches at the beginning of each line |
s | single line (dotAll) | `.` matches new lines |
u | unicode | Patterns like `\d`, `\w`, POSIX classes, etc. match Unicode characters instead of only ASCII. |
x | extended | Ignore whitespace (unless in a character set), and treat everything from `#` to the end of the line as a comment. |
Arbitrary amount of repetition
Lua restricts repetition to `{0,1}` (`?`), `{0,}` (`*`), `{1,}` (`+`) and `{0,}?` (`-`). With this implementation you can choose any number of times you want something to match.
For example, `a{2,}` matches if there are at least 2 consecutive `a` characters, and `a{3,5}` matches if there are at least 3 `a` characters, with a maximum of 5 characters to match.
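For comparison, here is a short sketch in plain Lua (no module needed) of what happens if you try brace quantifiers with the built-in pattern library, and the hand-written workaround people usually fall back on:

```lua
-- Lua's built-in patterns have no {m,n} quantifier: the braces are literal text.
print(string.find("aaa", "a{2,}"))    -- nil
print(string.find("a{2,}", "a{2,}"))  -- start 1, end 5 (only the literal text matches)

-- The usual workaround is to spell the minimum out by hand:
-- "aaa*" means two "a" characters followed by zero or more "a".
print(string.find("caaab", "aaa*"))   -- start 2, end 4
```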
Strings interpreted as UTF-8
So a character like `字` counts as one character rather than once per byte. You might thank me later.
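As background, this plain-Lua sketch (no module needed) shows why byte-based matching is a problem for such characters. The string is written with decimal escapes so the snippet does not depend on the file's encoding; the `utf8` library is only present in Luau and Lua 5.3+, so that part is guarded.

```lua
local s = "\229\173\151"  -- UTF-8 bytes of 字 (U+5B57), written as decimal escapes

print(#s)                      -- 3: the length operator counts bytes
print(string.match(s, "^.$"))  -- nil: "." matches a single byte, not the whole character

-- Luau and Lua 5.3+ ship a utf8 library that counts code points instead.
if utf8 then
    print(utf8.len(s))         -- 1
end
```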
Lazy and possessive quantifiers
Want to make it match as few times as possible? Or match without backtracking? Use `?` (lazy) or `+` (possessive) after a quantifier!
Other features
There are some features that I’d like to mention; you probably don’t need them.
Comments
Want to comment something, when needed or needlessly? No problem, you can use the `(?#...)` syntax. Alternatively, you can use `#` as a comment with the extended flag.
Atomic group
Disables backtracking into the group once it has matched, e.g. after `bc` has been matched in `a(?>bc|b)c`, the `b` alternative is not tried again.
POSIX classes
In the syntax of `[:name:]` inside a character set. Maybe you prefer this style over escape codes as a class?
POSIX class | Description | Equivalency | Equivalency with Unicode flag |
---|---|---|---|
[:alnum:] | Alphanumerical characters | [a-zA-Z0-9] | [\p{L}\p{Nl}\p{Nd}] |
[:alpha:] | Alphabetical characters | [a-zA-Z] | [\p{L}\p{Nl}] |
[:ascii:] | ASCII characters | [\x00-\x7F] | |
[:blank:] | Space and tab | [ \t] | [\p{Zs}\t] |
[:cntrl:] | Control characters | [\x00-\x1F\x7F] | \p{Cc} |
[:digit:] | Digits | [0-9] | \p{Nd} |
[:graph:] | Visible characters (anything aside from spaces and control characters) | [\x21-\x7E] | [^\p{Z}\p{C}] |
[:lower:] | Lowercase letters | [a-z] | \p{Ll} |
[:print:] | Visible characters and spaces (anything aside from control characters) | [\x20-\x7E] | \P{C} |
[:punct:] | Symbols | [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~] | \p{P} |
[:space:] | Whitespaces, incl. line breaks | [ \t\r\n\v\f] | [\p{Z}\t\r\n\v\f] |
[:upper:] | Uppercase letters | [A-Z] | \p{Lu} |
[:word:] | Word characters (letters, numbers and underbars) | [A-Za-z0-9_] | [\p{L}\p{Nl}\p{Nd}\p{Pc}] |
[:xdigit:] | Hex digits | [A-Fa-f0-9] | |
API
(The API is inspired by Python’s `re` module)
RegEx RegEx.new(string pattern, string flags)
Creates a RegEx pattern without delimiters.
RegEx RegEx.fromstring(string string)
Parses a RegEx literal such as `/example/i` and compiles it. The delimiter is your choice (although I recommend `/`) and can be any character except a backslash or an alphanumeric ASCII character. So `/example/i`, `~example~i`, `%example%i`, etc. are valid, but `1example1i`, `\example\i`, `aexamplea1`, etc. aren’t. To use the delimiter as a literal character without closing the pattern, escape it with a backslash, e.g. `/\//i` is interpreted as a literal `/` with a flag of `i`.
string RegEx.escape(string string)
Escapes the string so it’ll be treated as plain text in this RegEx flavour.
string/nil RegEx.type(any value)
Returns `"RegEx"` if it’s a RegEx pattern created using RegEx.new or RegEx.fromstring.
Returns `"Match"` if it’s a RegEx match created by RegEx.match, or one passed by RegEx.sub to the `repl` argument when `repl` is a function.
Otherwise it returns `nil`.
RegEx.Match class
A RegEx.Match object can only be created by `RegEx.match` and `RegEx.matchall`, or received as the argument passed to `repl` in `RegEx.sub` when `repl` is a function.
number, number RegEx.Match.span(Match match, number/string group)
Returns the span of the matched RegEx
string RegEx.Match.group(Match match, number/string group)
Returns the match of the group as a string
string… RegEx.Match.groups(Match match)
Returns the matches as multiple strings, one per capturing group.
If there are no capturing groups, returns the entire match as a single string.
table RegEx.Match.grouparr(Match match)
Returns the match as a table in array form, along with the `n` key for the length. (Why? Because some matches might contain `nil`; for example, `RegEx.match("/(abc)(def)?/", "abc"):grouparr()` has a length of `2` but item `2` is `nil`.)
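A minimal sketch of why the `n` key is useful, built only on the calls described above. The `RegEx` parameter is assumed to be the table returned by requiring this module (where you place the module is up to you), and `listGroups` is a hypothetical helper name, not part of the module.

```lua
-- `RegEx` is assumed to be the table returned by require()-ing this module.
-- `listGroups` is a hypothetical helper name used only for illustration.
local function listGroups(RegEx, subject)
    local match = RegEx.match("/(abc)(def)?/", subject)
    if not match then
        return nil
    end

    local groups = match:grouparr()
    local out = {}
    -- Loop up to groups.n instead of #groups: the # operator is unreliable
    -- when the array contains nil entries.
    for i = 1, groups.n do
        out[i] = tostring(groups[i])
    end
    return out
end
```

Going by the description above, calling this with the subject `"abc"` should visit two entries, the second of which is `nil` (stored here as the string `"nil"`).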
table RegEx.Match.groupdict(Match match)
Returns the match as a table, with the key as the named capturing groups.
If there are no named capturing groups, it’ll return an empty table.
Methods
You can call these as methods on RegEx patterns, e.g. `RegEx.new(pattern, flag):match(string)`, or call them directly. Both `RegEx.match("/pattern/", string)` and `RegEx.match(RegEx.new(pattern, flag), string)` are accepted.
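A short sketch of the two call styles side by side. The `RegEx` parameter is assumed to be the table returned by requiring this module, `matchBothWays` is a hypothetical name, and writing the flag after the closing delimiter in the string form is assumed to work the same way as in `RegEx.fromstring`.

```lua
-- `RegEx` is assumed to be the table returned by require()-ing this module.
-- `matchBothWays` is a hypothetical helper name used only for illustration.
local function matchBothWays(RegEx, text)
    -- Style 1: compile once, then use the method call.
    local compiled = RegEx.new("colou?r", "i")
    local viaMethod = compiled:match(text)

    -- Style 2: call the function directly with a RegEx literal string.
    -- Assumption: the trailing "i" is read as a flag, as in RegEx.fromstring.
    local viaFunction = RegEx.match("/colou?r/i", text)

    -- Each result is a Match object, or nil if nothing matched.
    return viaMethod, viaFunction
end
```

Compiling once with `RegEx.new` and reusing the result is the more natural choice when the same pattern is applied to many strings.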
boolean RegEx.test(RegEx pattern, string string, number init)
Returns a boolean indicating whether the RegEx pattern matches.
Match/nil RegEx.match(RegEx pattern, string string, number init)
Returns the match. If the match cannot be found, returns `nil`.
function RegEx.matchall(RegEx pattern, string string, number init)
Returns a function that acts like an iterator. Whenever it’s called it’ll return the next match.
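Because the returned value is a plain function, it can be dropped straight into a generic `for` loop. The sketch below assumes `RegEx` is the table returned by requiring this module and that the iterator returns `nil` once there are no more matches; `collectNumbers` is a hypothetical name.

```lua
-- `RegEx` is assumed to be the table returned by require()-ing this module.
-- `collectNumbers` is a hypothetical helper name used only for illustration.
local function collectNumbers(RegEx, text)
    local pattern = RegEx.new("([0-9]+)")
    local found = {}
    -- Assumption: the iterator returns nil when the matches run out,
    -- which is what ends a generic for loop.
    for match in RegEx.matchall(pattern, text) do
        -- Group 1 is the capturing group around the digits.
        table.insert(found, match:group(1))
    end
    return found
end
```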
string, number RegEx.sub(RegEx pattern, string/function[/table] repl, string string, number n, boolean match_class, string flags)
Returns a string in which all occurrences of the pattern have been replaced by `repl`, followed by the number of occurrences replaced. `repl` can be a string, a function, or (with the `o` flag) a table.
If `repl` is a string:
It’ll substitute the following sequences if it contains `$` or `\`:
Substitute characters | Description |
---|---|
$number | Substitute the character by the capturing group identified by the number |
${name/number} | Substitute the character by the capturing group identified by the name or a number if it’s a number |
\number | Same as $number |
If `repl` is a function/table:
It’ll pass a `Match` object as an argument and call the function (or index the table with the full match string), then take the first returned value (if it returns no value then it’ll be treated as `nil`). If the returned value is a string or a number, then the replacement is literally the returned value (no substitutions; converted to a string if it’s a number), if the returned value is a Match object that was passed as the argument, otherwise the replacement will be treated as an empty string (or the entire match if it’s `false` or `nil` and it has the `o` flag).
In version 1.0.0a2+ the `flags` argument accepts the following:
Flag | Name | Description |
---|---|---|
o | Lua 5.1 (one) mode | This module defaults to Lua 5.0 behaviour. Turn on this flag so that it accepts a table as the argument, false and nil return the full match, and any other value errors |
l | literal | The repl argument is treated as a literal |
u | unknown unset | Unknown groups are treated as unset |
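A minimal sketch of a string replacement using the `$number` form described above. The `RegEx` parameter is assumed to be the table returned by requiring this module, and `swapWords` is a hypothetical name; the optional `n`, `match_class` and `flags` arguments are simply left out.

```lua
-- `RegEx` is assumed to be the table returned by require()-ing this module.
-- `swapWords` is a hypothetical helper name used only for illustration.
local function swapWords(RegEx, text)
    -- Two capturing groups separated by a single space.
    local pattern = RegEx.new("([a-z]+) ([a-z]+)")
    -- "$2 $1" refers back to the second and first capturing groups.
    -- RegEx.sub returns the new string and the number of replacements.
    local result, count = RegEx.sub(pattern, "$2 $1", text)
    return result, count
end
```

Going by the description above, `swapWords(RegEx, "hello world")` would be expected to swap the two words and report one replacement, though this is untested against the alpha release.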
table RegEx.split(RegEx pattern, string string, number n)
Splits the string into a table, using the `pattern` argument as the separator pattern. For example, `RegEx.split("/[[:space:]]/", "hello world")` returns `{"hello", "world"}`.
Differences from PCRE2
This flavour of RegEx has many differences from PCRE2:
- POSIX syntax for word boundaries (`[[:<:]]` and `[[:>:]]`) isn’t available, as it’s a legacy feature that I presume only exists for compatibility. I have no plans to include it.
- This flavour of RegEx has undefined limits, while PCRE2 has specified limits.
- This flavour doesn’t have version conditions, so something like `(?(VERSION>=version)before|after)` isn’t valid. I might include it, but why do you need it?
- The only verbs that are supported are the newline conventions `(*CR)`, `(*LF)`, `(*CRLF)`, `(*ANYCRLF)`, `(*ANY)` and `(*NUL)`; accept, fail, prune and skip, `(*ACCEPT)`, `(*FAIL)`, `(*PRUNE)` and `(*SKIP)`; and the lookarounds `(*pla:)`, `(*plb:)`, `(*nla:)` and `(*nlb:)`.

And features that this RegEx implementation doesn’t have for now but that are planned:
- Internal option settings
- `\Q` and `\E` literals
- Unicode extended grapheme cluster `\X`
- Conditional groups
- Control escape characters, e.g. Ctrl + X is `\cX`
- Probably many more.
Why is it PCRE2-based? Why isn’t it ${insert any other RegEx flavour}-based?
Because it’s my decision ;). Things like this get arbitrary; I don’t think there’s a reason behind it. I could have based it on the ECMAScript, ICU, .NET, Python or POSIX (BRE or ERE) flavour of RegEx instead if I wanted to.
Where to get it?
You can get it here (2020 implementation)
1.0.0a2 (expat licence version as requested by someone I won’t name for privacy reasons): RegEx (expat licence version).rbxm (161.5 KB)
1.0.0a2 (BSD 2-Clause licence): RegEx.rbxm (161.5 KB)
1.0.0a1: RegEx.rbxm (156.9 KB)