Pattern matching

Remarks:

Throughout some examples, the notation (<string literal>):function <string literal> is used, which is equivalent to string.function(<string literal>, <string literal>) because all strings have a metatable with the __index field set to the string table.
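
As a quick check of the remark above, the following sketch compares the two call styles and inspects the string metatable. The variable names are illustrative only.

```lua
local s = "Hello"

-- Method-call form: s is passed implicitly as the first argument.
local a = s:upper()
-- Explicit library form: the same function, reached through the string table.
local b = string.upper(s)
print(a, b)                 --> HELLO   HELLO

-- A string literal must be wrapped in parentheses before the colon.
-- The argument parentheses may be omitted only for a single literal argument.
local first, last = ("Hello"):find "l"
print(first, last)          --> 3       3

-- The shared metatable is what makes the method-call form work.
local same_table = getmetatable("").__index == string
print(same_table)           --> true
```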

Lua pattern matching

Instead of regular expressions, the Lua string library uses its own pattern syntax. The two are similar, but Lua patterns are more limited and use a different notation. For instance, the character sequence %a matches any letter, while its upper-case version %A matches any non-letter character. All character classes (character sequences that, as a pattern, match any one character from a set) are listed below.

Character class   Matches
%a                letters (A-Z, a-z)
%c                control characters (\n, \t, \r, ...)
%d                digits (0-9)
%l                lower-case letters (a-z)
%p                punctuation characters (!, ?, &, ...)
%s                space characters
%u                upper-case letters (A-Z)
%w                alphanumeric characters (A-Z, a-z, 0-9)
%x                hexadecimal digits (0-9, A-F, a-f)
%z                the character with representation 0 (Lua 5.1; later versions accept \0 directly in patterns)
.                 any character

As mentioned above, the upper-case version of any of these classes represents the complement of the class. For instance, %D matches any single non-digit character:

string.match("f123", "%D")          --> f

In addition to character classes, some characters have special functions as patterns:

( ) % . + - * [ ? ^ $

The character % is the escape character: %? matches a question mark and %% matches a percent sign. You can put % before any non-alphanumeric character to match it literally. This is separate from escaping inside a Lua string literal: if you need, for instance, a double quote inside a double-quoted string, you must put a backslash (\) before it, which is how characters are escaped in a Lua string.
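
A minimal sketch of both kinds of escaping follows. The helper escape_pattern is hypothetical (it is not part of the string library); it relies on the rule above that % may precede any non-alphanumeric character.

```lua
local text = 'total: 50% (approx.), say "hi"'

-- %% matches a literal percent sign.
print(text:match("%d+%%"))      --> 50%
-- %( and %) match literal parentheses.
print(text:match("%(.-%)"))     --> (approx.)
-- The backslash escapes the quote for the Lua string literal, not for the pattern.
print(text:match("\".-\""))     --> "hi"

-- Hypothetical helper: prefix every non-alphanumeric character with %
-- so that arbitrary text can be used as a literal pattern.
local function escape_pattern(raw)
    -- "%%" in the replacement is a literal %, "%0" is the whole match.
    return (raw:gsub("%W", "%%%0"))
end

print(escape_pattern("1.5+2"))  --> 1%.5%+2
```

The extra parentheses around the gsub call drop its second return value (the replacement count), so the helper returns only the escaped string.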

A character set, represented inside square brackets ([]), allows you to create a special character class, combining different classes and single characters:

local foo = "bar123bar2341"
print(foo:match "[arb]")            --> b

You can get the complement of the character set by starting it with ^:

local foo = "bar123bar2341"
print(string.match(foo, "[^bar]"))  --> 1

In this example, string.match will find the first occurrence that isn't b, a or r.
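
The sets above contain only single characters. The sketch below shows a set that mixes a class with a literal character, and one that uses ranges. Range notation such as 0-9 is not covered in the text above; it is standard Lua set syntax. The sample strings are made up for illustration.

```lua
-- A class combined with a literal character: a Lua-style identifier
-- starts with a letter or underscore, followed by letters, digits or underscores.
local ident = ("  _count2 = 5"):match("[%a_][%w_]*")
print(ident)    --> _count2

-- Ranges inside a set: one or more hexadecimal digits.
local hex = ("#1F"):match("[0-9a-fA-F]+")
print(hex)      --> 1F

-- A complemented set containing a class: the first character that is not a digit.
local sep = ("2024-01-15"):match("[^%d]")
print(sep)      --> -
```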

Patterns become more useful with repetition and optional modifiers. Lua patterns offer these four:

Character   Modifier
+           One or more repetitions
*           Zero or more repetitions
-           Also zero or more repetitions
?           Optional (zero or one occurrence)

The character + represents one or more matched characters in the sequence and it will always return the longest matched sequence:

local foo = "12345678bar123"
print(foo:match "%d+")  --> 12345678

The character * is similar to +, but it also accepts zero occurrences, and it is commonly used to match optional spaces between different patterns.

The character - is also similar to *, but instead of returning the longest matched sequence, it matches the shortest one.
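
To make the difference between * and - concrete, here is a small sketch with made-up sample strings. The first two calls use %s* for optional spaces; the last two run the same pattern greedily and lazily.

```lua
-- %s* matches any amount of whitespace, including none at all.
local padded = ("width   = 10"):match("%a+%s*=%s*%d+")
local tight  = ("width=10"):match("%a+%s*=%s*%d+")
print(padded)   --> width   = 10
print(tight)    --> width=10

local html = "<b>bold</b> and <i>italic</i>"
-- .* is greedy: it runs to the last ">" in the string.
local greedy = html:match("<.*>")
-- .- is lazy: it stops at the first ">" that lets the pattern succeed.
local lazy = html:match("<.->")
print(greedy)   --> <b>bold</b> and <i>italic</i>
print(lazy)     --> <b>
```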

The modifier ? matches an optional character, allowing you to match, for example, a number with an optional sign:

local foo = "-20"
print(foo:match "[+-]?%d+")  --> -20
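
The characters ^ and $ appear in the list of special characters but are not demonstrated above. At the start and end of a pattern they anchor the match to the beginning and end of the string. The sketch below combines them with the signed-number pattern; is_integer is a hypothetical helper name.

```lua
-- Hypothetical helper: true only if the whole string is an optionally signed integer.
local function is_integer(text)
    -- ^ anchors at the start of the string, $ at the end,
    -- so nothing may precede or follow the digits.
    return text:match("^[+-]?%d+$") ~= nil
end

print(is_integer("-20"))    --> true
print(is_integer("+7"))     --> true
print(is_integer("20px"))   --> false
print(is_integer(""))       --> false
```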

The Lua pattern matching engine provides a few additional pattern items:

Pattern item   Description
%n             for n between 1 and 9, matches a substring equal to the n-th captured string
%bxy           matches a substring between two distinct characters (balanced pair of x and y)
%f[set]        frontier pattern: matches an empty string at any position such that the next character belongs to set and the previous character does not belong to set
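
The three items above are easier to follow with one example each. This is a sketch with made-up input strings; it assumes Lua 5.2 or later, where %f is documented.

```lua
-- %1 must repeat whatever the first capture matched: find a doubled word.
local doubled = ("it is is fine"):match("(%a+) %1")
print(doubled)          --> is

-- %b() matches from an opening parenthesis to its balancing closing one.
local call = ("f(a(b)c) + g(d)"):match("%b()")
print(call)             --> (a(b)c)

-- %f[%a] marks a transition into letters and %f[%A] a transition out of them,
-- so "the" is matched only as a whole word, not inside "other".
local first, last = ("other the"):find("%f[%a]the%f[%A]")
print(first, last)      --> 7       9
```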

string.find (Introduction)

The find function

First let's take a look at the string.find function in general:

The function string.find (s, substr [, init [, plain]]) returns the start and end index of a substring if found, and nil otherwise, starting at the index init if it is provided (defaults to 1).

("Hello, I am a string"):find "am" --> returns 10 11
-- equivalent to string.find("Hello, I am a string", "am") -- see remarks
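
The optional init and plain arguments are not exercised above. In the sketch below (sample string chosen for illustration), passing true for plain turns off pattern matching so that the search string is taken literally.

```lua
local s = "a.b.c"

-- As a pattern, "." matches any character, so the first character is found.
local p1, p2 = s:find(".")
print(p1, p2)           --> 1       1

-- With plain = true, "." is an ordinary dot.
local d1, d2 = s:find(".", 1, true)
print(d1, d2)           --> 2       2

-- init = 3 starts the search at the third character.
local n1, n2 = s:find(".", 3, true)
print(n1, n2)           --> 4       4

-- Nothing is found when the substring only occurs before init.
local missing = s:find("b", 4)
print(missing)          --> nil
```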

Introducing Patterns

("hello world"):find ".- " -- will match characters until it finds a space
    --> so it will return 1, 6

All characters represent themselves except the following: ^$()%.[]*+-?. Any of these can be matched literally by preceding it with a %.

("137'5 m47ch s0m3 d1g175"):find "m%d%d" -- will match an m followed by 2 digits
    --> this will match m47 and return 7, 9

("stack overflow"):find "[abc]" -- will match an 'a', a 'b' or a 'c'
    --> this will return 3, 3 (the A in stAck)

("stack overflow"):find "[^stack ]"
    -- will match any single character EXCEPT the letters s, t, a, c and k and the space character
    --> this will match the o in overflow and return 7, 7

("hello"):find "o%d?" --> matches o, returns 5, 5
("hello20"):find "o%d?" --> matches o2, returns 5, 6
    -- the ? means the character is optional

("helllllo"):find "el+" --> will match elllll
("heo"):find "el+" --> won't match anything

("helllllo"):find "el*" --> will match elllll
("heo"):find "el*" --> will match e

("helelo"):find "h.+l" -- + will match as much as it gets
    --> this matches "helel"
("helelo"):find "h.-l" -- - will match as few as it can
    --> this will only match "hel"

("hello"):match "o%d*"
    --> like ?, this matches the "o", because %d is optional
("hello20"):match "o%d*"
    --> unlike ?, it matches as many %d as it gets, "o20"
("hello"):match "o%d+"
    --> wouldn't find anything, because + looks for 1 or more characters

The gmatch function

How it works

The string.gmatch function takes an input string and a pattern that describes what to look for. It returns an iterator function; each call to that iterator returns the next match of the pattern in the string.

type(("abc"):gmatch ".") --> returns "function"

for char in ("abc"):gmatch "." do
    print(char) -- this prints:
    --> a
    --> b
    --> c
end

for match in ("#afdde6"):gmatch "%x%x" do
    print("#" .. match) -- prints:
    --> #af
    --> #dd
    --> #e6
end

Introducing captures:

When the pattern contains captures, the iterator works the same way, but on each step it returns only the captured values instead of the full match.

for key, value in ("foo = bar, bar=foo"):gmatch "(%w+)%s*=%s*(%w+)" do
    print("key: " .. key .. ", value: " .. value)
    --> key: foo, value: bar
    --> key: bar, value: foo
end
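
Two common uses of gmatch are collecting matches into a table and splitting a string. The sketch below is one possible way to do both. The helper split is hypothetical, and it assumes sep is a single character with no special meaning inside a set (so not ^, ], % or -).

```lua
-- Collect key/value captures into a table.
local settings = {}
for key, value in ("width=10, height = 20"):gmatch("(%w+)%s*=%s*(%w+)") do
    settings[key] = tonumber(value)
end
print(settings.width, settings.height)      --> 10      20

-- Hypothetical helper: split text on a single separator character.
-- Empty fields are skipped because + requires at least one character.
local function split(text, sep)
    local parts = {}
    for piece in text:gmatch("[^" .. sep .. "]+") do
        parts[#parts + 1] = piece
    end
    return parts
end

local parts = split("a,b,,c", ",")
print(#parts, table.concat(parts, "|"))     --> 3       a|b|c
```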

The gsub function

Do not confuse this with the string.sub function, which returns a substring!

How it works

string argument

("hello world"):gsub("o", "0")
    --> returns "hell0 w0rld", 2
    -- the 2 means that 2 substrings have been replaced (the 2 Os)

("hello world, how are you?"):gsub("[^%s]+", "word")
    --> returns "word word word word word", 5
    -- punctuation is not whitespace, so "world," and "you?" are each replaced as a whole

("hello world"):gsub("([^%s])([^%s]*)", "%2%1")
    --> returns "elloh orldw", 2

function argument

local word = "[^%s]+"

function func(str)
    if str:sub(1,1):lower()=="h" then
        return str
    else
        return "no_h"
    end
end
("hello world"):gsub(word, func)
    --> returns "hello no_h", 2

table argument

local word = "%w+" -- alphanumeric runs only, so trailing punctuation is not part of the key

local sub = {}
sub["hello"] = "g'day"
sub["world"] = "m8"

("hello world"):gsub(word, sub)
    --> returns "g'day m8", 2

("hello world, how are you?"):gsub(word, sub)
    --> returns "g'day m8, how are you?", 5
    -- words that are not in the table are left unchanged (but still counted)
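
string.gsub also accepts an optional fourth argument n that limits the number of replacements, and it always returns the replacement count as a second value. The sketch below shows both, along with %0 (the whole match) in a replacement string and a replacement function that returns nil to keep the original text. The sample strings are illustrative.

```lua
local text = "hello world"

-- n = 1: replace only the first match.
local first_only, count = text:gsub("o", "0", 1)
print(first_only, count)    --> hell0 world     1

-- Extra parentheses keep only the first return value (the string).
local only_string = (text:gsub("o", "0"))
print(only_string)          --> hell0 w0rld

-- %0 in the replacement string stands for the whole match.
local wrapped = (text:gsub("%w+", "<%0>"))
print(wrapped)              --> <hello> <world>

-- A replacement function that returns nil leaves that match untouched.
local kept = (text:gsub("%w+", function(w)
    if w == "world" then
        return "Lua"
    end
    return nil
end))
print(kept)                 --> hello Lua
```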

Syntax:

  • string.find(str, pattern [, init [, plain]]) -- Returns start and end index of match in str

  • string.match(str, pattern [, index]) -- Matches a pattern once (starting at index)

  • string.gmatch(str, pattern) -- Returns a function that iterates through all matches in str

  • string.gsub(str, pattern, repl [, n]) -- Replaces substrings (up to a max of n times)

  • . represents all characters

  • %a represents all letters

  • %l represents all lowercase letters

  • %u represents all uppercase letters

  • %d represents all digits

  • %x represents all hexadecimal digits

  • %s represents all whitespace characters

  • %p represents all punctuation characters

  • %g represents all printable characters except space

  • %c represents all control characters

  • [set] represents the class which is the union of all characters in set.

  • [^set] represents the complement of set

  • * greedy match 0 or more occurrences of previous character class

  • + greedy match 1 or more occurrences of previous character class

  • - lazy match 0 or more occurrences of previous character class

  • ? match exactly 0 or 1 occurrence of previous character class
