Regex modifiers (flags)

Remarks:

PCRE Modifiers

Modifier              Inline  Description
PCRE_CASELESS         (?i)    Case-insensitive match
PCRE_MULTILINE        (?m)    Multiline matching: ^ and $ also match at line breaks
PCRE_DOTALL           (?s)    . also matches newlines
PCRE_ANCHORED         n/a     Pattern can match only at the start of the subject
PCRE_EXTENDED         (?x)    Whitespace in the pattern is ignored
PCRE_DOLLAR_ENDONLY   n/a     Meta-character $ matches only at the very end
PCRE_EXTRA            (?X)    Strict escape parsing
PCRE_UTF8             n/a     Handles UTF-8 characters
PCRE_UTF16            n/a     Handles UTF-16 characters
PCRE_UTF32            n/a     Handles UTF-32 characters
PCRE_UNGREEDY         (?U)    Sets the engine to lazy matching
PCRE_NO_AUTO_CAPTURE  (?:)    Disables auto-capturing groups

Java Modifiers

Modifier (Pattern.###)   Value  Description
UNIX_LINES               1      Enables Unix lines mode.
CASE_INSENSITIVE         2      Enables case-insensitive matching.
COMMENTS                 4      Permits whitespace and comments in a pattern.
MULTILINE                8      Enables multiline mode.
LITERAL                  16     Enables literal parsing of the pattern.
DOTALL                   32     Enables dotall mode.
UNICODE_CASE             64     Enables Unicode-aware case folding.
CANON_EQ                 128    Enables canonical equivalence.
UNICODE_CHARACTER_CLASS  256    Enables the Unicode version of predefined character classes and POSIX character classes.

DOTALL modifier

The DOTALL modifier (expressed with s in most regex flavors) changes the behavior of . so that it also matches a newline (LF) character:

/cat (.*?) dog/s

This Perl-style regex will match a string like "cat fled from\na dog" capturing "fled from\na" into Group 1.

An inline version: (?s) (e.g. (?s)cat (.*?) dog)

Note: In Ruby, the DOTALL equivalent is the m modifier, Regexp::MULTILINE (e.g. /a.*b/m).

Note: Before ECMAScript 2018, JavaScript did not provide a DOTALL modifier, so a . could never match a newline character (ES2018 added the s flag). In older engines a workaround is necessary, e.g. substituting every . with a catch-all character class like [\S\s], or with the "not nothing" character class [^] (however, most other engines treat this construct as an error, so it is not portable).
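The DOTALL behavior described above can be sketched in Python, whose re module exposes the modifier as re.DOTALL (also re.S) and as the inline (?s). This is only an illustration using the sample string from this section; the variable names are made up for the example.

```python
import re

text = "cat fled from\na dog"

# Without DOTALL, "." stops at the newline, so the match cannot span both lines.
no_flag = re.search(r"cat (.*?) dog", text)

# With re.DOTALL, or the inline (?s), "." also matches "\n".
with_flag = re.search(r"cat (.*?) dog", text, re.DOTALL)
inline = re.search(r"(?s)cat (.*?) dog", text)

print(no_flag)                   # None
print(repr(with_flag.group(1)))  # 'fled from\na'
print(repr(inline.group(1)))     # 'fled from\na'
```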

MULTILINE modifier

The MULTILINE modifier makes the ^ and $ anchors match the start/end of each line, not only the start/end of the whole string. It is usually expressed with the m flag. Oniguruma (e.g. Ruby) is an exception: there, m denotes the DOTALL modifier.

/^My Line \d+$/gm

will find all lines that consist of My Line, a space, and 1 or more digits running up to the line end.

An inline version: (?m) (e.g. (?m)^My Line \d+$)

NOTE: In Oniguruma (e.g. in Ruby), and also in almost all text editors that support regular expressions, the ^ and $ anchors denote line start/end positions by default. Use \A to match the start of the whole document/string and \z to match its end. The difference between \Z and \z is that the former can also match before a final newline (LF) at the end of the string (e.g. /\Astring\Z/ will find a match in "string\n"). Python is an exception: its \Z behaves like \z in other flavors, and older Python versions do not support \z at all.
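As a rough illustration of the MULTILINE behavior, here is a Python sketch using re.MULTILINE (also re.M). The sample text is invented for the example. Note that \A keeps its whole-string meaning even when the flag is on.

```python
import re

text = "My Line 1\nMy Line 22\nOther line\n"

# Without MULTILINE, ^ and $ refer to the whole string, so nothing matches.
plain = re.findall(r"^My Line \d+$", text)

# With MULTILINE, ^ and $ match at the start/end of every line.
multi = re.findall(r"^My Line \d+$", text, re.MULTILINE)

# \A still means "start of the whole string", even in multiline mode.
first_only = re.findall(r"(?m)\AMy Line \d+$", text)

print(plain)       # []
print(multi)       # ['My Line 1', 'My Line 22']
print(first_only)  # ['My Line 1']
```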

IGNORE CASE modifier

The common modifier to ignore case is i:

/fog/i

will match Fog, foG, etc.

The inline version of the modifier looks like (?i).

Notes:

In Java, case-insensitive matching by default assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag together with the CASE_INSENSITIVE flag (e.g. Pattern p = Pattern.compile("YOUR_REGEX", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);). More on this can be found in Case-Insensitive Matching in Java RegEx. UNICODE_CHARACTER_CLASS can also be used to make matching Unicode-aware.
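For comparison, a small Python sketch of the same modifier, re.IGNORECASE (also re.I). Unlike the Java default described above, Python 3 folds case for non-ASCII letters in str patterns unless re.ASCII is added. The sample strings are chosen only for illustration.

```python
import re

# Case-sensitive by default; the flag or the inline (?i) relaxes that.
sensitive = bool(re.search(r"fog", "FoG"))
flagged = bool(re.search(r"fog", "FoG", re.IGNORECASE))
inline = bool(re.search(r"(?i)fog", "foG"))

# Non-ASCII letters (here U+0105 vs U+0104) are folded too, unless re.ASCII is set.
unicode_fold = bool(re.search("\u0105", "\u0104", re.IGNORECASE))
ascii_fold = bool(re.search("\u0105", "\u0104", re.IGNORECASE | re.ASCII))

print(sensitive, flagged, inline)  # False True True
print(unicode_fold, ascii_fold)    # True False
```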

VERBOSE / COMMENT / IgnorePatternWhitespace modifier

This modifier allows whitespace inside most parts of the pattern, so that it can be formatted for readability, and allows comments starting with #:

/(?x)^          # start of string
  (?=\D*\d)     # the string should contain at least 1 digit 
  (?!\d+$)      # the string cannot consist of digits only
  \#            # the string starts with a hash symbol
  [a-zA-Z0-9]+  # the string should have 1 or more alphanumeric symbols
  $             # end of string
/

Example of a matching string: #word1here. Note that the # symbol is escaped to denote a literal # that is part of the pattern.

Unescaped whitespace in the regular expression pattern is ignored; escape it to make it a part of the pattern.

Usually, whitespace inside character classes ([...]) is treated as literal whitespace, except in Java.

It is also worth mentioning that the PCRE, .NET, Python, Ruby (Oniguruma), ICU and Boost regex flavors support (?#...) comments inside the regex pattern.
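The verbose pattern above can be tried in Python with re.VERBOSE (also re.X). This is a sketch only: the test strings other than #word1here are invented, and the last two lines illustrate how unescaped and escaped whitespace are treated.

```python
import re

pattern = re.compile(r"""
    ^               # start of string
    (?=\D*\d)       # the string should contain at least 1 digit
    (?!\d+$)        # the string cannot consist of digits only
    \#              # the string starts with a hash symbol
    [a-zA-Z0-9]+    # the string should have 1 or more alphanumeric symbols
    $               # end of string
""", re.VERBOSE)

ok = bool(pattern.search("#word1here"))       # matches
no_digit = bool(pattern.search("#wordhere"))  # fails the (?=\D*\d) lookahead
no_hash = bool(pattern.search("word1here"))   # fails on the literal '#'

# An unescaped space is ignored; an escaped one is part of the pattern.
ignored_space = bool(re.search(r"a b", "ab", re.VERBOSE))
literal_space = bool(re.search(r"a\ b", "a b", re.VERBOSE))

print(ok, no_digit, no_hash)         # True False False
print(ignored_space, literal_space)  # True True
```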

Explicit Capture modifier

This is a .NET-specific regex modifier expressed with n. When it is used, unnamed groups (like (\d+)) are not captured. The only valid captures are explicitly named groups (e.g. (?<name> subexpression)).

(?n)(\d+)-(\w+)-(?<id>\w+)

will match the whole 123-1_abc-00098, but (\d+) and (\w+) won't create groups in the resulting match object. The only group will be ${id}.

UNICODE modifier

The UNICODE modifier, usually expressed as u (PHP, Python) or U (Java), makes the regex engine treat the pattern and the input string as Unicode, and makes shorthand classes such as \w, \d and \s Unicode-aware.

/\A\p{L}+\z/u

is a PHP regex to match strings that consist of 1 or more Unicode letters.

Note that in PHP, the /u modifier makes the PCRE engine handle strings as UTF-8 (by turning on the PCRE_UTF8 option) and makes the shorthand character classes in the pattern Unicode-aware (by enabling the PCRE_UCP option; see more at pcre.org).

Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

In Python 2.x, re.UNICODE only affects the pattern itself: it makes \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.

An inline version: (?u) in Python, (?U) in Java. For example (Python 2 first, then Java):

print(re.findall(ur"(?u)\w+", u"Dąb")) # [u'D\u0105b']
print(re.findall(r"\w+", u"Dąb"))      # [u'D', u'b']

System.out.println("Dąb".matches("(?U)\\w+")); // true
System.out.println("Dąb".matches("\\w+"));     // false
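The Python lines above use Python 2 syntax. In Python 3, str patterns are Unicode-aware by default, so the situation is reversed: the re.ASCII flag (inline (?a)) is what restricts \w and friends to ASCII. A minimal sketch, with the letter ą written as the escape \u0105:

```python
import re

word = "D\u0105b"  # "Dąb"

default = re.findall(r"\w+", word)             # Unicode-aware by default
ascii_only = re.findall(r"\w+", word, re.ASCII)
inline_ascii = re.findall(r"(?a)\w+", word)

print(default == [word])  # True
print(ascii_only)         # ['D', 'b']
print(inline_ascii)       # ['D', 'b']
```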

PCRE_DOLLAR_ENDONLY modifier

The PCRE-compliant PCRE_DOLLAR_ENDONLY modifier makes the $ anchor match only at the very end of the string (and not also before a final newline in the string).

/^\d+$/D

is equal to

/^\d+\z/

and matches a whole string that consists of 1 or more digits: it will match "123" but not "123\n".
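Python's re module has no D flag, but the same distinction can be observed there, assuming Python's \Z is used in place of \z (in Python, \Z matches only at the very end of the string). A short sketch:

```python
import re

results = {}
for subject in ("123", "123\n"):
    # $ also matches just before a final newline; \Z does not.
    dollar = bool(re.search(r"^\d+$", subject))
    end_only = bool(re.search(r"^\d+\Z", subject))
    results[subject] = (dollar, end_only)
    print(repr(subject), dollar, end_only)
# '123' True True
# '123\n' True False
```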

PCRE_UNGREEDY modifier

The PCRE-compliant PCRE_UNGREEDY flag is expressed with /U. It inverts the greediness of quantifiers inside a pattern: they become lazy by default and greedy when followed by ?, so /a.*?b/U is equivalent to /a.*b/ and vice versa.
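Python's re module has no equivalent of the U flag, so the sketch below only shows the two quantifier behaviors that the flag swaps. The sample string is invented for the example.

```python
import re

text = "a1b2b"

# Greedy: .* takes as much as possible, so the match ends at the last "b".
greedy = re.search(r"a.*b", text).group()

# Lazy: .*? takes as little as possible, so the match ends at the first "b".
lazy = re.search(r"a.*?b", text).group()

print(greedy)  # a1b2b
print(lazy)    # a1b
```

Under PCRE_UNGREEDY the two patterns would trade these results.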

PCRE_INFO_JCHANGED modifier

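One more PCRE modifier that allows the use of duplicate named groups. The underlying PCRE compile option is PCRE_DUPNAMES.

NOTE: in older PHP versions only the inline version, (?J), is supported, and it must be placed at the start of the pattern. As of PHP 7.2.0, J can also be used as a regular modifier.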

If you use

/(?J)\w+-(?:new-(?<val>\w+)|\d+-empty-(?<val>[^-]+)-collection)/

the "val" group will never be empty (it will always be set, whichever alternative matches). A similar effect can be achieved with a branch reset group, (?|...).

PCRE_EXTRA modifier

This PCRE modifier causes an error if any backslash in a pattern is followed by a letter that has no special meaning. By default, a backslash followed by a letter with no special meaning is treated as a literal.

E.g.

/big\y/

will match bigy, but

/big\y/X

will cause a compilation error.

Inline version: (?X)
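Python's re module has no X-style flag of this kind (its re.X means VERBOSE), but recent Python 3 versions always reject an unknown escape made of a backslash and an ASCII letter, which gives a feel for what strict escape parsing looks like. A sketch under that assumption:

```python
import re

try:
    re.compile(r"big\y")  # \y has no meaning in Python's regex syntax
    outcome = "compiled"
except re.error as exc:
    # The exact message wording may differ between Python versions.
    outcome = f"rejected: {exc}"

print(outcome)
```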
