Getting started with C++TemplatesMetaprogrammingIteratorsReturning several values from a functionstd::stringNamespacesFile I/OClasses/StructuresSmart PointersFunction Overloadingstd::vectorOperator OverloadingLambdasLoopsstd::mapThreadingValue CategoriesPreprocessorSFINAE (Substitution Failure Is Not An Error)The Rule of Three, Five, And ZeroRAII: Resource Acquisition Is InitializationExceptionsImplementation-defined behaviorSpecial Member FunctionsRandom number generationReferencesSortingRegular expressionsPolymorphismPerfect ForwardingVirtual Member FunctionsUndefined BehaviorValue and Reference SemanticsOverload resolutionMove SemanticsPointers to membersPimpl Idiomstd::function: To wrap any element that is callableconst keywordautostd::optionalCopy ElisionBit OperatorsFold ExpressionsUnionsUnnamed typesmutable keywordBit fieldsstd::arraySingleton Design PatternThe ISO C++ StandardUser-Defined LiteralsEnumerationType ErasureMemory managementBit ManipulationArraysPointersExplicit type conversionsRTTI: Run-Time Type InformationStandard Library AlgorithmsFriend keywordExpression templatesScopesAtomic Typesstatic_assertoperator precedenceconstexprDate and time using <chrono> headerTrailing return typeFunction Template OverloadingCommon compile/linker errors (GCC)Design pattern implementation in C++Optimization in C++Compiling and BuildingType Traitsstd::pairKeywordsOne Definition Rule (ODR)Unspecified behaviorFloating Point ArithmeticArgument Dependent Name Lookupstd::variantAttributesInternationalization in C++ProfilingReturn Type CovarianceNon-Static Member FunctionsRecursion in C++Callable Objectsstd::iomanipConstant class member functionsSide by Side Comparisons of classic C++ examples solved via C++ vs C++11 vs C++14 vs C++17The This PointerInline functionsCopying vs AssignmentClient server examplesHeader FilesConst Correctnessstd::atomicsData Structures in C++Refactoring TechniquesC++ StreamsParameter packsLiteralsFlow ControlType KeywordsBasic Type KeywordsVariable Declaration KeywordsIterationtype deductionstd::anyC++11 Memory ModelBuild SystemsConcurrency With OpenMPType Inferencestd::integer_sequenceResource Managementstd::set and std::multisetStorage class specifiersAlignmentInline variablesLinkage specificationsCuriously Recurring Template Pattern (CRTP)Using declarationTypedef and type aliasesLayout of object typesC incompatibilitiesstd::forward_listOptimizationSemaphoreThread synchronization structuresC++ Debugging and Debug-prevention Tools & TechniquesFutures and PromisesMore undefined behaviors in C++MutexesUnit Testing in C++Recursive MutexdecltypeUsing std::unordered_mapDigit separatorsC++ function "call by value" vs. "call by reference"Basic input/output in c++Stream manipulatorsC++ ContainersArithmitic Metaprogramming

Regular expressions

Other topics

Basic regex_match and regex_search Examples

const auto input = "Some people, when confronted with a problem, think \"I know, I'll use regular expressions.\""s;
smatch sm;

cout << input << endl;

// If input ends in a quotation that contains a word that begins with "reg" and another word begining with "ex" then capture the preceeding portion of input
if (regex_match(input, sm, regex("(.*)\".*\\breg.*\\bex.*\"\\s*$"))) {
    const auto capture = sm[1].str();
    
    cout << '\t' << capture << endl; // Outputs: "\tSome people, when confronted with a problem, think\n"
    
    // Search our capture for "a problem" or "# problems"
    if(regex_search(capture, sm, regex("(a|d+)\\s+problems?"))) {
        const auto count = sm[1] == "a"s ? 1 : stoi(sm[1]);
        
        cout << '\t' << count << (count > 1 ? " problems\n" : " problem\n"); // Outputs: "\t1 problem\n"
        cout << "Now they have " << count + 1 << " problems.\n"; // Ouputs: "Now they have 2 problems\n"
    }
}

Live Example

regex_replace Example

This code takes in various brace styles and converts them to One True Brace Style:

const auto input = "if (KnR)\n\tfoo();\nif (spaces) {\n    foo();\n}\nif (allman)\n{\n\tfoo();\n}\nif (horstmann)\n{\tfoo();\n}\nif (pico)\n{\tfoo(); }\nif (whitesmiths)\n\t{\n\tfoo();\n\t}\n"s;

cout << input << regex_replace(input, regex("(.+?)\\s*\\{?\\s*(.+?;)\\s*\\}?\\s*"), "$1 {\n\t$2\n}\n") << endl;

Live Example

regex_token_iterator Example

A std::regex_token_iterator provides a tremendous tool for extracting elements of a Comma Separated Value file. Aside from the advantages of iteration, this iterator is also able to capture escaped commas where other methods struggle:

const auto input = "please split,this,csv, ,line,\\,\n"s;
const regex re{ "((?:[^\\\\,]|\\\\.)+)(?:,|$)" };
const vector<string> m_vecFields{ sregex_token_iterator(cbegin(input), cend(input), re, 1), sregex_token_iterator() };

cout << input << endl;

copy(cbegin(m_vecFields), cend(m_vecFields), ostream_iterator<string>(cout, "\n"));

Live Example

A notable gotcha with regex iterators is, that the regex argument must be an L-value. An R-value will not work.

regex_iterator Example

When processing of captures has to be done iteratively a regex_iterator is a good choice. Dereferencing a regex_iterator returns a match_result. This is great for conditional captures or captures which have interdependence. Let's say that we want to tokenize some C++ code. Given:

enum TOKENS {
    NUMBER,
    ADDITION,
    SUBTRACTION,
    MULTIPLICATION,
    DIVISION,
    EQUALITY,
    OPEN_PARENTHESIS,
    CLOSE_PARENTHESIS
};

We can tokenize this string: const auto input = "42/2 + -8\t=\n(2 + 2) * 2 * 2 -3"s with a regex_iterator like this:

vector<TOKENS> tokens;
const regex re{ "\\s*(\\(?)\\s*(-?\\s*\\d+)\\s*(\\)?)\\s*(?:(\\+)|(-)|(\\*)|(/)|(=))" };

for_each(sregex_iterator(cbegin(input), cend(input), re), sregex_iterator(), [&](const auto& i) {
    if(i[1].length() > 0) {
        tokens.push_back(OPEN_PARENTHESIS);
    }
    
    tokens.push_back(i[2].str().front() == '-' ? NEGATIVE_NUMBER : NON_NEGATIVE_NUMBER);
    
    if(i[3].length() > 0) {
        tokens.push_back(CLOSE_PARENTHESIS);
    }        
    
    auto it = next(cbegin(i), 4);
    
    for(int result = ADDITION; it != cend(i); ++result, ++it) {
        if (it->length() > 0U) {
            tokens.push_back(static_cast<TOKENS>(result));
            break;
        }
    }
});

match_results<string::const_reverse_iterator> sm;

if(regex_search(crbegin(input), crend(input), sm, regex{ tokens.back() == SUBTRACTION ? "^\\s*\\d+\\s*-\\s*(-?)" : "^\\s*\\d+\\s*(-?)" })) {
    tokens.push_back(sm[1].length() == 0 ? NON_NEGATIVE_NUMBER : NEGATIVE_NUMBER);
}

Live Example

A notable gotcha with regex iterators is that the regex argument must be an L-value, an R-value will not work: Visual Studio regex_iterator Bug?

Splitting a string

std::vector<std::string> split(const std::string &str, std::string regex)
{
    std::regex r{ regex };
    std::sregex_token_iterator start{ str.begin(), str.end(), r, -1 }, end;
    return std::vector<std::string>(start, end);
}
split("Some  string\t with whitespace ", "\\s+"); // "Some", "string", "with", "whitespace"

Quantifiers

Let's say that we're given const string input as a phone number to be validated. We could start by requiring a numeric input with a zero or more quantifier: regex_match(input, regex("\\d*")) or a one or more quantifier: regex_match(input, regex("\\d+")) But both of those really fall short if input contains an invalid numeric string like: "123" Let's use a n or more quantifier to ensure that we're getting at least 7 digits:

regex_match(input, regex("\\d{7,}"))

This will guarantee that we will get at least a phone number of digits, but input could also contain a numeric string that's too long like: "123456789012". So lets go with a between n and m quantifier so the input is at least 7 digits but not more than 11:

regex_match(input, regex("\\d{7,11}"));

This gets us closer, but illegal numeric strings that are in the range of [7, 11] are still accepted, like: "123456789" So let's make the country code optional with a lazy quantifier:

regex_match(input, regex("\\d?\\d{7,10}"))

It's important to note that the lazy quantifier matches as few characters as possible, so the only way this character will be matched is if there are already 10 characters that have been matched by \d{7,10}. (To match the first character greedily we would have had to do: \d{0,1}.) The lazy quantifier can be appended to any other quantifier.

Now, how would we make the area code optional and only accept a country code if the area code was present?

regex_match(input, regex("(?:\\d{3,4})?\\d{7}"))

In this final regex, the \d{7} requires 7 digits. These 7 digits are optionally preceded by either 3 or 4 digits.

Note that we did not append the lazy quantifier: \d{3,4}?\d{7}, the \d{3,4}? would have matched either 3 or 4 characters, preferring 3. Instead we're making the non-capturing group match at most once, preferring not to match. Causing a mismatch if input didn't include the area code like: "1234567".


In conclusion of the quantifier topic, I'd like to mention the other appending quantifier that you can use, the possessive quantifier. Either the lazy quantifier or the possessive quantifier can be appended to any quantifier. The possessive quantifier's only function is to assist the regex engine by telling it, greedily take these characters and don't ever give them up even if it causes the regex to fail. This for example doesn't make much sense: regex_match(input, regex("\\d{3,4}+\\d{7})) Because an input like: "1234567890" wouldn't be matched as \d{3,4}+ will always match 4 characters even if matching 3 would have allowed the regex to succeed.
The possessive quantifier is best used when the quantified token limits the number of matchable characters. For example:

regex_match(input, regex("(?:.*\\d{3,4}+){3}"))

Can be used to match if input contained any of the following:

123 456 7890
123-456-7890
(123)456-7890
(123) 456 - 7890

But when this regex really shines is when input contains an illegal input:

12345 - 67890

Without the possessive quantifier the regex engine has to go back and test every combination of .* and either 3 or 4 characters to see if it can find a matchable combination. With the possessive quantifier the regex starts where the 2nd possessive quantifier left off, the '0' character, and the regex engine tries to adjust the .* to allow \d{3,4} to match; when it can't the regex just fails, no back tracking is done to see if earlier .* adjustment could have allowed a match.

Anchors

C++ provides only 4 anchors:

  • ^ which asserts the start of the string
  • $ which asserts the end of the string
  • \b which asserts a \W character or the beginning or end of the string
  • \B which asserts a \w character

Let's say for example we want to capture a number with it's sign:

auto input = "+1--12*123/+1234"s;
smatch sm;

if(regex_search(input, sm, regex{ "(?:^|\\b\\W)([+-]?\\d+)" })) {

    do {
        cout << sm[1] << endl;
        input = sm.suffix().str();
    } while(regex_search(input, sm, regex{ "(?:^\\W|\\b\\W)([+-]?\\d+)" }));
}

Live Example

An important note here is that the anchor does not consume any characters.

Syntax:

  • regex_match // Returns whether the entire character sequence was matched by the regex, optionally capturing into a match object
  • regex_search // Returns whether a portion of the character sequence was matched by the regex, optionally capturing into a match object
  • regex_replace // Returns the input character sequence as modified by a regex via a replacement format string
  • regex_token_iterator // Initialized with a character sequence defined by iterators, a list of capture indexes to iterate over, and a regex. Dereferencing returns the currently indexed match of the regex. Incrementing moves to the next capture index or if currently at the last index, resets the index and hinds the next occurrence of a regex match in the character sequence
  • regex_iterator // Initialized with a character sequence defined by iterators and a regex. Dereferencing returns the portion of the character sequence the entire regex currently matches. Incrementing finds the next occurrence of a regex match in the character sequence

Parameters:

SignatureDescription
bool regex_match(BidirectionalIterator first, BidirectionalIterator last, smatch& sm, const regex& re, regex_constraints::match_flag_type flags)BidirectionalIterator is any character iterator that provides increment and decrement operators smatch may be cmatch or any other other variant of match_results that accepts the type of BidirectionalIterator the smatch argument may be ommitted if the results of the regex are not needed Returns whether re matches the entire character sequence defined by first and last
bool regex_match(const string& str, smatch& sm, const regex re&, regex_constraints::match_flag_type flags)string may be either a const char* or an L-Value string, the functions accepting an R-Value string are explicitly deleted smatch may be cmatch or any other other variant of match_results that accepts the type of str the smatch argument may be ommitted if the results of the regex are not needed Returns whether re matches the entire character sequence defined by str

Contributors

Topic Id: 1681

Example Ids: 5423,5424,5425,5426,13422,23134,23701

This site is not affiliated with any of the contributors.