
Regular expressions

Basic regex_match and regex_search Examples

const auto input = "Some people, when confronted with a problem, think \"I know, I'll use regular expressions.\""s;
smatch sm;

cout << input << endl;

// If input ends in a quotation that contains a word that begins with "reg" and another word beginning with "ex" then capture the preceding portion of input
if (regex_match(input, sm, regex("(.*)\".*\\breg.*\\bex.*\"\\s*$"))) {
    const auto capture = sm[1].str();
    
    cout << '\t' << capture << endl; // Outputs: "\tSome people, when confronted with a problem, think\n"
    
    // Search our capture for "a problem" or "# problems"
    if(regex_search(capture, sm, regex("(a|\\d+)\\s+problems?"))) {
        const auto count = sm[1] == "a"s ? 1 : stoi(sm[1]);
        
        cout << '\t' << count << (count > 1 ? " problems\n" : " problem\n"); // Outputs: "\t1 problem\n"
        cout << "Now they have " << count + 1 << " problems.\n"; // Outputs: "Now they have 2 problems.\n"
    }
}
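
The difference between the two functions can be reduced to a minimal sketch. The helper names is_all_digits and contains_digits are hypothetical and only serve to contrast a whole-string match with a partial one:

```cpp
#include <regex>
#include <string>

// Hypothetical helpers, not part of the example above.

// True only if the whole string consists of one or more digits.
bool is_all_digits(const std::string& s) {
    static const std::regex re{ "\\d+" };
    return std::regex_match(s, re);   // the entire sequence must match
}

// True if a run of digits appears anywhere in the string.
bool contains_digits(const std::string& s) {
    static const std::regex re{ "\\d+" };
    return std::regex_search(s, re);  // any sub-sequence may match
}
```

With these, is_all_digits("abc123") is false while contains_digits("abc123") is true, because regex_search is satisfied by the "123" portion alone.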

regex_replace Example

This code takes in various brace styles and converts them to One True Brace Style:

const auto input = "if (KnR)\n\tfoo();\nif (spaces) {\n    foo();\n}\nif (allman)\n{\n\tfoo();\n}\nif (horstmann)\n{\tfoo();\n}\nif (pico)\n{\tfoo(); }\nif (whitesmiths)\n\t{\n\tfoo();\n\t}\n"s;

cout << input << regex_replace(input, regex("(.+?)\\s*\\{?\\s*(.+?;)\\s*\\}?\\s*"), "$1 {\n\t$2\n}\n") << endl;
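
The $1 and $2 placeholders in the format string refer to capture groups. A smaller sketch of the same mechanism, using a hypothetical helper reformat_dates that assumes dates are written as YYYY-MM-DD:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrites ISO-style dates (YYYY-MM-DD) as DD/MM/YYYY.
std::string reformat_dates(const std::string& text) {
    static const std::regex re{ "(\\d{4})-(\\d{2})-(\\d{2})" };
    // $1, $2 and $3 refer to the first, second and third capture groups.
    return std::regex_replace(text, re, "$3/$2/$1");
}
```

Every match in the input is replaced; text that does not match the regex is copied through unchanged.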

regex_token_iterator Example

A std::regex_token_iterator is a convenient tool for extracting the fields of a comma-separated values (CSV) line. Aside from the advantages of iteration, this iterator can also capture escaped commas, which simpler splitting methods struggle with:

const auto input = "please split,this,csv, ,line,\\,\n"s;
const regex re{ "((?:[^\\\\,]|\\\\.)+)(?:,|$)" };
const vector<string> m_vecFields{ sregex_token_iterator(cbegin(input), cend(input), re, 1), sregex_token_iterator() };

cout << input << endl;

copy(cbegin(m_vecFields), cend(m_vecFields), ostream_iterator<string>(cout, "\n"));

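A notable gotcha with regex iterators is that the regex argument must be an L-value. The iterator keeps a pointer to the regex rather than a copy, so a temporary (R-value) regex would be destroyed before the iterator is used; current standards delete the overloads that would accept one.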
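
A minimal sketch of the safe pattern: the regex is a named object that outlives the iterators. The helper key_values and its key=value input format are assumptions made for illustration; the vector of indexes asks the iterator to yield captures 1 and 2 of every match in turn:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: flattens "key=value" pairs into {key, value, key, value, ...}.
std::vector<std::string> key_values(const std::string& text) {
    const std::regex re{ "(\\w+)=(\\w+)" };   // named L-value, alive while the iterators are used
    const std::vector<int> groups{ 1, 2 };     // capture indexes to iterate over for each match
    const std::sregex_token_iterator first(text.cbegin(), text.cend(), re, groups);
    const std::sregex_token_iterator last;     // default-constructed end-of-sequence iterator
    return std::vector<std::string>(first, last);
}
```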

regex_iterator Example

When captures have to be processed iteratively, a regex_iterator is a good choice. Dereferencing a regex_iterator returns a match_results object. This is great for conditional captures or captures which have interdependence. Let's say that we want to tokenize some C++ code. Given:

enum TOKENS {
    NON_NEGATIVE_NUMBER,
    NEGATIVE_NUMBER,
    ADDITION,
    SUBTRACTION,
    MULTIPLICATION,
    DIVISION,
    EQUALITY,
    OPEN_PARENTHESIS,
    CLOSE_PARENTHESIS
};

We can tokenize this string: const auto input = "42/2 + -8\t=\n(2 + 2) * 2 * 2 -3"s with a regex_iterator like this:

vector<TOKENS> tokens;
const regex re{ "\\s*(\\(?)\\s*(-?\\s*\\d+)\\s*(\\)?)\\s*(?:(\\+)|(-)|(\\*)|(/)|(=))" };

for_each(sregex_iterator(cbegin(input), cend(input), re), sregex_iterator(), [&](const auto& i) {
    if(i[1].length() > 0) {
        tokens.push_back(OPEN_PARENTHESIS);
    }
    
    tokens.push_back(i[2].str().front() == '-' ? NEGATIVE_NUMBER : NON_NEGATIVE_NUMBER);
    
    if(i[3].length() > 0) {
        tokens.push_back(CLOSE_PARENTHESIS);
    }        
    
    auto it = next(cbegin(i), 4);
    
    for(int result = ADDITION; it != cend(i); ++result, ++it) {
        if (it->length() > 0U) {
            tokens.push_back(static_cast<TOKENS>(result));
            break;
        }
    }
});

match_results<string::const_reverse_iterator> sm;

if(regex_search(crbegin(input), crend(input), sm, regex{ tokens.back() == SUBTRACTION ? "^\\s*\\d+\\s*-\\s*(-?)" : "^\\s*\\d+\\s*(-?)" })) {
    tokens.push_back(sm[1].length() == 0 ? NON_NEGATIVE_NUMBER : NEGATIVE_NUMBER);
}

A notable gotcha with regex iterators is that the regex argument must be an L-value; an R-value will not work, which is why re is declared as a named object above.
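
For simpler cases the iterator can be walked with an ordinary loop. This is a minimal sketch, assuming a hypothetical helper sum_integers that treats a leading '-' as part of the number:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: adds up every (optionally negative) integer found in text.
long sum_integers(const std::string& text) {
    const std::regex re{ "-?\\d+" };  // named L-value; must outlive the iterators
    long total = 0;
    for (std::sregex_iterator it{ text.cbegin(), text.cend(), re }, end; it != end; ++it) {
        total += std::stol(it->str());  // *it is an smatch; str() is the whole match
    }
    return total;
}
```

Called on "42/2 + -8" this visits the matches "42", "2" and "-8".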

Splitting a string

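std::vector<std::string> split(const std::string &str, const std::string &pattern)
{
    const std::regex r{ pattern };
    // A submatch index of -1 selects the text between matches instead of the matches themselves
    std::sregex_token_iterator start{ str.begin(), str.end(), r, -1 }, end;
    return std::vector<std::string>(start, end);
}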

split("Some  string\t with whitespace ", "\\s+"); // "Some", "string", "with", "whitespace"

Quantifiers

Let's say that we're given const string input as a phone number to be validated. We could start by requiring a numeric input with a zero or more quantifier: regex_match(input, regex("\\d*")), or a one or more quantifier: regex_match(input, regex("\\d+")). But both of those fall short if input contains a numeric string that is too short, like "123". Let's use an n or more quantifier to ensure that we're getting at least 7 digits:

regex_match(input, regex("\\d{7,}"))

This guarantees that we get at least a phone number's worth of digits, but input could also contain a numeric string that's too long, like "123456789012". So let's go with a between n and m quantifier so the input is at least 7 digits but not more than 11:

regex_match(input, regex("\\d{7,11}"));

This gets us closer, but illegal numeric strings that are in the range of [7, 11] are still accepted, like "123456789". So let's make the country code optional with a lazy quantifier:

regex_match(input, regex("\\d??\\d{7,10}"))

It's important to note that the lazy quantifier matches as few characters as possible, so the only way the leading character will be matched by \d?? is if \d{7,10} has already matched 10 characters and one more is still needed. (The greedy forms \d? and \d{0,1} would try to match the first character first.) A lazy quantifier is formed by appending ? to any other quantifier.
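
Greedy and lazy quantifiers are easiest to compare through a capture group. The sketch below uses a hypothetical helper, first_capture, that returns capture group 1 of the first match found by regex_search:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 1 of the first match, or "" if nothing matches.
std::string first_capture(const std::string& text, const std::string& pattern) {
    const std::regex re{ pattern };
    std::smatch sm;
    return std::regex_search(text, sm, re) ? sm[1].str() : std::string{};
}
```

For the input "<b>bold</b>", the greedy pattern "<(.+)>" captures "b>bold</b", while the lazy pattern "<(.+?)>" captures only "b". Likewise, on "12345678" the pattern "(\\d?)\\d{7}" captures "1", whereas "(\\d??)\\d{7}" captures an empty string because the lazy group is skipped as soon as the rest of the regex can match without it.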

Now, how would we make the area code optional and only accept a country code if the area code was present?

regex_match(input, regex("(?:\\d{3,4})?\\d{7}"))

In this final regex, the \d{7} requires 7 digits. These 7 digits are optionally preceded by either 3 or 4 digits.

Note that we did not append the lazy quantifier to get \d{3,4}?\d{7}. The \d{3,4}? would have matched either 3 or 4 characters, preferring 3, which would cause a mismatch if input didn't include the area code, like "1234567". Instead we're making the non-capturing group as a whole match zero or one time.
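
Wrapped in a function, the final regex can be checked against a few inputs. The name is_phone_number is a hypothetical helper used only for this sketch:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: 7 digits, optionally preceded by a group of 3 or 4 digits.
bool is_phone_number(const std::string& input) {
    static const std::regex re{ "(?:\\d{3,4})?\\d{7}" };
    return std::regex_match(input, re);
}
```

This accepts inputs of exactly 7, 10 or 11 digits, such as "1234567" or "1231234567", and rejects lengths in between, such as "123456789".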

To conclude the quantifier topic, I'd like to mention the other appending quantifier found in regex engines, the possessive quantifier, formed by appending + to a quantifier. Be aware that the ECMAScript grammar std::regex uses by default does not define possessive quantifiers, so the patterns in the rest of this section describe engines that support them (PCRE or Java, for example); std::regex is not expected to give them possessive semantics and may reject them with a std::regex_error. The possessive quantifier's only function is to assist the regex engine by telling it: greedily take these characters and don't ever give them up, even if that causes the regex to fail. This, for example, doesn't make much sense: regex_match(input, regex("\\d{3,4}+\\d{7}")), because an input like "1234567890" wouldn't be matched, as \d{3,4}+ will always match 4 characters even if matching 3 would have allowed the regex to succeed.
The possessive quantifier is best used when the quantified token limits the number of matchable characters. For example:

regex_match(input, regex("(?:.*\\d{3,4}+){3}"))

This can be used to match an input containing any of the following:

123 456 7890
123-456-7890
(123)456-7890
(123) 456 - 7890

But where this regex really shines is when input is illegal:

12345 - 67890

Without the possessive quantifier, the regex engine has to go back and test every combination of .* and either 3 or 4 characters to see if it can find a matchable combination. With the possessive quantifier, the regex starts where the second possessive quantifier left off, the '0' character, and the regex engine tries to adjust the .* to allow \d{3,4} to match; when it can't, the regex just fails, and no backtracking is done to see if an earlier .* adjustment could have allowed a match.

Anchors

C++ provides only 4 anchors:

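^ which asserts the start of the string
$ which asserts the end of the string
\b which asserts a word boundary: a position between a \w character and a \W character, or between a \w character and the beginning or end of the string
\B which asserts any position that is not a word boundary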

Let's say, for example, that we want to capture a number with its sign:

auto input = "+1--12*123/+1234"s;
smatch sm;

if(regex_search(input, sm, regex{ "(?:^|\\b\\W)([+-]?\\d+)" })) {

    do {
        cout << sm[1] << endl;
        input = sm.suffix().str();
    } while(regex_search(input, sm, regex{ "(?:^\\W|\\b\\W)([+-]?\\d+)" }));
}

An important note here is that the anchor does not consume any characters.
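
Because \b consumes nothing, it can be placed on both sides of a word to find whole-word occurrences only. A minimal sketch, assuming a hypothetical helper count_whole_word whose word argument contains no regex metacharacters:

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts occurrences of word that are not part of a longer word.
// Assumes word contains no regex metacharacters.
std::size_t count_whole_word(const std::string& text, const std::string& word) {
    const std::regex re{ "\\b" + word + "\\b" };  // named L-value for the iterators
    return static_cast<std::size_t>(std::distance(
        std::sregex_iterator{ text.cbegin(), text.cend(), re }, std::sregex_iterator{}));
}
```

In "cat concat cat's category" the word "cat" is counted twice: the occurrences inside "concat" and "category" have a \w character on one side, so one of the \b assertions fails there.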

Syntax:

regex_match // Returns whether the entire character sequence was matched by the regex, optionally capturing into a match object
regex_search // Returns whether a portion of the character sequence was matched by the regex, optionally capturing into a match object
regex_replace // Returns the input character sequence as modified by a regex via a replacement format string
regex_token_iterator // Initialized with a character sequence defined by iterators, a list of capture indexes to iterate over, and a regex. Dereferencing returns the currently indexed capture of the current regex match. Incrementing moves to the next capture index or, if currently at the last index, resets the index and finds the next occurrence of a regex match in the character sequence
regex_iterator // Initialized with a character sequence defined by iterators and a regex. Dereferencing returns a match_results object for the portion of the character sequence the regex currently matches. Incrementing finds the next occurrence of a regex match in the character sequence

Parameters:

Signature: bool regex_match(BidirectionalIterator first, BidirectionalIterator last, smatch& sm, const regex& re, regex_constants::match_flag_type flags)

Description:
BidirectionalIterator is any character iterator that provides increment and decrement operators
smatch may be cmatch or any other variant of match_results that accepts the type of BidirectionalIterator
the smatch argument may be omitted if the results of the regex are not needed
Returns whether re matches the entire character sequence defined by first and last

Signature: bool regex_match(const string& str, smatch& sm, const regex& re, regex_constants::match_flag_type flags)

Description:
string may be either a const char* or an L-value string; the functions accepting an R-value string are explicitly deleted
smatch may be cmatch or any other variant of match_results that accepts the type of str
the smatch argument may be omitted if the results of the regex are not needed
Returns whether re matches the entire character sequence defined by str
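
A minimal sketch of how these overloads might be called; the helper names range_is_number and year_of, and the date format, are assumptions made for illustration:

```cpp
#include <regex>
#include <string>

// Iterator-pair overload: tests only the sub-range [first, last) of a string.
// The match object is omitted because only the yes/no answer is needed.
bool range_is_number(std::string::const_iterator first, std::string::const_iterator last) {
    static const std::regex re{ "\\d+" };
    return std::regex_match(first, last, re);
}

// const char* overload paired with a cmatch; returns the first capture group,
// or an empty string if the text is not a YYYY-MM-DD date.
std::string year_of(const char* date) {
    static const std::regex re{ "(\\d{4})-\\d{2}-\\d{2}" };
    std::cmatch cm;
    return std::regex_match(date, cm, re) ? cm[1].str() : std::string{};
}
```

The match_results type has to agree with the character source: cmatch for const char*, smatch for std::string and its iterators.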
