My summary notes on how to use regular expressions in C#.

!!! info Transplanted Post
After setting up this website, I am gradually moving my earlier blog posts, stored locally and on other blogging systems, over here. This is one of them, so be aware that the date and time displayed on this post are not accurate.

!!! warning “Expected Reader Experience: Intermediate”
This post suits readers with a reasonable grasp of the topic: Regular Expression. If you are uncertain, try this resource for a beginner’s guide.

Knowledge of the following is also especially helpful: _C# Basic Syntax_.

Regex Basics Cheat Sheet

A quick look-through of my annotated Regex Cheat Sheet for review. This resource is from Microsoft.

Some Confusing Topics

Lazy and Greedy Matching

Quantifiers come in greedy and lazy variants. A greedy quantifier (the default, e.g. `*` or `+`) matches as many characters as possible; a lazy quantifier (the same quantifier followed by `?`, e.g. `*?` or `+?`) matches as few as possible.

Greedy Matching:

| Regex | String | Result |
| --- | --- | --- |
| `25+` | 255555666 | 255555 |
| `.*at` | The fat cat sat on the mat | The fat cat sat on the mat |

Lazy Matching:

| Regex | String | Result |
| --- | --- | --- |
| `25+?` | 255555666 | 25 |
| `.*?at` | The fat cat sat on the mat | The fat |
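
As a quick check of the greedy/lazy difference, here is a minimal sketch in Python (the `re` module shows up later in these notes; the quantifiers used here behave the same way in .NET). The variable names are only illustrative.

```python
import re

text = "The fat cat sat on the mat"

# Greedy: .* grabs as much as it can, then backtracks to the LAST "at".
greedy = re.search(r".*at", text).group()

# Lazy: .*? grabs as little as it can, so it stops at the FIRST "at".
lazy = re.search(r".*?at", text).group()

digits = "255555666"
greedy_digits = re.search(r"25+", digits).group()   # one or more 5s, as many as possible
lazy_digits = re.search(r"25+?", digits).group()    # one or more 5s, as few as possible

print(greedy)                      # The fat cat sat on the mat
print(lazy)                        # The fat
print(greedy_digits, lazy_digits)  # 255555 25
```

Note that a lazy quantifier still has to satisfy its minimum count, which is why `5+?` keeps exactly one 5.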

Capturing and Non-capturing Group

Parentheses () in a regular expression do not just delimit the part that a quantifier acts on; they also create a capturing group, whose text is returned with each match as a “sub-match”.

| Regex | String | Result |
| --- | --- | --- |
| `([a-z0-9_-]+)@([a-z]+)\.com` | 123abc@test.com | 123abc@test.com (group 1: 123abc; group 2: test) |

You have access to the matched instance (in C#, via Match.Value) and all its groups (via the Match.Groups property, a GroupCollection whose item 0 is the whole match) because you used () to capture and group them.

Capturing has a small performance cost. If you really do not need this grouping information, use a non-capturing group, written (?:...):

| Regex | String | Result |
| --- | --- | --- |
| `(?:[a-z0-9_-]+)@(?:[a-z]+)\.com` | 123abc@test.com | 123abc@test.com (no sub-groups: Match.Groups.Count==1, holding only the whole match as group 0) |
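
The same contrast can be sketched in Python. This is an illustration only, with made-up variable names. One difference to keep in mind: Python's `groups()` lists only the sub-groups, whereas .NET's Match.Groups also holds the whole match at index 0.

```python
import re

text = "123abc@test.com"

# Capturing groups: each (...) is returned as a sub-match.
capturing = re.search(r"([a-z0-9_-]+)@([a-z]+)\.com", text)
whole = capturing.group(0)    # the entire match
user = capturing.group(1)     # first capturing group
domain = capturing.group(2)   # second capturing group

# Non-capturing groups: (?:...) groups for quantifiers but stores nothing.
non_capturing = re.search(r"(?:[a-z0-9_-]+)@(?:[a-z]+)\.com", text)
sub_matches = non_capturing.groups()

print(whole, user, domain)   # 123abc@test.com 123abc test
print(sub_matches)           # ()
```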

Lookarounds

Lookarounds are zero-width assertions that are written like groups but consume no characters. Sometimes you want a pattern only when it is preceded or followed by another pattern, without including that second pattern in the match returned. In such cases, use a lookaround: the text it checks is excluded not only from the match's groups but also from the match itself.

Take positive lookahead, written (?=...), as an example:

| Regex | String | Result |
| --- | --- | --- |
| `[0-9]+(?=%)` | It improves by 24%. | 24 |

The pattern 24% is found, but the % character, which is placed in a positive lookahead, is not included in the match result.

Other kinds of lookarounds function similarly: negative lookahead (?!...), positive lookbehind (?<=...), and negative lookbehind (?<!...).
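
A small Python sketch of three lookaround kinds follows. The sample sentence is made up for illustration. The negative-lookahead pattern uses `\b` on purpose: without it the engine could backtrack and match just the `2` of `24`, because `2` is followed by `4`, not `%`.

```python
import re

text = "It improves by 24%, costs $15, and takes 30 days."

# Positive lookahead: digits that are followed by "%" (the "%" is not consumed).
ahead = re.findall(r"[0-9]+(?=%)", text)

# Positive lookbehind: digits that are preceded by "$".
behind = re.findall(r"(?<=\$)[0-9]+", text)

# Negative lookahead: whole numbers that are NOT followed by "%".
not_percent = re.findall(r"\b[0-9]+\b(?!%)", text)

print(ahead)        # ['24']
print(behind)       # ['15']
print(not_percent)  # ['15', '30']
```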

Character Escape

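The characters . $ ^ { [ ( | ) * + ? \ have a special meaning. To match one literally, escape it with a backslash or, for most of them, place it in a positive character set []. Inside a set, \ and a leading ^ keep their special meaning and still need a backslash.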
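
To see the effect of escaping, here is a hedged Python sketch. `re.escape` is Python's helper for escaping a whole literal string (.NET offers `Regex.Escape` for the same purpose); the sample strings are invented for the example.

```python
import re

sample = "abc a.c"

any_char = re.findall(r"a.c", sample)     # unescaped dot: matches any character
escaped = re.findall(r"a\.c", sample)     # backslash-escaped dot: literal "."
in_set = re.findall(r"a[.]c", sample)     # dot inside a character set: literal "."

# Escaping a whole literal that is full of metacharacters.
literal = "1+1=2 (really?)"
sentence = "I said 1+1=2 (really?) yesterday"
raw_hit = re.search(literal, sentence)              # metacharacters change the meaning
safe_hit = re.search(re.escape(literal), sentence)  # matches the text literally

print(any_char)          # ['abc', 'a.c']
print(escaped, in_set)   # ['a.c'] ['a.c']
print(raw_hit)           # None
print(safe_hit.group())  # 1+1=2 (really?)
```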

Regex Flags

Regex flags control options for regex operations (e.g., whether to do a case-sensitive match or a case-insensitive one). There are two ways to apply flags in C#, both of which are discussed below.
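
The C# specifics come later in these notes. As a language-neutral illustration of the same two styles (an options argument versus an inline `(?i)` construct), here is a minimal Python sketch; the flag names are Python's own, not the .NET RegexOptions names.

```python
import re

text = "Hello\nhello"

default = re.findall(r"hello", text)                      # case-sensitive by default
by_argument = re.findall(r"hello", text, re.IGNORECASE)   # flag passed as an argument
inline = re.findall(r"(?i)hello", text)                   # same flag, written inline

# Flags combine with bitwise OR, much like a flagged enum.
combined = re.findall(r"^hello$", text, re.IGNORECASE | re.MULTILINE)

print(default)      # ['hello']
print(by_argument)  # ['Hello', 'hello']
print(inline)       # ['Hello', 'hello']
print(combined)     # ['Hello', 'hello']
```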

Regex in Python

import re
re.search(pattern, string)    # returns a match object representing the first occurrence of pattern within string
re.sub(pattern, repl, string) # substitutes all matches of pattern within string with repl
re.fullmatch(pattern, string) # returns a match object, requiring that pattern matches the entirety of string
re.match(pattern, string)     # returns a match object, requiring that string starts with a substring that matches pattern
re.findall(pattern, string)   # returns a list of strings representing all matches of pattern within string, from left to right
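
A short runnable sketch of the five functions above, using a made-up test string:

```python
import re

text = "cat bat rat"

first = re.search(r"[a-z]at", text)            # first occurrence anywhere in text
replaced = re.sub(r"[bcr]at", "dog", text)     # every match replaced
whole_fail = re.fullmatch(r"[a-z]at", text)    # None: pattern must cover ALL of text
whole_ok = re.fullmatch(r"[a-z]at", "cat")     # matches: the entire string fits
at_start = re.match(r"cat", text)              # matches: text starts with "cat"
not_at_start = re.match(r"bat", text)          # None: "bat" is not at the start
everything = re.findall(r"[a-z]at", text)      # all matches, left to right

print(first.group())        # cat
print(replaced)             # dog dog dog
print(whole_fail, whole_ok.group())
print(at_start.group(), not_at_start)
print(everything)           # ['cat', 'bat', 'rat']
```

One thing to watch: when the pattern contains capturing groups, `re.findall` returns the group contents rather than the whole matches.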

Regex in C#

!!! note Dependencies
Namespace: System.Text.RegularExpressions

Assembly: `System.Text.RegularExpressions.dll`

Important Classes

Classes:

  • Regex Class: the central class for regular expression operations.
    • Match() method and Matches() method: return the first match and all matches, respectively.
    • IsMatch() method: returns true if there is at least one match and false otherwise.
    • Split() method: splits the input string at the delimiters matched by the pattern.
    • Replace() method: replaces occurrences of a pattern, which is specified in regular expression syntax, with another string.
  • Match Class: represents a single match result; the return type of the Match() method.
    • Success property: bool, whether the match succeeded.
    • Value property: string, the matched string.
    • Groups property: GroupCollection, the captured groups.
    • Length property: int, the length of the matched string.
    • NextMatch() method: gets the next match, searching from where the current match ended.
  • MatchCollection Class: a collection of Match objects, returned by the Matches() method.
    • Count property: int, the count of matches in the collection.
    • Item[int index]: gets the individual Match object at the specified index.
    • Usually we use foreach to go through all Match objects in the collection.
  • Group Class: represents the result of a single capturing group.
    • Success property: bool, whether the group matched.
    • Captures property: CaptureCollection, a collection of all the captures matched by the capturing group.
    • GroupCollection Class has a collection of Group objects.
  • Capture Class: represents a single captured substring.
    • Value property: string, the captured substring.
    • CaptureCollection Class has a collection of Capture objects.

Enums:

  • RegexOptions: a flags enum storing regex options. Commonly used flags are listed below:
    • IgnoreCase(1): Use case-insensitive matching instead of the default case-sensitive matching.
    • Multiline(2): Use multiline mode, which changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string as by default.
    • ExplicitCapture(4): Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>...).
    • Singleline(16): Use single-line mode, which changes the meaning of the dot (.) so it matches every character, instead of every character except \n as by default.
    • IgnorePatternWhitespace(32): Eliminates unescaped white space from the pattern and enables comments marked with #.
    • RightToLeft(64): Search from right to left, instead of from left to right as by default.

Use Regex Class

!!! warning Untested Code
This code snippet has not been tested yet and is thus for illustration of concepts only.

using System;
using System.Text.RegularExpressions;

//Test string
string str1 = "123_abc|ABC!789@a2K";

//A Regex object stores a regular expression pattern and regex options.
//If options are not specified, default options are used.
//Remember, RegexOptions is a flags enum, so options combine with |
Regex reg = new Regex(@"[a-z]+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
Match rlt = reg.Match(str1);
Console.WriteLine(rlt.Value); //abc

//Regex options can also be specified inline:
Regex reg2 = new Regex(@"(?i)[a-z]+"); //case-insensitive match
Regex reg3 = new Regex(@"(?-i)[a-z]+"); //case-sensitive match; the "-" inverts the meaning
Regex reg4 = new Regex(@"(?-i)[a-z]+(?i)[k-n]+"); //case-sensitive, then case-insensitive match (switch off and on)
Regex reg5 = new Regex(@"(?is-m:expression)"); //set multiple options in one go, scoped to the group

//You can also specify the character position in the input string at which to start the search.
Match rlt2 = reg.Match(str1, 7);
Console.WriteLine(rlt2.Value); //ABC

//Get the next match
Match rlt3 = rlt2.NextMatch();
Console.WriteLine(rlt3.Value); //a

//Use Matches to get and print all matches
MatchCollection allRlt = reg.Matches(str1);
foreach (Match m in allRlt){
  Console.WriteLine(m.Value); //prints abc, ABC, a, K
}

//Split strings, using the pattern as delimiters
Regex splitReg = new Regex(@"[_|!@]", RegexOptions.ExplicitCapture);
string[] strs = splitReg.Split(str1);
foreach (string s in strs){
  Console.WriteLine(s); //prints 123, abc, ABC, 789, a2K
}

//Replace all numbers with letter Z
Regex numReg = new Regex(@"[0-9]+");
string newStr = numReg.Replace(str1, "Z");
Console.WriteLine(newStr); //Z_abc|ABC!Z@aZK

The Regex Class also exposes static versions of these methods. In such cases, pass the regular expression pattern and the regex options as parameters into the methods such as Match().

Match rlt_alt = Regex.Match(str1, @"[a-z]+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
Console.WriteLine(rlt_alt.Value); //abc

Using Groups

!!! warning Not My Code
This code snippet is not my work. It is modified from Microsoft’s Official Documentation.

string pattern = @"(\b(\w+?)[,:;]?\s?)+[?.!]";
string input = "This is one sentence. This is a second sentence.";

Match match = Regex.Match(input, pattern);
if(!match.Success) return;
Console.WriteLine("Match: " + match.Value); //Match: This is one sentence.
int groupCtr = 0;
foreach (Group group in match.Groups)
{
  groupCtr++;
  Console.WriteLine("Group {0}: '{1}'", groupCtr, group.Value);
  int captureCtr = 0;
  foreach (Capture capture in group.Captures)
  {
    captureCtr++;
    Console.WriteLine("   Capture {0}: '{1}'", captureCtr, capture.Value);
  }
}
//Prints (the counters start at 1, so "Group 1" is match.Groups[0], the whole match):
//Group 1: 'This is one sentence.'
//   Capture 1: 'This is one sentence.'
//Group 2: 'sentence'
//   Capture 1: 'This '
//   Capture 2: 'is '
//   Capture 3: 'one '
//   Capture 4: 'sentence'
//Group 3: 'sentence'
//   Capture 1: 'This'
//   Capture 2: 'is'
//   Capture 3: 'one'
//   Capture 4: 'sentence'

When to Use Which?

  • Use String Class when matching a specific string; use Regex Class when matching a pattern.
  • Use the static methods of the Regex Class if the regular expression and the associated options will only be used once; create a Regex object if it needs to be used multiple times.