My summary notes on how to use regular expressions in C#.
!!! info Transplanted Post
After setting up this website, I am gradually moving all my previous blog posts, stored locally and in other blog systems, over here. This is one of them, so be aware that the date and time displayed on this post are not accurate.
!!! warning “Expected Reader Experience: Intermediate”
This post is good for readers with a reasonable mastery of the topic: Regular Expression. If you are uncertain, try this resource for a beginner’s guide.
Knowledge of the following is also especially helpful: _C# Basic Syntax_.
Regex Basics Cheat Sheet
A quick look-through of my annotated Regex Cheat Sheet for review. This resource is from Microsoft.
Some Confusing Topics
Lazy and Greedy Matching
Quantifiers have greedy and lazy variants. A greedy quantifier (the default) matches as much as possible; a lazy quantifier, written by appending `?` to the quantifier, matches as little as possible.
Greedy Matching:
Regex | String | Result |
---|---|---|
`25+` | 255555666 | 255555 |
`.*at` | The fat cat sat on the mat | The fat cat sat on the mat |
Lazy Matching:
Regex | String | Result |
---|---|---|
`25+?` | 255555666 | 25 |
`.*?at` | The fat cat sat on the mat | The fat |
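The difference can be checked with a minimal sketch. It uses Python's `re` module rather than C#, on the assumption that greedy and lazy quantifiers behave the same way in both engines for these simple patterns; the variable names are only illustrative.

```python
import re

text = "The fat cat sat on the mat"

# Greedy: ".*" grabs as much as it can, then backs off to the LAST "at".
greedy = re.search(r".*at", text).group()
# Lazy: ".*?" grabs as little as it can, stopping at the FIRST "at".
lazy = re.search(r".*?at", text).group()

print(greedy)  # The fat cat sat on the mat
print(lazy)    # The fat

digits = "255555666"
greedy_digits = re.search(r"5+", digits).group()   # every consecutive 5
lazy_digits = re.search(r"5+?", digits).group()    # just one 5

print(greedy_digits)  # 55555
print(lazy_digits)    # 5
```

Note that `+` is used for the digit example because a pattern made only of `5*` or `5*?` is allowed to match the empty string at the very start of the input.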
Capturing and Non-capturing Group
`()` in a regular expression does not just create a logical unit for quantifiers to act on; it also creates a capturing group, whose matched text is returned with the match as a “sub-match”.
Regex | String | Result |
---|---|---|
`([a-z0-9_-]+)@([a-z]+)\.com` | 123abc@test.com | 123abc@test.com ; group 1: 123abc ; group 2: test |
You have access to the matched instance (in C#, via `Match.Value`) and all its groups (the `Match.Groups` property, which is a `GroupCollection` whose element 0 is the entire match) because you used `()` to capture and group them.
Capturing comes at a small performance cost. If you really do not need this grouping information, use a non-capturing group:
Regex | String | Result |
---|---|---|
`(?:[a-z0-9_-]+)@(?:[a-z]+)\.com` | 123abc@test.com | 123abc@test.com ; no sub-group info ( Match.Groups.Count==1 , only group 0, the entire match) |
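As a rough illustration of capturing versus non-capturing groups, here is a sketch in Python's `re` module. The patterns assume `+` quantifiers inside the groups so that the whole address can match. One difference from .NET: Python's `groups()` leaves out the entire match, whereas `Match.Groups` in C# always includes it at index 0.

```python
import re

email = "123abc@test.com"

# Capturing groups: each "( )" records the text it matched.
capturing = re.search(r"([a-z0-9_-]+)@([a-z]+)\.com", email)
whole_match = capturing.group(0)       # the entire match
captured = capturing.groups()          # the sub-matches, in order

print(whole_match)  # 123abc@test.com
print(captured)     # ('123abc', 'test')

# Non-capturing groups: "(?: )" groups without recording anything.
non_capturing = re.search(r"(?:[a-z0-9_-]+)@(?:[a-z]+)\.com", email)
uncaptured = non_capturing.groups()

print(non_capturing.group(0))  # 123abc@test.com
print(uncaptured)              # ()
```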
Lookarounds
Lookarounds are a special kind of non-capturing group. Sometimes you want a pattern that is preceded or followed by another pattern, but do not want the second pattern to be included in the returned match. In such cases, use a lookaround to exclude it not only from the match’s groups but also from the match itself.
Take positive lookahead (written `(?=...)`) as an example:
Regex | String | Result |
---|---|---|
`[0-9]+(?=%)` | It improves by 24%. | 24 |
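The text `24%` satisfies the pattern, but the `%` character, which sits inside the positive lookahead, is not included in the match result.
Other kinds of lookarounds (negative lookahead `(?!...)`, positive lookbehind `(?<=...)` and negative lookbehind `(?<!...)`) function similarly.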
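A small sketch of lookarounds, again in Python's `re` module, which uses the same lookaround syntax as .NET for these cases. The second and third input strings are made up for illustration.

```python
import re

sentence = "It improves by 24%."

# Positive lookahead: digits that are followed by "%"; the "%" is not consumed.
lookahead = re.search(r"[0-9]+(?=%)", sentence)
print(lookahead.group())  # 24
print(lookahead.span())   # (15, 17) -> the "%" at index 17 is outside the match

# Positive lookbehind: digits that are preceded by "$".
lookbehind = re.search(r"(?<=\$)[0-9]+", "Costs $30")
print(lookbehind.group())  # 30

# Negative lookahead: whole numbers that are NOT followed by "%".
not_percent = re.findall(r"\b[0-9]+\b(?!%)", "24% of 50 users")
print(not_percent)  # ['50']
```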
Character Escape
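The characters `. $ ^ { [ ( | ) * + ? \` must be escaped with a backslash, or placed inside a positive character set `[]`, to be matched literally. Inside a character set, `\`, `]`, a leading `^` and a `-` between two characters still keep a special meaning.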
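The effect of escaping can be sketched as follows in Python. `re.escape` is Python's helper for escaping a whole literal string; .NET offers `Regex.Escape` for the same purpose. The sample strings are made up for illustration.

```python
import re

# An unescaped dot matches any character; an escaped dot matches only ".".
any_char = re.fullmatch(r"a.b", "axb") is not None        # True
escaped_dot = re.fullmatch(r"a\.b", "axb") is not None    # False
literal_dot = re.fullmatch(r"a\.b", "a.b") is not None    # True
# Putting the dot inside a character set also makes it literal.
set_dot = re.fullmatch(r"a[.]b", "a.b") is not None       # True

# Escaping a whole literal string that contains several metacharacters.
literal = "1+1=2 (approx.)"
found = re.search(re.escape(literal), "so 1+1=2 (approx.) holds") is not None

print(any_char, escaped_dot, literal_dot, set_dot, found)
# True False True True True
```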
Regex Flags
Regex flags control options for regex operations (e.g., whether to do a case-sensitive or a case-insensitive match). There are two ways to apply flags in C#, both of which are discussed below.
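For comparison, Python's `re` module has two analogous ways to apply flags: as an argument, or inline in the pattern. The sketch below assumes a made-up two-line input and is only meant to show what the flags change.

```python
import re

text = "Hello\nworld"

# Way 1: pass flags as an argument (combine several with "|").
by_argument = re.findall(r"[a-z]+", text, flags=re.IGNORECASE)
# Way 2: write the flag inline at the start of the pattern.
inline = re.findall(r"(?i)[a-z]+", text)
# Without the flag, the uppercase "H" is skipped.
no_flag = re.findall(r"[a-z]+", text)

print(by_argument)  # ['Hello', 'world']
print(inline)       # ['Hello', 'world']
print(no_flag)      # ['ello', 'world']

# Multiline mode lets "^" and "$" match at every line boundary.
per_line = re.findall(r"^\w+$", text, flags=re.MULTILINE)
whole_string = re.findall(r"^\w+$", text)

print(per_line)      # ['Hello', 'world']
print(whole_string)  # []
```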
Regex in Python
```python
import re

re.search(pattern, string)     # returns a match object representing the first occurrence of pattern within string
re.sub(pattern, repl, string)  # substitutes all matches of pattern within string with repl
re.fullmatch(pattern, string)  # returns a match object, requiring that pattern matches the entirety of string
re.match(pattern, string)      # returns a match object, requiring that string starts with a substring that matches pattern
re.findall(pattern, string)    # returns a list of strings representing all matches of pattern within string, from left to right
```
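A runnable sketch of the five functions above, applied to one made-up input string so the differences are easy to compare:

```python
import re

text = "cat bat rat"
pattern = r"[a-z]at"

first = re.search(pattern, text)         # first occurrence anywhere in text
replaced = re.sub(pattern, "X", text)    # every occurrence replaced
full = re.fullmatch(pattern, text)       # None: pattern does not cover the whole text
prefix = re.match(pattern, text)         # match anchored at the start of text
all_matches = re.findall(pattern, text)  # every occurrence, left to right

print(first.group())   # cat
print(replaced)        # X X X
print(full)            # None
print(prefix.group())  # cat
print(all_matches)     # ['cat', 'bat', 'rat']
```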
Regex in C#
!!! note Dependencies
Namespace: System.Text.RegularExpressions
Assembly: `System.Text.RegularExpressions.dll`
Important Classes
Classes:
- `Regex` class: the central class in regular expression operations.
  - `Match()` and `Matches()` methods: return the first match / all matches.
  - `IsMatch()` method: returns `true` if there is at least one match and `false` otherwise.
  - `Split()` method: splits the input string, using the matches of the regular expression as delimiters.
  - `Replace()` method: replaces occurrences of a pattern, specified in regular expression syntax, with another string.
- `Match` class: a wrapper class, the return type of the `Match()` method.
  - `Success` property: `bool`, whether the match was successful.
  - `Value` property: `string`, the matched string.
  - `Groups` property: `GroupCollection`, the captured groups.
  - `Length` property: `int`, the length of the matched string.
  - `NextMatch()` method: gets the next match, starting from where the current match ended.
- `MatchCollection` class: a wrapper class, a collection of `Match` objects, returned by the `Matches()` method.
  - `Count` property: `int`, the number of matches in the collection.
  - `Item[int index]`: gets the individual `Match` object at the specified index.
  - Usually we use `foreach` to go through every `Match` object in the collection.
- `Group` class:
  - `Success` property: `bool`, whether the group matched successfully.
  - `Captures` property: `CaptureCollection`, a collection of all the captures matched by the capturing group.
  - The `GroupCollection` class holds a collection of `Group` objects.
- `Capture` class:
  - `Value` property: `string`, the captured substring.
  - The `CaptureCollection` class holds a collection of `Capture` objects.
Enums:
- `RegexOptions`: a flags enum storing regex options. Commonly used flags are listed below:
  - `IgnoreCase` (1): uses case-insensitive matching instead of the default case-sensitive matching.
  - `Multiline` (2): uses multiline mode, which changes the meaning of `^` and `$` so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string as by default.
  - `ExplicitCapture` (4): specifies that the only valid captures are explicitly named or numbered groups of the form `(?<name>...)`.
  - `Singleline` (16): uses single-line mode, which changes the meaning of the dot (`.`) so it matches every character, instead of every character except `\n` as by default.
  - `IgnorePatternWhitespace` (32): eliminates unescaped white space from the pattern and enables comments marked with `#`.
  - `RightToLeft` (64): searches from right to left instead of the default left to right.
Use Regex Class
!!! warning Untested Codes
This code snippet has not been tested yet and is thus for illustration of concepts only.
```csharp
using System;
using System.Text.RegularExpressions;

// Test string
string str1 = "123_abc|ABC!789@a2K";

// A Regex object stores a regular expression pattern and regex options.
// If options are not specified, default options are used.
// Remember, RegexOptions is a flags enum, so options are combined with "|".
Regex reg = new Regex(@"[a-z]+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
Match rlt = reg.Match(str1);
Console.WriteLine(rlt.Value); // abc

// Regex options can also be specified inline:
Regex reg2 = new Regex(@"(?i)[a-z]+"); // case-insensitive match
Regex reg3 = new Regex(@"(?-i)[a-z]+"); // case-sensitive match; the "-" turns the option off
Regex reg4 = new Regex(@"(?-i)[a-z]+(?i)[k-n]+"); // case-sensitive, then case-insensitive match (switch on and off)
Regex reg5 = new Regex(@"(?is-m:expression)"); // set multiple options in one go

// You can also specify the character position in the input string at which to start the search.
Match rlt2 = reg.Match(str1, 7);
Console.WriteLine(rlt2.Value); // ABC

// Get the next match
Match rlt3 = rlt2.NextMatch();
Console.WriteLine(rlt3.Value); // a

// Use Matches to get and print all matches
MatchCollection allRlt = reg.Matches(str1);
foreach (Match m in allRlt)
{
    Console.WriteLine(m.Value); // prints abc, ABC, a, K
}

// Split the string, using the pattern as delimiters
Regex splitReg = new Regex(@"[_|!@]", RegexOptions.ExplicitCapture);
string[] strs = splitReg.Split(str1);
foreach (string s in strs)
{
    Console.WriteLine(s); // prints 123, abc, ABC, 789, a2K
}

// Replace every run of digits with the letter Z
Regex digitReg = new Regex(@"[0-9]+");
string newStr = digitReg.Replace(str1, "Z");
Console.WriteLine(newStr); // Z_abc|ABC!Z@aZK
```
The `Regex` class also exposes static versions of these methods. In such cases, pass the regular expression pattern and the regex options as parameters to the method, such as `Match()`.
```csharp
Match rlt_alt = Regex.Match(str1, @"[a-z]+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
Console.WriteLine(rlt_alt.Value); // abc
```
Using Groups
!!! warning Not My Codes
This code snippet is not my work. It is modified from Microsoft’s Official Documentation.
```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"(\b(\w+?)[,:;]?\s?)+[?.!]";
string input = "This is one sentence. This is a second sentence.";
Match match = Regex.Match(input, pattern);
if (!match.Success) return;
Console.WriteLine("Match: " + match.Value); // Match: This is one sentence.

// The counters below start at 1, so "Group 1" is match.Groups[0] (the entire match).
int groupCtr = 0;
foreach (Group group in match.Groups)
{
    groupCtr++;
    Console.WriteLine("Group {0}: '{1}'", groupCtr, group.Value);
    int captureCtr = 0;
    foreach (Capture capture in group.Captures)
    {
        captureCtr++;
        Console.WriteLine(" Capture {0}: '{1}'", captureCtr, capture.Value);
    }
}
// Prints:
// Group 1: 'This is one sentence.'
//  Capture 1: 'This is one sentence.'
// Group 2: 'sentence'
//  Capture 1: 'This '
//  Capture 2: 'is '
//  Capture 3: 'one '
//  Capture 4: 'sentence'
// Group 3: 'sentence'
//  Capture 1: 'This'
//  Capture 2: 'is'
//  Capture 3: 'one'
//  Capture 4: 'sentence'
```
When to Use Which?
- Use the `String` class when matching a specific string; use the `Regex` class when matching a pattern.
- Use the static `Regex` methods if the regular expression and the associated options will only be used once; use a `Regex` object if it needs to be used multiple times.
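The same rule of thumb can be sketched in Python, on the assumption that `re.compile` plays the role of constructing a `Regex` object and the module-level functions play the role of the static methods. The log lines are made up for illustration.

```python
import re

log_lines = ["id=42 ok", "id=7 fail", "no id here"]

# Specific string: plain string methods are enough, no regex needed.
ok_flags = [line.endswith("ok") for line in log_lines]

# Pattern reused many times: build the pattern object once and reuse it.
id_pattern = re.compile(r"id=(\d+)")
ids = []
for line in log_lines:
    found = id_pattern.search(line)
    if found:
        ids.append(found.group(1))

# Pattern used once: the module-level function is the shortest route.
has_failure = re.search(r"fail", log_lines[1]) is not None

print(ok_flags)     # [True, False, False]
print(ids)          # ['42', '7']
print(has_failure)  # True
```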
- Post link: https://reimirno.github.io/2021/08/10/Regular-Expression-and-Its-Implementation-in-NET/
- Copyright Notice: All articles in this blog are licensed under unless otherwise stated.