Remove positive lookarounds that wrap only zero-width assertions #118091
A positive lookaround effectively makes its contents zero-width. If the contents are already zero-width, the lookaround adds no value.
Pull Request Overview
This PR implements an optimization to remove redundant positive lookarounds that wrap only zero-width assertions. When a positive lookahead contains only zero-width assertions (like anchors or word boundaries), the lookahead wrapper serves no purpose since the contained assertions are already zero-width and don't consume input.
Key changes:
- Adds logic to detect and remove unnecessary positive lookaround wrappers around zero-width assertions
- Updates test expectations to reflect the new optimization behavior
- Refines anchor detection logic to properly handle boundary assertions within lookarounds
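The equivalence this optimization relies on can be checked from user code. The following is a minimal sketch, not taken from the PR: the class name `LookaroundDemo`, the helper `SameMatches`, and the sample inputs are illustrative assumptions. It compares a pattern whose trailing `\b` is wrapped in a positive lookahead against the same pattern with the bare `\b`.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative only: LookaroundDemo and SameMatches are hypothetical names, not part of the PR.
public static class LookaroundDemo
{
    // The same pattern twice: once with the boundary wrapped in a lookahead, once bare.
    private static readonly Regex Wrapped = new Regex(@"\d+\s*(k|M|T|G)(?=\b)");
    private static readonly Regex Bare = new Regex(@"\d+\s*(k|M|T|G)\b");

    // Returns true when both patterns yield the same sequence of (index, value) matches.
    public static bool SameMatches(string input)
    {
        var wrapped = Wrapped.Matches(input).Cast<Match>().Select(m => (m.Index, m.Value));
        var bare = Bare.Matches(input).Cast<Match>().Select(m => (m.Index, m.Value));
        return wrapped.SequenceEqual(bare);
    }

    public static void Main()
    {
        foreach (string input in new[] { "12 kB", "3M.", "7 G", "45T", "x9k y" })
        {
            Console.WriteLine($"{input}: {(SameMatches(input) ? "same" : "different")}");
        }
    }
}
```

Because `\b` never consumes input, both patterns should report identical matches for every input. The only difference the PR targets is the extra bookkeeping the lookahead wrapper adds to the generated code.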
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| RegexNode.cs | Implements the core optimization logic to detect zero-width assertions and remove redundant positive lookaround wrappers |
| RegexPrefixAnalyzer.cs | Updates anchor detection to properly handle boundary assertions within lookarounds by moving boundary checks to the zero-width assertion skip logic |
| RegexReductionTests.cs | Adds comprehensive test cases covering various scenarios of lookaround reduction with zero-width assertions |
| RegexFindOptimizationsTests.cs | Updates test expectation to reflect improved anchor detection after lookaround removal |
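To make the idea behind the `RegexNode.cs` change concrete, here is a toy sketch of the reduction on a simplified tree. It is not the actual implementation: `ToyKind`, `ToyNode`, and `ToyReducer` are hypothetical stand-ins, and the real `RegexNode` has many more node kinds and conditions than this sketch models.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified node kinds; the real RegexNode has many more.
public enum ToyKind { Boundary, Beginning, End, Char, Concatenate, PositiveLookaround, NegativeLookaround }

public sealed class ToyNode
{
    public ToyKind Kind { get; }
    public IReadOnlyList<ToyNode> Children { get; }

    public ToyNode(ToyKind kind, params ToyNode[] children)
    {
        Kind = kind;
        Children = children;
    }
}

public static class ToyReducer
{
    // True when the node can never consume input.
    public static bool IsZeroWidth(ToyNode node) => node.Kind switch
    {
        ToyKind.Boundary or ToyKind.Beginning or ToyKind.End => true,
        ToyKind.PositiveLookaround or ToyKind.NegativeLookaround => true,
        ToyKind.Concatenate => node.Children.All(IsZeroWidth),
        _ => false,
    };

    // Bottom-up rewrite: a positive lookaround whose single child is zero-width is replaced by that child.
    // Negative lookarounds are left alone because they invert the result of their contents.
    public static ToyNode Reduce(ToyNode node)
    {
        ToyNode[] children = node.Children.Select(Reduce).ToArray();
        if (node.Kind == ToyKind.PositiveLookaround && children.Length == 1 && IsZeroWidth(children[0]))
        {
            return children[0];
        }
        return new ToyNode(node.Kind, children);
    }

    public static void Main()
    {
        var overBoundary = new ToyNode(ToyKind.PositiveLookaround, new ToyNode(ToyKind.Boundary));
        var overChar = new ToyNode(ToyKind.PositiveLookaround, new ToyNode(ToyKind.Char));
        Console.WriteLine(Reduce(overBoundary).Kind); // wrapper removed
        Console.WriteLine(Reduce(overChar).Kind);     // wrapper kept
    }
}
```

In this sketch a lookaround over a consuming node such as `Char` is kept, since removing it would change how much input the match consumes.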
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
@MihuBot regexdiff
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
173 out of 18857 patterns have generated source code changes.
Examples of GeneratedRegex source diffs
"(((?<=\\W|^)-\\s*)|(?<=\\b))\\d+\\s*(B|b|m|t ..." (540 uses)
[GeneratedRegex("(((?<=\\W|^)-\\s*)|(?<=\\b))\\d+\\s*(B|b|m|t|g)(?=\\b)", RegexOptions.Singleline)]
/// ○ Match a whitespace character atomically any number of times.<br/>
/// ○ 3rd capture group.<br/>
/// ○ Match a character in the set [Bbgmt].<br/>
- /// ○ Zero-width positive lookahead.<br/>
- /// ○ Match if at a word boundary.<br/>
+ /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
base.Capture(3, capture_starting_pos2, pos);
}
- // Zero-width positive lookahead.
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
{
- slice = inputSpan.Slice(pos);
- int positivelookahead_starting_pos = pos;
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- goto CaptureBacktrack;
- }
-
- pos = positivelookahead_starting_pos;
- slice = inputSpan.Slice(pos);
+ goto CaptureBacktrack;
}
// The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))( ..." (338 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))(\\d+(,\\d+)?)\\^([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.IgnoreCase | RegexOptions.Singleline)]
/// ○ Match a character in the set [+\-] atomically any number of times.<br/>
/// ○ Match a character in the set [1-9].<br/>
/// ○ Match a Unicode digit greedily any number of times.<br/>
- /// ○ Zero-width positive lookahead.<br/>
- /// ○ Match if at a word boundary.<br/>
+ /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
CaptureSkipBacktrack2:;
//}
- // Zero-width positive lookahead.
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
{
- slice = inputSpan.Slice(pos);
- int positivelookahead_starting_pos = pos;
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- goto CaptureBacktrack2;
- }
-
- pos = positivelookahead_starting_pos;
- slice = inputSpan.Slice(pos);
+ goto CaptureBacktrack2;
}
// The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))( ..." (338 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))(\\d+(,\\d+)?)e([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.IgnoreCase | RegexOptions.Singleline)]
/// ○ Match a character in the set [+\-] atomically any number of times.<br/>
/// ○ Match a character in the set [1-9].<br/>
/// ○ Match a Unicode digit greedily any number of times.<br/>
- /// ○ Zero-width positive lookahead.<br/>
- /// ○ Match if at a word boundary.<br/>
+ /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
CaptureSkipBacktrack2:;
//}
- // Zero-width positive lookahead.
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
{
- slice = inputSpan.Slice(pos);
- int positivelookahead_starting_pos = pos;
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- goto CaptureBacktrack2;
- }
-
- pos = positivelookahead_starting_pos;
- slice = inputSpan.Slice(pos);
+ goto CaptureBacktrack2;
}
// The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+\\,)) ..." (326 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+\\,)))\\d+,\\d+\\s*(K|k|M|G|T)(?=\\b)", RegexOptions.Singleline)]
/// ○ Match a whitespace character atomically any number of times.<br/>
/// ○ 4th capture group.<br/>
/// ○ Match a character in the set [GKMTk].<br/>
- /// ○ Zero-width positive lookahead.<br/>
- /// ○ Match if at a word boundary.<br/>
+ /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
base.Capture(4, capture_starting_pos3, pos);
}
- // Zero-width positive lookahead.
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
{
- slice = inputSpan.Slice(pos);
- int positivelookahead_starting_pos = pos;
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- goto CaptureBacktrack;
- }
-
- pos = positivelookahead_starting_pos;
- slice = inputSpan.Slice(pos);
+ goto CaptureBacktrack;
}
// The input matched.
"(((?<=\\W|^)-\\s*)|(?<=\\b))\\d+\\s*(k|M|T|G ..." (325 uses)
[GeneratedRegex("(((?<=\\W|^)-\\s*)|(?<=\\b))\\d+\\s*(k|M|T|G)(?=\\b)", RegexOptions.Singleline)]
/// ○ Match a whitespace character atomically any number of times.<br/>
/// ○ 3rd capture group.<br/>
/// ○ Match a character in the set [GMTk].<br/>
- /// ○ Zero-width positive lookahead.<br/>
- /// ○ Match if at a word boundary.<br/>
+ /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
base.Capture(3, capture_starting_pos2, pos);
}
- // Zero-width positive lookahead.
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
{
- slice = inputSpan.Slice(pos);
- int positivelookahead_starting_pos = pos;
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- goto CaptureBacktrack;
- }
-
- pos = positivelookahead_starting_pos;
- slice = inputSpan.Slice(pos);
+ goto CaptureBacktrack;
}
// The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))( ..." (252 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))(\\d+(,\\d+)?)\\^([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]
/// ○ Match a character in the set [+\-] atomically any number of times.<br/>
/// ○ Match a character in the set [1-9].<br/>
/// ○ Match a Unicode digit greedily any number of times.<br/>
- /// ○ Zero-width positive lookahead.<br/>
- /// ○ Match if at a word boundary.<br/>
+ /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
CharLoopEnd2:
//}
- // Zero-width positive lookahead.
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
{
- slice = inputSpan.Slice(pos);
- int positivelookahead_starting_pos = pos;
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- goto CharLoopBacktrack2;
- }
-
- pos = positivelookahead_starting_pos;
- slice = inputSpan.Slice(pos);
+ goto CharLoopBacktrack2;
}
// The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))( ..." (252 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))(\\d+(,\\d+)?)e([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]
/// ○ Match a character in the set [+\-] atomically any number of times.<br/>
/// ○ Match a character in the set [1-9].<br/>
/// ○ Match a Unicode digit greedily any number of times.<br/>
- /// ○ Zero-width positive lookahead.<br/>
- /// ○ Match if at a word boundary.<br/>
+ /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
CharLoopEnd2:
//}
- // Zero-width positive lookahead.
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
{
- slice = inputSpan.Slice(pos);
- int positivelookahead_starting_pos = pos;
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- goto CharLoopBacktrack2;
- }
-
- pos = positivelookahead_starting_pos;
- slice = inputSpan.Slice(pos);
+ goto CharLoopBacktrack2;
}
// The input matched.
"(((?<!\\d+\\s*)-\\s*)|(?<=\\b))\\d+\\s*(K|k| ..." (215 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|(?<=\\b))\\d+\\s*(K|k|M|T|G)(?=\\b)", RegexOptions.Singleline)]
/// ○ Match a whitespace character atomically any number of times.<br/>
/// ○ 3rd capture group.<br/>
/// ○ Match a character in the set [GKMTk].<br/>
- /// ○ Zero-width positive lookahead.<br/>
- /// ○ Match if at a word boundary.<br/>
+ /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
base.Capture(3, capture_starting_pos2, pos);
}
- // Zero-width positive lookahead.
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
{
- slice = inputSpan.Slice(pos);
- int positivelookahead_starting_pos = pos;
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- goto CaptureBacktrack;
- }
-
- pos = positivelookahead_starting_pos;
- slice = inputSpan.Slice(pos);
+ goto CaptureBacktrack;
}
// The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+[\\., ..." (200 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+[\\.,])))(\\d+([\\.,]\\d+)?)\\^([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]
/// ○ Match a character in the set [+\-] atomically any number of times.<br/>
/// ○ Match a character in the set [1-9].<br/>
/// ○ Match a Unicode digit greedily any number of times.<br/>
- /// ○ Zero-width positive lookahead.<br/>
- /// ○ Match if at a word boundary.<br/>
+ /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
CharLoopEnd2:
//}
- // Zero-width positive lookahead.
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
{
- slice = inputSpan.Slice(pos);
- int positivelookahead_starting_pos = pos;
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- goto CharLoopBacktrack2;
- }
-
- pos = positivelookahead_starting_pos;
- slice = inputSpan.Slice(pos);
+ goto CharLoopBacktrack2;
}
// The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+[\\., ..." (200 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+[\\.,])))(\\d+([\\.,]\\d+)?)e([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]
/// ○ Match a character in the set [+\-] atomically any number of times.<br/>
/// ○ Match a character in the set [1-9].<br/>
/// ○ Match a Unicode digit greedily any number of times.<br/>
- /// ○ Zero-width positive lookahead.<br/>
- /// ○ Match if at a word boundary.<br/>
+ /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
CharLoopEnd2:
//}
- // Zero-width positive lookahead.
+ // Match if at a word boundary.
+ if (!Utilities.IsBoundary(inputSpan, pos))
{
- slice = inputSpan.Slice(pos);
- int positivelookahead_starting_pos = pos;
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos))
- {
- goto CharLoopBacktrack2;
- }
-
- pos = positivelookahead_starting_pos;
- slice = inputSpan.Slice(pos);
+ goto CharLoopBacktrack2;
}
// The input matched.
For more diff examples, see https://gist.github.com/MihuBot/262e88321889757cdd1ba8ed32fc4fd4
JIT assembly changes
For a list of JIT diff regressions, see Regressions.md
Sample source code for further analysis
// Namespaces the snippet relies on (some may already be covered by implicit usings).
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Text.RegularExpressions;

const string JsonPath = "RegexResults-1301.json";
// Download and extract the results archive once; later runs reuse the local file.
if (!File.Exists(JsonPath))
{
await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/E2myvU9A");
using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");
record KnownPattern(string Pattern, RegexOptions Options, int Count);
sealed class RegexEntry
{
public required KnownPattern Regex { get; set; }
public required string MainSource { get; set; }
public required string PrSource { get; set; }
public string? FullDiff { get; set; }
public string? ShortDiff { get; set; }
public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}