@stephentoub
Member

A positive lookahead effectively changes its contents to be zero-width. If the contents are already zero-width, the lookaround adds no value.
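
To make the claim concrete, here is a minimal sketch that compares a word boundary wrapped in a positive lookahead against the bare boundary. It uses Python's `re` module purely as an illustrative stand-in; it is not the .NET engine this PR changes, and the sample inputs are made up for the example.

```python
import re

# Illustrative only: Python's `re` stands in for .NET here.
WRAPPED = re.compile(r"\d+(?=\b)")   # word boundary wrapped in a positive lookahead
UNWRAPPED = re.compile(r"\d+\b")     # the same boundary with the wrapper removed

def spans(regex, text):
    """Return the (start, end) span of every match of regex in text."""
    return [m.span() for m in regex.finditer(text)]

# Made-up sample inputs; both patterns should agree on every one of them.
samples = ["10 kB", "abc123def", "7", "", "3 m 44"]
for s in samples:
    assert spans(WRAPPED, s) == spans(UNWRAPPED, s)
```

Since `\b` never consumes input, wrapping it in `(?=...)` cannot change where a match starts or ends, which is why both patterns report identical spans.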

Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements an optimization to remove redundant positive lookarounds that wrap only zero-width assertions. When a positive lookahead contains only zero-width assertions (like anchors or word boundaries), the lookahead wrapper serves no purpose since the contained assertions are already zero-width and don't consume input.

Key changes:

  • Adds logic to detect and remove unnecessary positive lookaround wrappers around zero-width assertions
  • Updates test expectations to reflect the new optimization behavior
  • Refines anchor detection logic to properly handle boundary assertions within lookarounds
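
The first key change can be sketched on a toy syntax tree. This is a hypothetical illustration, not the code in RegexNode.cs: the `Node` class, the kind names, and `reduce_lookaround` are all invented for the example.

```python
from dataclasses import dataclass, field

# Hypothetical node kinds for this sketch; not the actual RegexNode kinds in .NET.
ZERO_WIDTH_KINDS = {"Bol", "Eol", "Beginning", "End", "Boundary", "NonBoundary"}

@dataclass
class Node:
    kind: str
    children: list = field(default_factory=list)

def is_zero_width_assertion(node):
    """True if the node is an anchor or boundary that never consumes input."""
    return node.kind in ZERO_WIDTH_KINDS

def reduce_lookaround(node):
    """Reduce children first, then unwrap a positive lookaround whose only child is zero-width."""
    children = [reduce_lookaround(c) for c in node.children]
    if (node.kind == "PositiveLookaround"
            and len(children) == 1
            and is_zero_width_assertion(children[0])):
        return children[0]          # the wrapper adds nothing, so keep only the assertion
    return Node(node.kind, children)

# (?=\b) reduces to \b, while (?=a) is left alone because `a` consumes input.
print(reduce_lookaround(Node("PositiveLookaround", [Node("Boundary")])))
print(reduce_lookaround(Node("PositiveLookaround", [Node("Char")])))
```

Reducing children before the parent means a nested wrapper such as `(?=(?=\b))` also collapses to the bare assertion in this sketch.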

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File: Description
  • RegexNode.cs: Implements the core optimization logic to detect zero-width assertions and remove redundant positive lookaround wrappers
  • RegexPrefixAnalyzer.cs: Updates anchor detection to properly handle boundary assertions within lookarounds by moving boundary checks to the zero-width assertion skip logic
  • RegexReductionTests.cs: Adds comprehensive test cases covering various scenarios of lookaround reduction with zero-width assertions
  • RegexFindOptimizationsTests.cs: Updates test expectation to reflect improved anchor detection after lookaround removal
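
The anchor-detection change described for RegexPrefixAnalyzer.cs can be pictured roughly as follows. This is a hypothetical sketch of the idea only: the flat list of kind names and `find_leading_anchor` are assumptions for illustration and do not mirror the analyzer's real data structures.

```python
# Hypothetical kind names; the real analyzer works on RegexNode trees, not flat lists.
ANCHORS = {"Beginning", "Bol", "Start"}
SKIPPABLE_ZERO_WIDTH = {"Boundary", "NonBoundary"}

def find_leading_anchor(concatenation):
    """Return the first anchor kind, skipping zero-width boundary checks in front of it."""
    for kind in concatenation:
        if kind in ANCHORS:
            return kind
        if kind in SKIPPABLE_ZERO_WIDTH:
            continue        # consumes no input, so a later anchor still applies at the start
        return None         # anything that can consume input ends the search
    return None

# Something shaped like \b^abc still reports a leading anchor once the boundary is skipped.
print(find_leading_anchor(["Boundary", "Bol", "Char"]))
```

Once a redundant lookaround wrapper is gone, the boundary sits directly in the sequence, so a search of this shape can step over it and still find the anchor behind it.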

@stephentoub
Member Author

@MihuBot regexdiff

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@MihuBot

MihuBot commented Jul 27, 2025

173 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs
"(((?<=\\W|^)-\\s*)|(?<=\\b))\\d+\\s*(B|b|m|t ..." (540 uses)
[GeneratedRegex("(((?<=\\W|^)-\\s*)|(?<=\\b))\\d+\\s*(B|b|m|t|g)(?=\\b)", RegexOptions.Singleline)]
  /// ○ Match a whitespace character atomically any number of times.<br/>
  /// ○ 3rd capture group.<br/>
  ///     ○ Match a character in the set [Bbgmt].<br/>
-   /// ○ Zero-width positive lookahead.<br/>
-   ///     ○ Match if at a word boundary.<br/>
+   /// ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                      base.Capture(3, capture_starting_pos2, pos);
                  }
                  
-                   // Zero-width positive lookahead.
+                   // Match if at a word boundary.
+                   if (!Utilities.IsBoundary(inputSpan, pos))
                  {
-                       slice = inputSpan.Slice(pos);
-                       int positivelookahead_starting_pos = pos;
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       // Match if at a word boundary.
-                       if (!Utilities.IsBoundary(inputSpan, pos))
-                       {
-                           goto CaptureBacktrack;
-                       }
-                       
-                       pos = positivelookahead_starting_pos;
-                       slice = inputSpan.Slice(pos);
+                       goto CaptureBacktrack;
                  }
                  
                  // The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))( ..." (338 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))(\\d+(,\\d+)?)\\^([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.IgnoreCase | RegexOptions.Singleline)]
  ///     ○ Match a character in the set [+\-] atomically any number of times.<br/>
  ///     ○ Match a character in the set [1-9].<br/>
  ///     ○ Match a Unicode digit greedily any number of times.<br/>
-   /// ○ Zero-width positive lookahead.<br/>
-   ///     ○ Match if at a word boundary.<br/>
+   /// ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                      CaptureSkipBacktrack2:;
                  //}
                  
-                   // Zero-width positive lookahead.
+                   // Match if at a word boundary.
+                   if (!Utilities.IsBoundary(inputSpan, pos))
                  {
-                       slice = inputSpan.Slice(pos);
-                       int positivelookahead_starting_pos = pos;
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       // Match if at a word boundary.
-                       if (!Utilities.IsBoundary(inputSpan, pos))
-                       {
-                           goto CaptureBacktrack2;
-                       }
-                       
-                       pos = positivelookahead_starting_pos;
-                       slice = inputSpan.Slice(pos);
+                       goto CaptureBacktrack2;
                  }
                  
                  // The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))( ..." (338 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))(\\d+(,\\d+)?)e([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.IgnoreCase | RegexOptions.Singleline)]
  ///     ○ Match a character in the set [+\-] atomically any number of times.<br/>
  ///     ○ Match a character in the set [1-9].<br/>
  ///     ○ Match a Unicode digit greedily any number of times.<br/>
-   /// ○ Zero-width positive lookahead.<br/>
-   ///     ○ Match if at a word boundary.<br/>
+   /// ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                      CaptureSkipBacktrack2:;
                  //}
                  
-                   // Zero-width positive lookahead.
+                   // Match if at a word boundary.
+                   if (!Utilities.IsBoundary(inputSpan, pos))
                  {
-                       slice = inputSpan.Slice(pos);
-                       int positivelookahead_starting_pos = pos;
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       // Match if at a word boundary.
-                       if (!Utilities.IsBoundary(inputSpan, pos))
-                       {
-                           goto CaptureBacktrack2;
-                       }
-                       
-                       pos = positivelookahead_starting_pos;
-                       slice = inputSpan.Slice(pos);
+                       goto CaptureBacktrack2;
                  }
                  
                  // The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+\\,)) ..." (326 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+\\,)))\\d+,\\d+\\s*(K|k|M|G|T)(?=\\b)", RegexOptions.Singleline)]
  /// ○ Match a whitespace character atomically any number of times.<br/>
  /// ○ 4th capture group.<br/>
  ///     ○ Match a character in the set [GKMTk].<br/>
-   /// ○ Zero-width positive lookahead.<br/>
-   ///     ○ Match if at a word boundary.<br/>
+   /// ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                      base.Capture(4, capture_starting_pos3, pos);
                  }
                  
-                   // Zero-width positive lookahead.
+                   // Match if at a word boundary.
+                   if (!Utilities.IsBoundary(inputSpan, pos))
                  {
-                       slice = inputSpan.Slice(pos);
-                       int positivelookahead_starting_pos = pos;
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       // Match if at a word boundary.
-                       if (!Utilities.IsBoundary(inputSpan, pos))
-                       {
-                           goto CaptureBacktrack;
-                       }
-                       
-                       pos = positivelookahead_starting_pos;
-                       slice = inputSpan.Slice(pos);
+                       goto CaptureBacktrack;
                  }
                  
                  // The input matched.
"(((?<=\\W|^)-\\s*)|(?<=\\b))\\d+\\s*(k|M|T|G ..." (325 uses)
[GeneratedRegex("(((?<=\\W|^)-\\s*)|(?<=\\b))\\d+\\s*(k|M|T|G)(?=\\b)", RegexOptions.Singleline)]
  /// ○ Match a whitespace character atomically any number of times.<br/>
  /// ○ 3rd capture group.<br/>
  ///     ○ Match a character in the set [GMTk].<br/>
-   /// ○ Zero-width positive lookahead.<br/>
-   ///     ○ Match if at a word boundary.<br/>
+   /// ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                      base.Capture(3, capture_starting_pos2, pos);
                  }
                  
-                   // Zero-width positive lookahead.
+                   // Match if at a word boundary.
+                   if (!Utilities.IsBoundary(inputSpan, pos))
                  {
-                       slice = inputSpan.Slice(pos);
-                       int positivelookahead_starting_pos = pos;
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       // Match if at a word boundary.
-                       if (!Utilities.IsBoundary(inputSpan, pos))
-                       {
-                           goto CaptureBacktrack;
-                       }
-                       
-                       pos = positivelookahead_starting_pos;
-                       slice = inputSpan.Slice(pos);
+                       goto CaptureBacktrack;
                  }
                  
                  // The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))( ..." (252 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))(\\d+(,\\d+)?)\\^([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]
  /// ○ Match a character in the set [+\-] atomically any number of times.<br/>
  /// ○ Match a character in the set [1-9].<br/>
  /// ○ Match a Unicode digit greedily any number of times.<br/>
-   /// ○ Zero-width positive lookahead.<br/>
-   ///     ○ Match if at a word boundary.<br/>
+   /// ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                      CharLoopEnd2:
                  //}
                  
-                   // Zero-width positive lookahead.
+                   // Match if at a word boundary.
+                   if (!Utilities.IsBoundary(inputSpan, pos))
                  {
-                       slice = inputSpan.Slice(pos);
-                       int positivelookahead_starting_pos = pos;
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       // Match if at a word boundary.
-                       if (!Utilities.IsBoundary(inputSpan, pos))
-                       {
-                           goto CharLoopBacktrack2;
-                       }
-                       
-                       pos = positivelookahead_starting_pos;
-                       slice = inputSpan.Slice(pos);
+                       goto CharLoopBacktrack2;
                  }
                  
                  // The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))( ..." (252 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+,)))(\\d+(,\\d+)?)e([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]
  /// ○ Match a character in the set [+\-] atomically any number of times.<br/>
  /// ○ Match a character in the set [1-9].<br/>
  /// ○ Match a Unicode digit greedily any number of times.<br/>
-   /// ○ Zero-width positive lookahead.<br/>
-   ///     ○ Match if at a word boundary.<br/>
+   /// ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                      CharLoopEnd2:
                  //}
                  
-                   // Zero-width positive lookahead.
+                   // Match if at a word boundary.
+                   if (!Utilities.IsBoundary(inputSpan, pos))
                  {
-                       slice = inputSpan.Slice(pos);
-                       int positivelookahead_starting_pos = pos;
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       // Match if at a word boundary.
-                       if (!Utilities.IsBoundary(inputSpan, pos))
-                       {
-                           goto CharLoopBacktrack2;
-                       }
-                       
-                       pos = positivelookahead_starting_pos;
-                       slice = inputSpan.Slice(pos);
+                       goto CharLoopBacktrack2;
                  }
                  
                  // The input matched.
"(((?<!\\d+\\s*)-\\s*)|(?<=\\b))\\d+\\s*(K|k| ..." (215 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|(?<=\\b))\\d+\\s*(K|k|M|T|G)(?=\\b)", RegexOptions.Singleline)]
  /// ○ Match a whitespace character atomically any number of times.<br/>
  /// ○ 3rd capture group.<br/>
  ///     ○ Match a character in the set [GKMTk].<br/>
-   /// ○ Zero-width positive lookahead.<br/>
-   ///     ○ Match if at a word boundary.<br/>
+   /// ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                      base.Capture(3, capture_starting_pos2, pos);
                  }
                  
-                   // Zero-width positive lookahead.
+                   // Match if at a word boundary.
+                   if (!Utilities.IsBoundary(inputSpan, pos))
                  {
-                       slice = inputSpan.Slice(pos);
-                       int positivelookahead_starting_pos = pos;
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       // Match if at a word boundary.
-                       if (!Utilities.IsBoundary(inputSpan, pos))
-                       {
-                           goto CaptureBacktrack;
-                       }
-                       
-                       pos = positivelookahead_starting_pos;
-                       slice = inputSpan.Slice(pos);
+                       goto CaptureBacktrack;
                  }
                  
                  // The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+[\\., ..." (200 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+[\\.,])))(\\d+([\\.,]\\d+)?)\\^([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]
  /// ○ Match a character in the set [+\-] atomically any number of times.<br/>
  /// ○ Match a character in the set [1-9].<br/>
  /// ○ Match a Unicode digit greedily any number of times.<br/>
-   /// ○ Zero-width positive lookahead.<br/>
-   ///     ○ Match if at a word boundary.<br/>
+   /// ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                      CharLoopEnd2:
                  //}
                  
-                   // Zero-width positive lookahead.
+                   // Match if at a word boundary.
+                   if (!Utilities.IsBoundary(inputSpan, pos))
                  {
-                       slice = inputSpan.Slice(pos);
-                       int positivelookahead_starting_pos = pos;
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       // Match if at a word boundary.
-                       if (!Utilities.IsBoundary(inputSpan, pos))
-                       {
-                           goto CharLoopBacktrack2;
-                       }
-                       
-                       pos = positivelookahead_starting_pos;
-                       slice = inputSpan.Slice(pos);
+                       goto CharLoopBacktrack2;
                  }
                  
                  // The input matched.
"(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+[\\., ..." (200 uses)
[GeneratedRegex("(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!\\d+[\\.,])))(\\d+([\\.,]\\d+)?)e([+-]*[1-9]\\d*)(?=\\b)", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]
  /// ○ Match a character in the set [+\-] atomically any number of times.<br/>
  /// ○ Match a character in the set [1-9].<br/>
  /// ○ Match a Unicode digit greedily any number of times.<br/>
-   /// ○ Zero-width positive lookahead.<br/>
-   ///     ○ Match if at a word boundary.<br/>
+   /// ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                      CharLoopEnd2:
                  //}
                  
-                   // Zero-width positive lookahead.
+                   // Match if at a word boundary.
+                   if (!Utilities.IsBoundary(inputSpan, pos))
                  {
-                       slice = inputSpan.Slice(pos);
-                       int positivelookahead_starting_pos = pos;
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       // Match if at a word boundary.
-                       if (!Utilities.IsBoundary(inputSpan, pos))
-                       {
-                           goto CharLoopBacktrack2;
-                       }
-                       
-                       pos = positivelookahead_starting_pos;
-                       slice = inputSpan.Slice(pos);
+                       goto CharLoopBacktrack2;
                  }
                  
                  // The input matched.

For more diff examples, see https://gist.github.com/MihuBot/262e88321889757cdd1ba8ed32fc4fd4
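
The shape of the generated-code change in the diffs above can be modeled in a few lines. This is an illustrative sketch, not the emitted C#: `is_word_char` is a simplified assumption that does not reproduce .NET's exact word-character rules, and both check functions are invented names that return `None` where the generated code would jump to its backtrack label.

```python
def is_word_char(ch):
    # Simplified assumption: letters, digits and underscore only.
    return ch.isalnum() or ch == "_"

def is_boundary(text, pos):
    """True when exactly one side of pos is a word character."""
    before = pos > 0 and is_word_char(text[pos - 1])
    after = pos < len(text) and is_word_char(text[pos])
    return before != after

def check_with_lookahead(text, pos):
    """Old shape: save the position, run the check, then restore the position."""
    starting_pos = pos
    if not is_boundary(text, pos):
        return None             # stands in for the goto to the backtrack label
    pos = starting_pos          # restoring is a no-op because the check consumed nothing
    return pos

def check_direct(text, pos):
    """New shape: just the boundary check."""
    if not is_boundary(text, pos):
        return None
    return pos

# Both shapes agree at every position of a made-up input.
text = "12 kB!"
assert all(check_with_lookahead(text, p) == check_direct(text, p)
           for p in range(len(text) + 1))
```

The save and restore around the check are the bookkeeping that the diffs delete, along with the timeout check that the lookahead scaffolding carried.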

JIT assembly changes
Total bytes of base: 54186495
Total bytes of diff: 54183740
Total bytes of delta: -2755 (-0.01 % of base)
Total relative delta: -0.28
    diff is an improvement.
    relative diff is an improvement.

For a list of JIT diff regressions, see Regressions.md
For a list of JIT diff improvements, see Improvements.md

Sample source code for further analysis
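// Namespaces assumed not to be covered by implicit usings (System, System.IO, System.Linq and System.Net.Http are assumed implicit).
using System.IO.Compression;
using System.Text.Json;
using System.Text.RegularExpressions;

const string JsonPath = "RegexResults-1301.json";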
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/E2myvU9A");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}

@stephentoub stephentoub merged commit 7a68903 into dotnet:main Jul 28, 2025
85 of 87 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 27, 2025
@stephentoub stephentoub deleted the removelookaround branch December 12, 2025 23:00
3 participants