Skip to content

Utf16.IsValid and Utf8/16.IndexOfInvalidSubsequence#120326

Open
lilinus wants to merge 13 commits intodotnet:mainfrom
lilinus:utf8-utf16-validation
Open

Utf16.IsValid and Utf8/16.IndexOfInvalidSubsequence#120326
lilinus wants to merge 13 commits intodotnet:mainfrom
lilinus:utf8-utf16-validation

Conversation

@lilinus
Copy link
Contributor

@lilinus lilinus commented Oct 2, 2025

Fixes #118018
Fixes #118113

Copilot AI review requested due to automatic review settings October 2, 2025 14:24
@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 2, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Oct 2, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds new UTF-16 validation APIs to System.Text.Unicode, introducing Utf16.IsValid and Utf16.IndexOfInvalidSubsequence methods alongside similar functionality for UTF-8. It also refactors existing code to use these new standardized validation APIs instead of custom validation logic.

  • Adds new Utf16 class with validation methods IsValid and IndexOfInvalidSubsequence
  • Adds Utf8.IndexOfInvalidSubsequence method to complement existing Utf8.IsValid
  • Refactors multiple codebases to use the new standardized validation APIs

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
System.Private.CoreLib.Shared.projitems Adds the new Utf16.cs file to the build
Utf16.cs New class implementing UTF-16 validation methods
Utf8.cs Adds IndexOfInvalidSubsequence method and updates class documentation
System.Runtime.cs Adds public API surface for new UTF-16 and UTF-8 validation methods
HttpEncoder.cs Refactors surrogate validation to use new Utf16.IndexOfInvalidSubsequence
Normalization.Icu.cs Replaces custom validation logic with Utf16.IsValid
StringSearchValues.cs Simplifies surrogate validation using Utf16.IsValid
Utf8UtilityTests.ValidateBytes.cs Adds test coverage for new Utf8.IndexOfInvalidSubsequence method
Utf16UtilityTests.ValidateChars.cs Adds test coverage for new Utf16 validation methods

@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

@jeffhandley jeffhandley removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jan 14, 2026
Copilot AI review requested due to automatic review settings February 1, 2026 18:58
@jeffhandley
Copy link
Member

@lilinus Thanks for your patience on this; we are planning to review it this month. I've updated the branch and CI will re-run, but there was an ApiCompat error in the build previously indicating Utf8.IndexOfInvalidSubsequence wasn't available in the net8.0 target and it wasn't obvious to me why that was the case.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Update XML doc comments and remove unneeded using statements.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 1, 2026 19:32
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

…/Text/Unicode/Utf8UtilityTests.ValidateBytes.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 4, 2026 11:26
Remove unnecessary blank line at the beginning of Utf16.cs
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

}
}

private static bool ContainsIncompleteSurrogatePairs(ReadOnlySpan<string> values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend renaming this private method to ContainsInvalidValues to since the implementation no longer directly checks the surrogate pairs.

Comment on lines +27 to +36
public static unsafe int IndexOfInvalidSubsequence(ReadOnlySpan<char> value)
{
fixed (char* pValue = &MemoryMarshal.GetReference(value))
{
char* pFirstInvalidChar = Utf16Utility.GetPointerToFirstInvalidChar(pValue, value.Length, out _, out _);
int index = (int)(pFirstInvalidChar - pValue);

return (index < value.Length) ? index : -1;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jozkee We're trying to minimize our unsafe code additions, but I suspect this one is appropriate. Please assess this yourself and consult with @tannergooding and @PranavSenthilnathan if you agree it seems to be an appropriate use case.

If we do conclude this is an appropriate unsafe addition, please add a comment or a remarks section justifying it.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 16, 2026 04:29
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment on lines 823 to 826
/// </summary>
/// <param name="value">The <see cref="ReadOnlySpan{T}"/> string.</param>
/// <returns><c>true</c> if value is well-formed UTF-8, <c>false</c> otherwise.</returns>
public static bool IsValid(ReadOnlySpan<byte> value) =>
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The XML docs for Utf8.IsValid still describe the input as a "ReadOnlySpan string", but the parameter is a UTF-8 byte span. Please update the text to describe a UTF-8 byte sequence (similar wording to the new IndexOfInvalidSubsequence docs) so the public API docs are accurate.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-System.Runtime community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[API Proposal]: Indexed version of Utf8.IsValid [API Proposal]: UTF-16 version of Utf8.IsValid

5 participants