Utf16.IsValid and Utf8/16.IndexOfInvalidSubsequence#120326
Utf16.IsValid and Utf8/16.IndexOfInvalidSubsequence#120326lilinus wants to merge 13 commits intodotnet:mainfrom
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR adds new UTF-16 validation APIs to System.Text.Unicode, introducing Utf16.IsValid and Utf16.IndexOfInvalidSubsequence methods alongside similar functionality for UTF-8. It also refactors existing code to use these new standardized validation APIs instead of custom validation logic.
- Adds new
Utf16class with validation methodsIsValidandIndexOfInvalidSubsequence - Adds
Utf8.IndexOfInvalidSubsequencemethod to complement existingUtf8.IsValid - Refactors multiple codebases to use the new standardized validation APIs
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| System.Private.CoreLib.Shared.projitems | Adds the new Utf16.cs file to the build |
| Utf16.cs | New class implementing UTF-16 validation methods |
| Utf8.cs | Adds IndexOfInvalidSubsequence method and updates class documentation |
| System.Runtime.cs | Adds public API surface for new UTF-16 and UTF-8 validation methods |
| HttpEncoder.cs | Refactors surrogate validation to use new Utf16.IndexOfInvalidSubsequence |
| Normalization.Icu.cs | Replaces custom validation logic with Utf16.IsValid |
| StringSearchValues.cs | Simplifies surrogate validation using Utf16.IsValid |
| Utf8UtilityTests.ValidateBytes.cs | Adds test coverage for new Utf8.IndexOfInvalidSubsequence method |
| Utf16UtilityTests.ValidateChars.cs | Adds test coverage for new Utf16 validation methods |
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16.cs
Outdated
Show resolved
Hide resolved
|
Tagging subscribers to this area: @dotnet/area-system-runtime |
|
@lilinus Thanks for your patience on this; we are planning to review it this month. I've updated the branch and CI will re-run, but there was an ApiCompat error in the build previously indicating |
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16.cs
Outdated
Show resolved
Hide resolved
Update XML doc comments and remove unneeded using statements. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
...tem.Runtime/tests/System.Runtime.Tests/System/Text/Unicode/Utf8UtilityTests.ValidateBytes.cs
Outdated
Show resolved
Hide resolved
…/Text/Unicode/Utf8UtilityTests.ValidateBytes.cs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Remove unnecessary blank line at the beginning of Utf16.cs
| } | ||
| } | ||
|
|
||
| private static bool ContainsIncompleteSurrogatePairs(ReadOnlySpan<string> values) |
There was a problem hiding this comment.
I recommend renaming this private method to ContainsInvalidValues to since the implementation no longer directly checks the surrogate pairs.
| public static unsafe int IndexOfInvalidSubsequence(ReadOnlySpan<char> value) | ||
| { | ||
| fixed (char* pValue = &MemoryMarshal.GetReference(value)) | ||
| { | ||
| char* pFirstInvalidChar = Utf16Utility.GetPointerToFirstInvalidChar(pValue, value.Length, out _, out _); | ||
| int index = (int)(pFirstInvalidChar - pValue); | ||
|
|
||
| return (index < value.Length) ? index : -1; | ||
| } | ||
| } |
There was a problem hiding this comment.
@jozkee We're trying to minimize our unsafe code additions, but I suspect this one is appropriate. Please assess this yourself and consult with @tannergooding and @PranavSenthilnathan if you agree it seems to be an appropriate use case.
If we do conclude this is an appropriate unsafe addition, please add a comment or a remarks section justifying it.
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16.cs
Outdated
Show resolved
Hide resolved
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| /// </summary> | ||
| /// <param name="value">The <see cref="ReadOnlySpan{T}"/> string.</param> | ||
| /// <returns><c>true</c> if value is well-formed UTF-8, <c>false</c> otherwise.</returns> | ||
| public static bool IsValid(ReadOnlySpan<byte> value) => |
There was a problem hiding this comment.
The XML docs for Utf8.IsValid still describe the input as a "ReadOnlySpan string", but the parameter is a UTF-8 byte span. Please update the text to describe a UTF-8 byte sequence (similar wording to the new IndexOfInvalidSubsequence docs) so the public API docs are accurate.
Fixes #118018
Fixes #118113