Description
Background and motivation
The API I want is basically GetIndexOfFirstInvalidUtf8Sequence, which IsValid is based on:
runtime/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8.cs
Lines 826 to 827 in dc6ac3a
public static bool IsValid(ReadOnlySpan<byte> value) =>
    Utf8Utility.GetIndexOfFirstInvalidUtf8Sequence(value, out _) < 0;
I have a separate (but related) use case for this from the one behind #118018, and would ideally want both UTF-8 and UTF-16 versions of this API.
It would return the index of the first invalid byte/char, or -1 if the whole input is valid.
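To make the intended return value concrete, here is a rough sketch of the semantics built on the existing Rune.DecodeFromUtf8 and Rune.DecodeFromUtf16 APIs. The class name InvalidIndexSketch is hypothetical, and the scalar loop is only illustrative; a real implementation would presumably reuse the existing Utf8Utility.GetIndexOfFirstInvalidUtf8Sequence path rather than decode rune by rune.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper, not part of the BCL: illustrates the proposed semantics only.
public static class InvalidIndexSketch
{
    // Index of the first byte that does not begin a well-formed UTF-8 sequence, or -1.
    public static int IndexOfInvalidByte(ReadOnlySpan<byte> value)
    {
        int index = 0;
        while (index < value.Length)
        {
            // Done is reported only for a complete, well-formed sequence;
            // InvalidData and NeedMoreData (truncated tail) both count as invalid here.
            if (Rune.DecodeFromUtf8(value.Slice(index), out _, out int consumed) != OperationStatus.Done)
                return index;
            index += consumed;
        }
        return -1;
    }

    // Index of the first unpaired surrogate, or -1 if the input is well-formed UTF-16.
    public static int IndexOfInvalidChar(ReadOnlySpan<char> value)
    {
        int index = 0;
        while (index < value.Length)
        {
            if (Rune.DecodeFromUtf16(value.Slice(index), out _, out int consumed) != OperationStatus.Done)
                return index;
            index += consumed;
        }
        return -1;
    }
}
```

For example, InvalidIndexSketch.IndexOfInvalidChar("ab\uD800c") returns 2 (the lone high surrogate), and InvalidIndexSketch.IndexOfInvalidByte(new byte[] { 0x61, 0xC3, 0x28 }) returns 1 (a lead byte with no continuation byte).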
My main use case would be transcoding to/from a format which can store all char values in UTF-8 compatible format (I've been informed by @GrabYourPitchforks that the format I came up with is equivalent to https://simonsapin.github.io/wtf-8/ lol), & would use these APIs to achieve that - but I also want to be able to perform the following tasks:
- Case-insensitive comparison of my UTF-8-like sequence
- UTF-normalisation insensitive comparison of my UTF-8-like sequence
- Combo of the above
All of these need to be transcoded to UTF-16 to be done (outside of the ASCII case), and would ideally be done in chunks, but still with validation to ensure I have a valid string, and with special handling for U+FFFD (since normalisation complains about that one). So I really would benefit quite greatly from an index-of API to achieve this, rather than e.g. just implementing something like Encoding.Wtf8 & friends natively.
I'm going to list both Utf8 & Utf16 variants of this API based on the assumption that #118018 gets approved - so the Utf16 version should be based on whatever shape it ends up with presumably.
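As a sketch of the one step a transcoder like this cannot delegate to Encoding.UTF8: WTF-8 stores an unpaired surrogate as an ordinary 3-byte sequence (generalized UTF-8), so U+D800..U+DFFF become ED A0 80 through ED BF BF. The Wtf8Sketch class and EncodeLoneSurrogate method are hypothetical names for illustration, not an existing or proposed API.

```csharp
using System;

// Hypothetical helper for encoding a single unpaired surrogate; not an existing API.
public static class Wtf8Sketch
{
    // Writes an unpaired surrogate as a 3-byte generalized UTF-8 sequence
    // and returns the number of bytes written (always 3).
    public static int EncodeLoneSurrogate(char c, Span<byte> destination)
    {
        if (!char.IsSurrogate(c))
            throw new ArgumentException("Expected a surrogate code unit.", nameof(c));

        destination[0] = (byte)(0xE0 | (c >> 12));          // always 0xED for U+D800..U+DFFF
        destination[1] = (byte)(0x80 | ((c >> 6) & 0x3F));  // 0xA0..0xBF
        destination[2] = (byte)(0x80 | (c & 0x3F));         // 0x80..0xBF
        return 3;
    }
}
```

The caller is assumed to have already established (for instance via an index-of-invalid API) that the surrogate really is unpaired; a valid surrogate pair must still go through the normal UTF-8 encoder as a single 4-byte sequence.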
API Proposal
namespace System.Text.Unicode;
public static class Utf8
{
    public static int IndexOfInvalidByte(ReadOnlySpan<byte> value) => ...;
}
public static class Utf16
{
    public static int IndexOfInvalidChar(ReadOnlySpan<char> value) => ...;
}
We might also want to include equivalents on the Ascii class:
namespace System.Text;
public static class Ascii
{
    public static int IndexOfInvalidByte(ReadOnlySpan<byte> value) => ...;
    public static int IndexOfInvalidChar(ReadOnlySpan<char> value) => ...;
}
API Usage
// Encoding UTF-16 to modified UTF-8
string s = ...;
var sp = s.AsSpan();
byte[] buffer = ArrayPool<byte>.Shared.Rent(sp.Length);
var len = 0;
// Loop until all input has been consumed
while (sp.Length > 0)
{
    var idx = Utf16.IndexOfInvalidChar(sp);
    if (idx < 0) idx = sp.Length;
    // ... ensure buffer is big enough based on actual chars we're about to add
    // Encode the valid prefix and track how many bytes were written
    len += Encoding.UTF8.GetBytes(sp[..idx], buffer.AsSpan(len));
    // Skip the chars just encoded (not the number of bytes written)
    sp = sp[idx..];
    if (sp.Length > 0)
    {
        // ... encode 1 UTF-16 character into our buffer in our modified UTF-8 format, advancing len
        sp = sp[1..];
    }
}
var result = buffer.AsSpan(0, len);
Alternative Designs
We may or may not want to add this to Ascii as well; I suspect it will end up being useful in my optimised versions of some algorithms, but I'm not 100% certain yet.
Another option could be to add it as an overload, rather than a separately named method, like so: public static bool IsValid(ReadOnlySpan<...> value, out int validLength) (or maybe indexOfFirstInvalidSequence rather than validLength); but this would then have the same shape as Base64.IsValid(value, out var length), which has a different meaning.
Risks
Confusing name.