
[API Proposal]: Indexed version of Utf8.IsValid #118113

@hamarb123

Background and motivation

The API I want is basically GetIndexOfFirstInvalidUtf8Sequence, which IsValid is based on:

public static bool IsValid(ReadOnlySpan<byte> value) =>
    Utf8Utility.GetIndexOfFirstInvalidUtf8Sequence(value, out _) < 0;

I have a separate (but related) use case for this from the one in #118018, and would ideally want both UTF-8 & UTF-16 versions of this API.

It would return the index of the first invalid byte/char, or -1 if the whole input is valid.
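To make the intended return value concrete, here is a rough sketch of the semantics built only on the existing Rune.DecodeFromUtf8 and Rune.DecodeFromUtf16 APIs. The class name SequenceIndexSketch is hypothetical, and this is not how the proposed methods would be implemented (a real version would presumably reuse the vectorised Utf8Utility helper); it also assumes a truncated sequence at the end of the input counts as invalid, which matches what IsValid reports today.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical illustration of the proposed semantics; not the real implementation.
public static class SequenceIndexSketch
{
    // Returns the index of the first byte that starts an invalid (or truncated)
    // UTF-8 sequence, or -1 if the whole span is well-formed UTF-8.
    public static int IndexOfInvalidByte(ReadOnlySpan<byte> value)
    {
        int index = 0;
        while (index < value.Length)
        {
            OperationStatus status = Rune.DecodeFromUtf8(value.Slice(index), out _, out int consumed);
            if (status != OperationStatus.Done) return index;
            index += consumed;
        }
        return -1;
    }

    // Returns the index of the first unpaired surrogate, or -1 if the whole
    // span is well-formed UTF-16.
    public static int IndexOfInvalidChar(ReadOnlySpan<char> value)
    {
        int index = 0;
        while (index < value.Length)
        {
            OperationStatus status = Rune.DecodeFromUtf16(value.Slice(index), out _, out int consumed);
            if (status != OperationStatus.Done) return index;
            index += consumed;
        }
        return -1;
    }
}
```

With this sketch, the bytes 0x61 0x62 0xFF give 2, and the string "a\uD800b" gives 1, since the lone high surrogate sits at index 1.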

My main use case would be transcoding to/from a format which can store all char values in UTF-8 compatible format (I've been informed by @GrabYourPitchforks that the format I came up with is equivalent to https://simonsapin.github.io/wtf-8/ lol), & would use these APIs to achieve that - but I also want to be able to perform the following tasks:

  • Case-insensitive comparison of my UTF-8-like sequence
  • UTF-normalisation insensitive comparison of my UTF-8-like sequence
  • Combo of the above

These comparisons need the data to be transcoded to UTF-16 first (outside of the ASCII case), ideally in chunks, while still validating that I have a valid string, and with special handling for U+FFFD (since normalisation complains about that one). So I would benefit quite greatly from an index-of API to achieve this, rather than e.g. just implementing something like Encoding.Wtf8 & friends natively.

I'm going to list both Utf8 & Utf16 variants of this API based on the assumption that #118018 gets approved - so the Utf16 version should be based on whatever shape it ends up with presumably.

/cc @GrabYourPitchforks

API Proposal

namespace System.Text.Unicode;

public static class Utf8
{
    public static int IndexOfInvalidByte(ReadOnlySpan<byte> value) => ...;
}

public static class Utf16
{
    public static int IndexOfInvalidChar(ReadOnlySpan<char> value) => ...;
}

We might also want to include equivalents on the Ascii class:

namespace System.Text;

public static class Ascii
{
    public static int IndexOfInvalidByte(ReadOnlySpan<byte> value) => ...;
    public static int IndexOfInvalidChar(ReadOnlySpan<char> value) => ...;
}
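For the Ascii variants, the behaviour can already be approximated with the existing span helper IndexOfAnyExceptInRange (available from .NET 8). The sketch below only illustrates the intended result; AsciiIndexSketch is a hypothetical name, and a dedicated API might well be implemented differently.

```csharp
using System;

// Hypothetical illustration; the proposed Ascii methods do not exist yet.
public static class AsciiIndexSketch
{
    // Index of the first byte outside 0x00..0x7F, or -1 if all bytes are ASCII.
    public static int IndexOfInvalidByte(ReadOnlySpan<byte> value) =>
        value.IndexOfAnyExceptInRange((byte)0x00, (byte)0x7F);

    // Index of the first char outside U+0000..U+007F, or -1 if all chars are ASCII.
    public static int IndexOfInvalidChar(ReadOnlySpan<char> value) =>
        value.IndexOfAnyExceptInRange('\u0000', '\u007F');
}
```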

API Usage

// Encoding UTF-16 to modified UTF-8
string s = ...;
var sp = s.AsSpan();
byte[] buffer = ArrayPool<byte>.Shared.Rent(sp.Length);
var len = 0;
while (true)
{
    // Find the longest valid UTF-16 prefix of what remains
    var idx = Utf16.IndexOfInvalidChar(sp);
    if (idx < 0) idx = sp.Length;
    // ... ensure buffer is big enough based on actual chars we're about to add
    len += Encoding.UTF8.GetBytes(sp[..idx], buffer.AsSpan(len)); // track the bytes written
    sp = sp[idx..]; // skip the valid prefix we just encoded
    if (sp.IsEmpty) break; // nothing left to encode
    // ... encode 1 UTF-16 character into our buffer in our modified UTF-8 format (advancing len)
    sp = sp[1..];
}
var result = buffer.AsSpan(0, len);

Alternative Designs

We may or may not want to add this to Ascii as well; I suspect it will end up being useful in my optimised versions of some algorithms, but I'm not 100% certain yet.

Another option could be to add it as an overload, rather than a separately named method, like so: public static bool IsValid(ReadOnlySpan<...> value, out int validLength) (or maybe indexOfFirstInvalidSequence rather than validLength); but this would then have the same shape as Base64.IsValid(value, out var length), which has a different meaning.

Risks

Confusing name.
