Add DocumentTokenChunker by KrystofS · Pull Request #7093 · dotnet/extensions

KrystofS · 2025-11-30T18:17:28Z

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

adamsitnik

This implementation is correct, but not optimal. Since the main goal of this chunker is best performance, please improve the implementation based on my feedback.

Thank you for your contribution @KrystofS !

src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs

adamsitnik

Overall the code looks good, but we can still avoid some allocations. PTAL at my comments, thank you @KrystofS !

src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs

stephentoub · 2025-12-10T04:04:25Z

src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs

+                ReadOnlyMemory<char> contentToProcess = elementContent.AsMemory();
+                while (stringBuilderTokenCount + contentToProcessTokenCount >= _maxTokensPerChunk)
+                {
+                    int index = _tokenizer.GetIndexByTokenCount(


This doesn't appear to be making any attempt to move the start/end of the chunk to a "good" location, e.g. this could be in the middle of a word?

Correct. I don't think it is an issue, because any RAG system should be resilient enough not to be affected by this, assuming a reasonable overlap size. Similarly, a chunk could contain just part of a table cell, etc.

Maybe. But I see other chunking systems going to great lengths to try to find good boundaries for the chunks.

@stephentoub what other chunking systems are you referring to? LangChain's TokenTextSplitter could split any word in a similar fashion, and so could TokenChunker from chonkie. I'd say that's true in general, but not for this type of token-count-based chunker.

I suggest adding only a warning to documentation and keeping the current behavior.

@stephentoub can we move on with this one? In my testing, this method actually performed the best in RAG tasks with the default settings on my test dataset.

I'll leave it up to @adamsitnik.

Assuming that it's documented, works fine in some cases, and our competitors provide a similar feature, I am fine with merging it.

In my testing, this method actually performed the best in RAG tasks with the default settings on my test dataset.

Just out of curiosity, have you tried the HeaderChunker I've implemented?
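The boundary concern in this thread can be illustrated with a minimal sketch. It is not the PR's implementation: `ToyTokenChunker` is a hypothetical name, and the "tokenizer" simply treats every 4 characters as one token, which is an assumption made so the example stays self-contained (a real subword tokenizer behaves differently). It shows how a purely token-count-based cut can land mid-word, and how overlap lets the next chunk carry the split word whole.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the DocumentTokenChunker from the PR.
internal static class ToyTokenChunker
{
    // Assumption: one "token" is exactly 4 characters.
    private const int CharsPerToken = 4;

    public static List<string> Chunk(string text, int maxTokensPerChunk, int overlapTokens)
    {
        if (maxTokensPerChunk <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxTokensPerChunk));
        }

        if (overlapTokens < 0 || overlapTokens >= maxTokensPerChunk)
        {
            throw new ArgumentOutOfRangeException(nameof(overlapTokens));
        }

        int chunkChars = maxTokensPerChunk * CharsPerToken;
        // Each new chunk starts overlapTokens before the previous chunk ended.
        int stepChars = (maxTokensPerChunk - overlapTokens) * CharsPerToken;
        var chunks = new List<string>();

        for (int start = 0; start < text.Length; start += stepChars)
        {
            // The cut position depends only on the token count,
            // so it may fall in the middle of a word.
            int length = Math.Min(chunkChars, text.Length - start);
            chunks.Add(text.Substring(start, length));

            if (start + length >= text.Length)
            {
                break; // the last chunk reached the end of the text
            }
        }

        return chunks;
    }
}
```

With these toy settings, `ToyTokenChunker.Chunk("The quick brown fox jumps", 3, 1)` yields `"The quick br"`, `"k brown fox "`, and `"fox jumps"`: the first chunk cuts "brown" in half, while the one-token overlap means the second chunk contains the whole word.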

adamsitnik

@KrystofS it's almost ready, PTAL at my last comment. Thanks!

src/Libraries/Microsoft.Extensions.DataIngestion/Microsoft.Extensions.DataIngestion.csproj

adamsitnik · 2025-12-11T17:32:03Z

src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs

+                    unsafe
+                    {
+                        fixed (char* ptr = &MemoryMarshal.GetReference(contentToProcess.Span))
+                        {
+                            _ = stringBuilder.Append(ptr, index);
+                        }
+                    }


I suspect you are using unsafe to avoid a string allocation on .NET Standard/.NET Framework. I don't believe it's worth the struggle (we aim to avoid unsafe entirely where possible).

Please follow the pattern of passing span to builder on modern .NET and allocating otherwise:

extensions/src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs

Lines 229 to 233 in e801349

#if NET
stringBuilder.Append(chars);
#else
stringBuilder.Append(chars.ToString());
#endif

I recommended it. This could be a ton of string allocation, entirely unnecessarily. The unsafe use is very small and scoped and easily audited. I don't see a problem with it.

I'm with @stephentoub on this one. I agree that unsafe use is contained to a very small portion of the code.

We had a debate about the same thing in some other PR and I removed the unsafe.

We could at least move the unsafe to !NET:

#if NET
stringBuilder.Append(chars);
#else
// unsafe goes here
#endif

Or introduce an extension method that takes care of that for !NET.

But I don't want to block @KrystofS, we can deal with it later.

cc @EgorBo, who is leading the unsafe removal effort.
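The extension-method idea suggested above could look roughly like the sketch below. The names `StringBuilderSpanExtensions` and `AppendSpan` are assumptions for illustration and do not come from the PR. On modern .NET it calls the span overload of `StringBuilder.Append`; on older targets it pins the span and calls the pointer-based overload, as in the diff under review, which would require `AllowUnsafeBlocks` in the project file.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper; the type and method names are illustrative assumptions.
internal static class StringBuilderSpanExtensions
{
    public static StringBuilder AppendSpan(this StringBuilder builder, ReadOnlySpan<char> chars)
    {
#if NET
        // Modern .NET has a span overload, so no intermediate string is allocated.
        return builder.Append(chars);
#else
        if (chars.IsEmpty)
        {
            return builder; // nothing to append, and nothing to pin
        }

        // Older targets lack the span overload: pin the span and use the
        // pointer-based overload instead of allocating via chars.ToString().
        unsafe
        {
            fixed (char* ptr = &MemoryMarshal.GetReference(chars))
            {
                return builder.Append(ptr, chars.Length);
            }
        }
#endif
    }
}
```

A call site would then read `stringBuilder.AppendSpan(contentToProcess.Span.Slice(0, index))` on every target, keeping the unsafe code confined to a single, easily audited helper.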

adamsitnik

LGTM, thank you for your contribution @KrystofS !

adamsitnik · 2025-12-12T19:37:15Z

Just out of curiosity, have you tried the HeaderChunker I've implemented?

KrystofS · 2025-12-12T19:48:09Z

@adamsitnik I have not tried HeaderChunker. I can extend my testing and let you know the results.

Copilot AI review requested due to automatic review settings November 30, 2025 18:17

github-actions bot added the area-telemetry label Nov 30, 2025

Copilot started reviewing on behalf of KrystofS November 30, 2025 18:17

dotnet-policy-service bot assigned KrystofS Nov 30, 2025

Copilot AI reviewed Nov 30, 2025

Copilot finished reviewing on behalf of KrystofS November 30, 2025 18:42

adamsitnik reviewed Dec 5, 2025

KrystofS requested a review from adamsitnik December 6, 2025 12:00

KrystofS added 7 commits December 8, 2025 17:46

Create DocumentTokenChunker.cs

4b06bfb

add DocumentTokenChunker tests

6f24e77

refactor DocumentTokenChunker

7aecf83

extend test coverage

e05b4e7

update docs

a719837

fix sonar issue

bb8f975

remove extra blank line

2a49ac9

KrystofS force-pushed the feature/DocumentTokenChunker branch from 6a72a5b to 2a49ac9 December 8, 2025 16:46

adamsitnik reviewed Dec 8, 2025

KrystofS and others added 2 commits December 8, 2025 18:10

Update src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/Docu…

68d6827

…mentTokenChunker.cs Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>

Update src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/Docu…

408db33

…mentTokenChunker.cs Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>

adamsitnik requested a review from Copilot December 8, 2025 17:44

Copilot started reviewing on behalf of adamsitnik December 8, 2025 17:45

Copilot AI reviewed Dec 8, 2025

adamsitnik requested a review from stephentoub December 8, 2025 18:59

stephentoub reviewed Dec 9, 2025

stephentoub reviewed Dec 9, 2025

stephentoub reviewed Dec 9, 2025

implement review suggestions

68b6ed6

KrystofS requested a review from stephentoub December 10, 2025 00:15

stephentoub reviewed Dec 10, 2025

KrystofS requested a review from stephentoub December 10, 2025 17:33

adamsitnik reviewed Dec 11, 2025

adamsitnik approved these changes Dec 12, 2025

adamsitnik merged commit 20db541 into dotnet:main Dec 12, 2025
6 checks passed

github-actions bot locked and limited conversation to collaborators Jan 12, 2026
