Conversation
adamsitnik
left a comment
There was a problem hiding this comment.
This implementation is correct, but not optimal. Since the main goal of this chunker is best performance, please improve the implementation based on my feedback.
Thank you for your contribution @KrystofS !
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
6a72a5b to
2a49ac9
Compare
adamsitnik
left a comment
There was a problem hiding this comment.
Overall the code looks good, but we can still avoid some allocations. PTAL at my comments, thank you @KrystofS !
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
…mentTokenChunker.cs Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
…mentTokenChunker.cs Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs
Outdated
Show resolved
Hide resolved
| ReadOnlyMemory<char> contentToProcess = elementContent.AsMemory(); | ||
| while (stringBuilderTokenCount + contentToProcessTokenCount >= _maxTokensPerChunk) | ||
| { | ||
| int index = _tokenizer.GetIndexByTokenCount( |
There was a problem hiding this comment.
This doesn't appear to be making any attempt to move the start/end of the chunk to a "good" location, e.g. this could be in the middle of a word?
There was a problem hiding this comment.
Correct. I don't think it is an issue because any RAG system should be resilient enough not to be affected by this assuming reasonable overlap size. Similarly this could take just a part of some table cell etc.
There was a problem hiding this comment.
Maybe. But I see other chunking systems going to great lengths to try to find good boundaries for the chunks.
There was a problem hiding this comment.
@stephentoub what other chunking systems are you referring to? Langchains TokenTextSplitter could split any word in similar fashion, so could TokenChunker from chonkie. I'd say it's true in general not for this type of token count based chunker.
There was a problem hiding this comment.
I suggest adding only a warning to documentation and keeping the current behavior.
There was a problem hiding this comment.
@stephentoub can we move on with this one? I in my testing this method actually performed the best in RAG tasks with the default settings on my test dataset.
There was a problem hiding this comment.
Assuming that it's documented, works fine in some cases and our competitors provide similar feature, I am fine merging it.
I in my testing this method actually performed the best in RAG tasks with the default settings on my test dataset.
Just out of curiosity, have you tried the HeaderChunker I've implemented?
adamsitnik
left a comment
There was a problem hiding this comment.
@KrystofS it's almost ready, PTAL at my last comment. Thanks!
src/Libraries/Microsoft.Extensions.DataIngestion/Microsoft.Extensions.DataIngestion.csproj
Show resolved
Hide resolved
| unsafe | ||
| { | ||
| fixed (char* ptr = &MemoryMarshal.GetReference(contentToProcess.Span)) | ||
| { | ||
| _ = stringBuilder.Append(ptr, index); | ||
| } | ||
| } |
There was a problem hiding this comment.
I suspect you are using unsafe to avoid string allocation of .NET Standard/Full Framework. I don't believe it's worth the struggle (we aim to not use unsafe at all when possible).
Please follow the pattern of passing span to builder on modern .NET and allocating otherwise:
There was a problem hiding this comment.
I recommended it. This could be a ton of string allocation, entirely unnecessarily. The unsafe use is very small and scoped and easily audited. I don't see a problem with it.
There was a problem hiding this comment.
I'm with @stephentoub on this one. I agree that unsafe use is contained to a very small portion of the code.
There was a problem hiding this comment.
We had a debate about the same thing in some other PR and I removed the unsafe.
We could at least move the unsafe to !NET:
#if NET
stringBuilder.Append(chars);
#else
unsafe goes here
#endif Or introduce an extension method that does take care of that of !NET
But I don't want to block @KrystofS, we can deal with it later.
cc @EgorBo Who is leading the effort of unsafe removal.
adamsitnik
left a comment
There was a problem hiding this comment.
LGTM, thank you for your contribution @KrystofS !
| unsafe | ||
| { | ||
| fixed (char* ptr = &MemoryMarshal.GetReference(contentToProcess.Span)) | ||
| { | ||
| _ = stringBuilder.Append(ptr, index); | ||
| } | ||
| } |
There was a problem hiding this comment.
We had a debate about the same thing in some other PR and I removed the unsafe.
We could at least move the unsafe to !NET:
#if NET
stringBuilder.Append(chars);
#else
unsafe goes here
#endif Or introduce an extension method that does take care of that of !NET
But I don't want to block @KrystofS, we can deal with it later.
cc @EgorBo Who is leading the effort of unsafe removal.
| ReadOnlyMemory<char> contentToProcess = elementContent.AsMemory(); | ||
| while (stringBuilderTokenCount + contentToProcessTokenCount >= _maxTokensPerChunk) | ||
| { | ||
| int index = _tokenizer.GetIndexByTokenCount( |
There was a problem hiding this comment.
Assuming that it's documented, works fine in some cases and our competitors provide similar feature, I am fine merging it.
I in my testing this method actually performed the best in RAG tasks with the default settings on my test dataset.
Just out of curiosity, have you tried the HeaderChunker I've implemented?
|
@adamsitnik I have not tried |
Microsoft Reviewers: Open in CodeFlow