Introducing SentencePiece Unigram Tokenizer Model #7390
Merged
tarekgh merged 9 commits into dotnet:main on Feb 25, 2025
Copilot reviewed 11 out of 14 changed files in this pull request and generated 1 comment.
Files not reviewed (3)
- THIRD-PARTY-NOTICES.TXT: Language not supported
- eng/Versions.props: Language not supported
- src/Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs: Evaluated as low risk
ericstj (Member) reviewed on Feb 18, 2025 and left a comment:
Didn't dig too deep; just a few small observations and suggestions.
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
src/Microsoft.ML.Tokenizers/Normalizer/SentencePieceNormalizer.cs:392
- In this loop, calling sp.Slice(1) does not update the variable 'sp', which may result in an infinite loop if the condition remains true. Consider reassigning the sliced span to 'sp' (e.g., sp = sp.Slice(1)).
while (isPrevSpace && sp.Length > 0 && sp[0] == (byte)' ') { sp.Slice(1); }
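For context on the flagged line: ReadOnlySpan<T>.Slice (like Span<T>.Slice) returns a new span and leaves the receiver unchanged, so the result has to be assigned back for the loop to make progress. Below is a minimal sketch of the suggested fix. The SpanTrimSketch class and SkipLeadingSpaces helper are hypothetical names used only for illustration and are not part of SentencePieceNormalizer.cs; the loop condition is copied from the quoted snippet.

```csharp
using System;

public static class SpanTrimSketch
{
    // Hypothetical helper (not from the PR): skips leading ASCII spaces
    // when the previous character was a space.
    public static ReadOnlySpan<byte> SkipLeadingSpaces(ReadOnlySpan<byte> sp, bool isPrevSpace)
    {
        while (isPrevSpace && sp.Length > 0 && sp[0] == (byte)' ')
        {
            // Slice does not mutate 'sp'; reassign so the loop advances.
            sp = sp.Slice(1);
        }

        return sp;
    }
}
```

Without the reassignment, the condition never changes once it is true, which is the infinite-loop risk the comment describes.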
stephentoub reviewed on Feb 19, 2025 (7 reviews)
luisquintanilla (Contributor) left a comment:
Typo - SentencePieceUingramModel.cs should be SentencePieceUnigramModel.cs
ericstj reviewed on Feb 20, 2025 (5 reviews)
ericstj (Member) approved these changes on Feb 21, 2025 and left a comment:
Thank you for addressing feedback. This looks good to me. Please check if @michaelgsharp has feedback too.
michaelgsharp approved these changes on Feb 24, 2025
Fixes #7186
We have supported the SentencePiece BPE model for a while; this change adds support for the SentencePiece Unigram tokenizer model.
Users can create a tokenizer instance with code like the following:
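A rough sketch of what such code might look like, assuming the SentencePieceTokenizer.Create(Stream, ...) factory and its addBeginOfSentence/addEndOfSentence parameters as exposed by the Microsoft.ML.Tokenizers package. The class name, method name, and model path are hypothetical and used only for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

public static class UnigramTokenizerSketch
{
    // Hypothetical helper: loads a SentencePiece model file and encodes text to token ids.
    // 'modelPath' is assumed to point at a SentencePiece model file (BPE or Unigram).
    public static IReadOnlyList<int> Encode(string modelPath, string text)
    {
        using Stream modelStream = File.OpenRead(modelPath);

        // Assumption: Create reads the model and picks the matching
        // implementation (BPE or Unigram) from the model data.
        SentencePieceTokenizer tokenizer = SentencePieceTokenizer.Create(
            modelStream,
            addBeginOfSentence: true,
            addEndOfSentence: false);

        return tokenizer.EncodeToIds(text);
    }
}
```

Because the tokenizer is meant to handle both model types behind one abstraction, calling code should not need to know in advance which kind of model the file contains.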
Notes around the change
- Existing code was moved from the SentencePieceTokenizer.cs file into the newly introduced files SentencePieceBaseModel.cs and SentencePieceBpeModel.cs. Most of the code in the newly introduced files is unchanged.
- The Unigram model is implemented in the new SentencePieceUnigramModel.cs file.
- SentencePieceTokenizer.cs was updated to work with the model abstraction and automatically handle both models, BPE and Unigram.