Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses issue #626 by implementing sentence-level splitting for long non-Scripture USFM segments (>200 characters) to improve translation quality. The solution splits long introductory paragraphs and other non-verse content into individual sentences before translation, then recombines them afterwards while preserving structure.
Changes:
- Refactored translation data structures (SentenceTranslation, SentenceTranslationGroup, TranslatedDraft, DraftGroup) into a new module
translation_data_structures.pyto support new combining operations - Introduced
UsfmTextRowCollectionandTranslatedTextRowCollectionclasses to handle splitting and recombining of USFM text rows - Created
NLTKSentenceTokenizerwrapper with caching to standardize sentence tokenization across the codebase
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| silnlp/common/translation_data_structures.py | New file containing refactored translation data structures with combine() methods to merge split sentence translations |
| silnlp/common/usfm_utils.py | Added UsfmTextRowCollection to split long non-verse rows and TranslatedTextRowCollection to recombine translations |
| silnlp/common/utils.py | Added NLTKSentenceTokenizer class with language-aware sentence splitting and instance caching |
| silnlp/common/translator.py | Refactored to use new UsfmTextRowCollection; removed old data structure classes; cleaned up USFM processing logic |
| silnlp/nmt/translate.py | Updated imports to reference new translation_data_structures module |
| silnlp/nmt/hugging_face_config.py | Updated imports and changed return type from list to SentenceTranslationGroup |
| silnlp/nmt/config.py | Updated imports for SentenceTranslationGroup |
| silnlp/common/translate_google.py | Updated to use new SentenceTranslationGroup class and updated imports |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
benjaminking
left a comment
There was a problem hiding this comment.
@benjaminking made 9 comments and resolved 9 discussions.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
benjaminking
left a comment
There was a problem hiding this comment.
@benjaminking made 5 comments and resolved 5 discussions.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.
TaperChipmunk32
left a comment
There was a problem hiding this comment.
@TaperChipmunk32 made 1 comment.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.
This PR fixes #626 by splitting long non-Scripture segments in USFM documents using the NLTK sentence tokenizer and re-combining the segments after translation. This also required some refactorizing, which is what many of the changes are.
This change is