Skip to content

Split long introductory sentences#951

Open
benjaminking wants to merge 10 commits intomasterfrom
split_introductory_sentences
Open

Split long introductory sentences#951
benjaminking wants to merge 10 commits intomasterfrom
split_introductory_sentences

Conversation

@benjaminking
Copy link
Collaborator

@benjaminking benjaminking commented Feb 26, 2026

This PR fixes #626 by splitting long non-Scripture segments in USFM documents using the NLTK sentence tokenizer and re-combining the segments after translation. This also required some refactorizing, which is what many of the changes are.


This change is Reviewable

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses issue #626 by implementing sentence-level splitting for long non-Scripture USFM segments (>200 characters) to improve translation quality. The solution splits long introductory paragraphs and other non-verse content into individual sentences before translation, then recombines them afterwards while preserving structure.

Changes:

  • Refactored translation data structures (SentenceTranslation, SentenceTranslationGroup, TranslatedDraft, DraftGroup) into a new module translation_data_structures.py to support new combining operations
  • Introduced UsfmTextRowCollection and TranslatedTextRowCollection classes to handle splitting and recombining of USFM text rows
  • Created NLTKSentenceTokenizer wrapper with caching to standardize sentence tokenization across the codebase

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
silnlp/common/translation_data_structures.py New file containing refactored translation data structures with combine() methods to merge split sentence translations
silnlp/common/usfm_utils.py Added UsfmTextRowCollection to split long non-verse rows and TranslatedTextRowCollection to recombine translations
silnlp/common/utils.py Added NLTKSentenceTokenizer class with language-aware sentence splitting and instance caching
silnlp/common/translator.py Refactored to use new UsfmTextRowCollection; removed old data structure classes; cleaned up USFM processing logic
silnlp/nmt/translate.py Updated imports to reference new translation_data_structures module
silnlp/nmt/hugging_face_config.py Updated imports and changed return type from list to SentenceTranslationGroup
silnlp/nmt/config.py Updated imports for SentenceTranslationGroup
silnlp/common/translate_google.py Updated to use new SentenceTranslationGroup class and updated imports

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Collaborator Author

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjaminking made 9 comments and resolved 9 discussions.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Collaborator Author

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjaminking made 5 comments and resolved 5 discussions.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.

Copy link
Collaborator

@TaperChipmunk32 TaperChipmunk32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@TaperChipmunk32 made 1 comment.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Drafts for introductory paragraphs are often less accurate due to their extended length (> 200 tokens)

3 participants