Add migration CLI and document process #2

Open
JeffCarpenter wants to merge 3 commits into main from p26w8z-codex/plan-feature-to-infer-embeddings-for-starred-repos

Conversation

@JeffCarpenter
Owner

Summary

  • implement migrate command to create embedding tables
  • document migration workflow in README
  • mark migration items complete in PLAN
  • test the new command

Testing

  • pytest -q
  • pytest --cov=github_to_sqlite -q

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28

Copilot AI review requested due to automatic review settings June 11, 2025 17:07

Copilot AI left a comment

Pull Request Overview

This PR adds a new migrate CLI command to set up embedding-related tables, updates dependencies, and documents the migration workflow.

  • Introduce migrate command to create embedding tables (repo_embeddings, readme_chunk_embeddings, repo_build_files, repo_metadata)
  • Update utils.ensure_embedding_tables and wire it into ensure_db_shape
  • Add tests for migration, embedding tables, and chunking utilities; update README and PLAN to document the new workflow

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Summary per file:

| File | Description |
| --- | --- |
| tests/test_starred.py | Expect new embedding-related tables in schema |
| tests/test_simple_chunker.py | Add unit tests for SimpleChunker |
| tests/test_sentencizer_chunker.py | Add tests for BasicSentencizerChunker |
| tests/test_repos.py | Expect new embedding tables in repo tests |
| tests/test_migrate_command.py | Test migrate CLI creates tables |
| tests/test_embedding_tables.py | Test ensure_db_shape creates embedding tables |
| tests/test_config.py | Test tokenizer config loading |
| tests/test_chunk_readme.py | Add tests for chunk_readme fallback and plugin |
| setup.py | Expand install_requires and extras_require |
| github_to_sqlite/utils.py | Implement ensure_embedding_tables and chunk_readme |
| github_to_sqlite/tokenization.py | Define load_tokenizer via functools.partial |
| github_to_sqlite/simple_chunker.py | Add SimpleChunker with sentence splitting |
| github_to_sqlite/sentencizer_chunker.py | Add BasicSentencizerChunker |
| github_to_sqlite/config.py | Introduce Config model for defaults |
| github_to_sqlite/cli.py | Add migrate command |
| README.md | Document running migrations |
| PLAN.md | Mark embedding-plan items as complete |
| AGENTS.md | Link to PLAN.md |

Comments suppressed due to low confidence (1)

github_to_sqlite/utils.py:940

  • The function uses re.split but re is not imported; add import re at the top of the file to avoid a NameError.
```python
return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]
```

```python
extras_require={
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    "semantic_chunkers": [
        "semantic-chunkers @ https://github.com/aurelio-labs/semantic-chunkers/archive/refs/tags/v0.1.1.tar.gz"
```

Copilot AI Jun 11, 2025


[nitpick] The same GitHub URL appears under both semantic_chunkers and semantic-transformers extras; consider consolidating or renaming one key to avoid duplication.

@sourcery-ai

sourcery-ai bot commented Jun 11, 2025

Reviewer's Guide

Introduces a migration CLI that initializes embedding storage tables via a new utility which conditionally loads the sqlite-vec extension and creates the necessary tables. It also adds functions to chunk README content, extends dependencies for embedding and chunking, documents the migration workflow, and supplies tests for all added features.

Sequence Diagram for migrate Command Execution

```mermaid
sequenceDiagram
    actor User
    participant CLI as "CLI (migrate cmd)"
    participant DB as "sqlite_utils.Database"
    participant Utils as "github_to_sqlite.utils"

    User->>CLI: Executes `github-to-sqlite migrate github.db`
    CLI->>DB: new Database(db_path)
    CLI->>Utils: ensure_embedding_tables(db)
    activate Utils
    Utils->>Utils: _maybe_load_sqlite_vec(db)
    alt sqlite-vec available
        Utils->>DB: execute("CREATE VIRTUAL TABLE repo_embeddings ...")
        Utils->>DB: execute("CREATE VIRTUAL TABLE readme_chunk_embeddings ...")
    else sqlite-vec not available
        Utils->>DB: db["repo_embeddings"].create(...)
        Utils->>DB: db["readme_chunk_embeddings"].create(...)
    end
    Utils->>DB: db["repo_build_files"].create(...)
    Utils->>DB: db["repo_metadata"].create(...)
    deactivate Utils
    CLI->>User: Prints "Database migrated"
```
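
The flow above corresponds to a very small command. A minimal sketch of what it could look like, assuming a standalone click command wired to ensure_embedding_tables (the real implementation lives in github_to_sqlite/cli.py and may differ in decorators and help text):

```python
# Hypothetical sketch of the migrate command; the actual version is defined
# in github_to_sqlite/cli.py as part of the existing click group.
import click
import sqlite_utils

from github_to_sqlite import utils


@click.command()
@click.argument(
    "db_path",
    type=click.Path(file_okay=True, dir_okay=False, allow_dash=False),
    required=True,
)
def migrate(db_path):
    "Create the embedding-related tables without fetching any GitHub data"
    db = sqlite_utils.Database(db_path)
    utils.ensure_embedding_tables(db)
    click.echo("Database migrated")
```

Invoked as `github-to-sqlite migrate github.db`, matching the sequence diagram.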

Entity Relationship Diagram for New Embedding Tables

```mermaid
erDiagram
    repos {
        int id PK
        string name
    }
    repo_embeddings {
        int repo_id PK
        string title_embedding "BLOB or VECTOR (float[768])"
        string description_embedding "BLOB or VECTOR (float[768])"
        string readme_embedding "BLOB or VECTOR (float[768])"
    }
    readme_chunk_embeddings {
        int repo_id PK
        int chunk_index PK
        string chunk_text
        string embedding "BLOB or VECTOR (float[768])"
    }
    repo_build_files {
        int repo_id PK
        string file_path PK
        string metadata "JSON"
    }
    repo_metadata {
        int repo_id PK
        string language
        string directory_tree
    }

    repos ||--o{ repo_embeddings : "FK: repo_id"
    repos ||--o{ readme_chunk_embeddings : "FK: repo_id"
    repos ||--o{ repo_build_files : "FK: repo_id"
    repos ||--o{ repo_metadata : "FK: repo_id"
```
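
On the fallback path (no sqlite-vec) the embedding columns are plain BLOBs, so a 768-dimensional float vector has to be packed into bytes before it is stored. A minimal sketch of that round trip with sqlite-utils and numpy; the serialization format here is an assumption, not necessarily what the PR uses:

```python
import numpy as np
import sqlite_utils

db = sqlite_utils.Database("github.db")
embedding = np.random.rand(768).astype(np.float32)

# Store the vector as raw float32 bytes in the BLOB column
db["repo_embeddings"].upsert(
    {"repo_id": 1, "readme_embedding": embedding.tobytes()},
    pk="repo_id",
    alter=True,
)

# Read it back and reinterpret the BLOB as a float32 vector
row = db["repo_embeddings"].get(1)
restored = np.frombuffer(row["readme_embedding"], dtype=np.float32)
assert restored.shape == (768,)
```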

Class Diagram for Config in config.py

```mermaid
classDiagram
      class Config {
        +str default_model
        +str onnx_provider
        +int max_length
      }
```

Class Diagram for Token Dataclass in tokenization.py

```mermaid
classDiagram
      class Token {
        <<Dataclass>>
        +int id
        +str value
        +tuple offsets
      }
```
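
The Token model above maps onto a small dataclass. A minimal sketch consistent with the diagram and with the `offsets: tuple[int, int]` line quoted later in this review; the field semantics in the comments are assumptions:

```python
from dataclasses import dataclass


@dataclass
class Token:
    """A single token produced by the tokenizer."""

    id: int                    # vocabulary id
    value: str                 # surface form of the token
    offsets: tuple[int, int]   # (start, end) character span in the source text
```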

Class Diagram for BasicSentencizerChunker in sentencizer_chunker.py

```mermaid
classDiagram
      class BasicSentencizerChunker {
        +str period_token
        +__init__(period_token: str)
        +chunk(tokens: Sequence~str~, vectors: Sequence~TensorType~) List~List~TensorType~~
        +__call__(tokens: Sequence~str~, vectors: Sequence~TensorType~) List~List~TensorType~~
      }
```

Class Diagram for Chunker Classes in simple_chunker.py

```mermaid
classDiagram
    direction LR
    class BaseModel {
      <<Pydantic / Fallback>>
      note "Base for Pydantic models or fallback"
    }
    class BaseSplitter {
      <<Abstract>>
      +__call__(doc: str) List~str~
    }
    BaseSplitter --|> BaseModel

    class Chunk {
      <<Data / Fallback>>
      +str content
      +int token_count
      +bool is_triggered
      +float triggered_score
      note "Data structure for a text chunk"
    }

    class DenseEncoder {
      <<External>>
      note "From semantic_router.encoders.base"
    }

    class BaseChunker {
      +str name
      +DenseEncoder encoder
      +BaseSplitter splitter
      +__call__(docs: List~str~) List~List~Chunk~~
      #_split(doc: str) List~str~
      #_chunk(splits: List~Any~) List~Chunk~~
      +print(document_splits: List~Chunk~) void
    }
    BaseChunker --|> BaseModel
    BaseChunker ..> DenseEncoder : uses (optional)
    BaseChunker o-- BaseSplitter : uses

    class SimpleChunker {
      +int target_length
      +__init__(name: str, splitter: BaseSplitter, encoder: DenseEncoder, target_length: int)
      +__call__(docs: List~str~) List~List~Chunk~~
      #_split(doc: str) List~str~
      #_chunk(sentences: List~str~) List~Chunk~~
    }
    SimpleChunker --|> BaseChunker
```

Flow Diagram for ensure_embedding_tables Table Creation Logic

```mermaid
flowchart TD
    subgraph "ensure_embedding_tables(db) Logic"
        direction LR
        A0["Start"] --> A1["using_vec = _maybe_load_sqlite_vec(db)"];

        A1 --> A2{"'repo_embeddings' table exists?"};
        A2 -- No --> A3{using_vec?};
        A3 -- Yes --> A4["CREATE VIRTUAL TABLE repo_embeddings"];
        A3 -- No  --> A5["CREATE TABLE repo_embeddings (standard types)"];
        A4 --> A6_Next_Step_Repo_Embeddings;
        A5 --> A6_Next_Step_Repo_Embeddings;
        A2 -- Yes --> A6_Next_Step_Repo_Embeddings["Next"];

        A6_Next_Step_Repo_Embeddings --> B2{"'readme_chunk_embeddings' table exists?"};
        B2 -- No --> B3{using_vec?};
        B3 -- Yes --> B4["CREATE VIRTUAL TABLE readme_chunk_embeddings + INDEX"];
        B3 -- No  --> B5["CREATE TABLE readme_chunk_embeddings (standard types)"];
        B4 --> B6_Next_Step_Readme_Chunk_Embeddings;
        B5 --> B6_Next_Step_Readme_Chunk_Embeddings;
        B2 -- Yes --> B6_Next_Step_Readme_Chunk_Embeddings["Next"];

        B6_Next_Step_Readme_Chunk_Embeddings --> C2{"'repo_build_files' table exists?"};
        C2 -- No --> C3["CREATE TABLE repo_build_files"];
        C3 --> C4_Next_Step_Repo_Build_Files;
        C2 -- Yes --> C4_Next_Step_Repo_Build_Files["Next"];

        C4_Next_Step_Repo_Build_Files --> D2{"'repo_metadata' table exists?"};
        D2 -- No --> D3["CREATE TABLE repo_metadata"];
        D3 --> D4_End["End"];
        D2 -- Yes --> D4_End;
    end
```

File-Level Changes

Each entry below lists the change, its details, and the files touched.
Introduce embedding table setup and README chunking utility
  • add a lazy loader for the sqlite-vec extension
  • create ensure_embedding_tables to define virtual or regular embedding tables with appropriate schema and indexes
  • add chunk_readme to split README text via StatisticalChunker, falling back to blank-line splitting (sketched after this list)
  • call ensure_embedding_tables within the database initialization path
github_to_sqlite/utils.py
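
A minimal sketch of the chunk_readme shape described above, combining an optional chunker with the blank-line fallback quoted in the Copilot comment earlier. The signature, the chunker argument, and the attribute access on chunks are illustrative assumptions; only the final re.split line is taken from the diff:

```python
import re


def chunk_readme(text, chunker=None):
    """Split README text into chunks.

    `chunker` may be a StatisticalChunker-style callable; when it is missing
    or fails, fall back to splitting on blank lines.
    """
    if chunker is not None:
        try:
            # One list of chunk objects per input document; `.content` follows
            # the Chunk model used elsewhere in this PR
            return [chunk.content for doc in chunker([text]) for chunk in doc]
        except (ValueError, RuntimeError):
            pass  # narrow exceptions, as the Sourcery feedback suggests

    # Fallback quoted in the Copilot comment: split on blank lines
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]
```
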
Add migrate CLI command
  • implement migrate Click command that invokes embedding setup
  • output a confirmation message upon completion
github_to_sqlite/cli.py
Update project dependencies to support embeddings and chunking
  • add runtime dependencies for sentence-transformers, sqlite-vec, nltk, onnx, pydantic, tokenizers
  • expand extras_require for testing, semantic_chunkers, ONNX, GPU usage
setup.py
Document migration workflow and planning
  • add “Running migrations” section to README
  • introduce detailed embedding feature plan in PLAN.md
  • add AGENTS.md with development notes pointing to the plan
README.md
PLAN.md
AGENTS.md
Introduce chunker and tokenization modules
  • add SimpleChunker and BaseSplitter implementations
  • add BasicSentencizerChunker for token-vector chunking
  • add Token dataclass with tokenizer loader
  • add pydantic-based config for default model and settings
github_to_sqlite/simple_chunker.py
github_to_sqlite/sentencizer_chunker.py
github_to_sqlite/tokenization.py
github_to_sqlite/config.py
Add and update tests for new features
  • add tests for migrate command and embedding table creation
  • add tests for README chunking and both chunker implementations
  • add config and tokenizer loading tests
  • update existing tests to expect new tables in test_repos and test_starred
tests/test_migrate_command.py
tests/test_embedding_tables.py
tests/test_chunk_readme.py
tests/test_simple_chunker.py
tests/test_sentencizer_chunker.py
tests/test_config.py
tests/test_repos.py
tests/test_starred.py


sourcery-ai bot left a comment


Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:

  • Move heavy ML dependencies (sentence-transformers, sqlite-vec, nltk, onnx, pydantic, tokenizers) out of install_requires into an extras_require (e.g. "embeddings") so the base package stays lightweight.
  • Have the migrate command call ensure_db_shape (instead of only ensure_embedding_tables) to apply foreign keys, FTS and indexes as well as embedding tables in one consistent migration step.
  • Avoid broad except Exception: blocks in _maybe_load_sqlite_vec and chunk_readme; catch specific errors (ImportError, sqlite errors) so unrelated bugs aren’t silently swallowed.
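
For the last point, a sketch of what a narrower _maybe_load_sqlite_vec could look like; the current body isn't shown in this review, so the logic below is illustrative, though sqlite_vec.load() is the documented loader in the sqlite-vec Python package:

```python
import sqlite3


def _maybe_load_sqlite_vec(db):
    "Try to load the sqlite-vec extension; return True only when it is usable."
    try:
        import sqlite_vec  # optional dependency
    except ImportError:
        return False
    try:
        db.conn.enable_load_extension(True)
        sqlite_vec.load(db.conn)
        db.conn.enable_load_extension(False)
    except (AttributeError, sqlite3.OperationalError):
        # extension loading disabled, or this Python/SQLite build lacks support
        return False
    return True
```
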
Here's what I looked at during the review
  • 🟡 General issues: 3 issues found
  • 🟢 Security: all looks good
  • 🟡 Testing: 6 issues found
  • 🟡 Complexity: 3 issues found
  • 🟢 Documentation: all looks good


```python
    ],
    "sentence-transformers": ["sentence-transformers[onnx]"],
    "gpu": ["sentence-transformers[onnx-gpu]"],
    "semantic-transformers": [
```


suggestion: Remove duplicate semantic-chunkers extras

Both extras reference the same package URL; consider merging them to avoid redundancy.

```python
    offsets: tuple[int, int]


load_tokenizer = partial(Tokenizer.from_pretrained, config.default_model)
```


suggestion (performance): Cache the tokenizer instance to improve performance

Repeatedly calling Tokenizer.from_pretrained is costly; load it once and reuse the instance to improve efficiency.
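
A minimal sketch of the caching suggestion, keeping the tokenization.py names but wrapping the load in functools.lru_cache so the model files are only fetched and parsed once:

```python
from functools import lru_cache

from tokenizers import Tokenizer

from github_to_sqlite.config import config


@lru_cache(maxsize=1)
def load_tokenizer():
    "Load the default tokenizer once and reuse the instance on later calls."
    return Tokenizer.from_pretrained(config.default_model)
```

Callers keep calling load_tokenizer(); only the first call pays the loading cost.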

```python
from github_to_sqlite import utils


def test_embedding_tables_created():
```


suggestion (testing): Enhance embedding table tests to cover sqlite-vec variations, schema differences, and foreign key scenarios.

Add tests for utils.ensure_embedding_tables to cover: (1) behavior with and without sqlite-vec (mocking _maybe_load_sqlite_vec), (2) correct handling of foreign keys depending on the presence of the repos table, and (3) idempotency when called multiple times.
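
A sketch of how some of those cases could look as pytest tests; the monkeypatched helper name follows _maybe_load_sqlite_vec as referenced elsewhere in this review, and the assertions are illustrative:

```python
import sqlite_utils

from github_to_sqlite import utils


def test_embedding_tables_without_sqlite_vec(monkeypatch):
    # Force the non-vec fallback path regardless of what is installed locally
    monkeypatch.setattr(utils, "_maybe_load_sqlite_vec", lambda db: False)
    db = sqlite_utils.Database(memory=True)
    utils.ensure_embedding_tables(db)
    assert {"repo_embeddings", "readme_chunk_embeddings"} <= set(db.table_names())


def test_ensure_embedding_tables_is_idempotent(monkeypatch):
    monkeypatch.setattr(utils, "_maybe_load_sqlite_vec", lambda db: False)
    db = sqlite_utils.Database(memory=True)
    utils.ensure_embedding_tables(db)
    before = set(db.table_names())
    utils.ensure_embedding_tables(db)  # second call must not raise or duplicate
    assert set(db.table_names()) == before
```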

```python
    vecs = [np.array([1]), np.array([2]), np.array([3])]
    chunker = BasicSentencizerChunker()
    chunks = chunker(tokens, vecs)
    assert len(chunks) == 1
```


suggestion (testing): Add tests for ValueError on mismatched lengths and edge cases like empty or no-delimiter inputs in BasicSentencizerChunker.

Please add tests for: (1) mismatched tokens and vectors lengths (should raise ValueError), (2) empty input lists (should return empty chunks), and (3) no period_token in tokens (should return empty chunks).
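
A sketch of those three cases; the constructor and call signature follow the existing test quoted above, while the expected behaviours are the ones this suggestion asks for rather than guarantees about the current code:

```python
import numpy as np
import pytest

from github_to_sqlite.sentencizer_chunker import BasicSentencizerChunker


def test_mismatched_lengths_raise():
    chunker = BasicSentencizerChunker()
    with pytest.raises(ValueError):
        chunker(["a", "."], [np.array([1])])  # two tokens, one vector


def test_empty_input_returns_no_chunks():
    chunker = BasicSentencizerChunker()
    assert chunker([], []) == []


def test_no_period_token_returns_no_chunks():
    chunker = BasicSentencizerChunker()
    tokens = ["a", "b", "c"]
    vecs = [np.array([1]), np.array([2]), np.array([3])]
    assert chunker(tokens, vecs) == []
```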

```python
from github_to_sqlite import utils


def test_chunk_readme_fallback():
```


suggestion (testing): Add edge case tests for the chunk_readme fallback mechanism.

Please add tests for the following cases: (1) empty input string, (2) string with no blank lines, and (3) string with only blank lines or whitespace, to fully cover the fallback logic.
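
A sketch of the requested edge cases, assuming the blank-line fallback path is the one exercised (for example when the optional chunker is not installed); the expected values follow from the re.split fallback quoted earlier and are assumptions about intended behaviour:

```python
import pytest

from github_to_sqlite import utils


@pytest.mark.parametrize(
    "text,expected",
    [
        ("", []),                                        # empty input
        ("one paragraph only", ["one paragraph only"]),  # no blank lines
        ("\n\n   \n\n", []),                             # only blank lines / whitespace
    ],
)
def test_chunk_readme_fallback_edge_cases(text, expected):
    assert utils.chunk_readme(text) == expected
```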


```python
def test_simple_chunker_drops_partial(tmp_path):
    text = "Sentence one. Sentence two. Sentence three. Sentence four. Sentence five. Sentence six. Extra."  # 7 sentences
    chunker = SimpleChunker(
```


issue: SimpleChunker's _split method appears to ignore the splitter argument provided during initialization.

The _split method uses nltk.sent_tokenize instead of the provided splitter, which can cause confusion in both the class interface and the test. Please clarify whether SimpleChunker should always use sent_tokenize (and remove the splitter parameter if so), or update _split to use self.splitter if custom splitters are intended to be supported. Adjust the test accordingly to match the intended design.
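
If custom splitters are meant to be supported, a minimal sketch of the second option is below: delegate to self.splitter and keep NLTK's sent_tokenize only as a default. The subclass name is hypothetical and the body is illustrative, not the PR's current code:

```python
from typing import List

from nltk.tokenize import sent_tokenize

from github_to_sqlite.simple_chunker import BaseChunker


class SplitterAwareSimpleChunker(BaseChunker):
    """Variant that honours a caller-supplied splitter."""

    def _split(self, doc: str) -> List[str]:
        # Use the configured splitter when one was provided; otherwise fall
        # back to NLTK sentence tokenization, as the current code does.
        if self.splitter is not None:
            return self.splitter(doc)
        return sent_tokenize(doc)
```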

```python
    return _SQLITE_VEC_LOADED


def ensure_embedding_tables(db):
```


issue (complexity): Consider refactoring the repeated table creation logic into a single loop driven by a declarative table specification.

It’s hard to maintain four almost‐identical `if using_vec … else …` blocks. You can drive table creation from a small declarative spec and a single loop. For example:

```python
# at module top
_EMBED_TABLE_SPECS = [
    {
        "name": "repo_embeddings",
        "vec_cols": [
            "repo_id int primary key",
            "title_embedding float[768]",
            "description_embedding float[768]",
            "readme_embedding float[768]",
        ],
        "py_cols": {
            "repo_id": int,
            "title_embedding": bytes,
            "description_embedding": bytes,
            "readme_embedding": bytes,
        },
        "pk": "repo_id",
        "index": None,
    },
    {
        "name": "readme_chunk_embeddings",
        "vec_cols": [
            "repo_id int",
            "chunk_index int",
            "chunk_text text",
            "embedding float[768]",
        ],
        "py_cols": {
            "repo_id": int,
            "chunk_index": int,
            "chunk_text": str,
            "embedding": bytes,
        },
        "pk": ("repo_id", "chunk_index"),
        # one extra index after create
        "index": "CREATE INDEX IF NOT EXISTS readme_chunk_idx "
                 "ON readme_chunk_embeddings(repo_id, chunk_index)",
    },
    # add the non‐vec tables there too, e.g. repo_build_files, repo_metadata
]
def ensure_embedding_tables(db):
    using_vec = _maybe_load_sqlite_vec(db)
    existing = set(db.table_names())
    repos_fk = [("repo_id", "repos", "id")] if "repos" in existing else []

    for spec in _EMBED_TABLE_SPECS:
        name = spec["name"]
        if name in existing:
            continue

        if using_vec:
            cols_sql = ",\n    ".join(spec["vec_cols"])
            sql = f"CREATE VIRTUAL TABLE {name} USING vec0(\n    {cols_sql}\n)"
            db.execute(sql)
        else:
            db[name].create(
                spec["py_cols"],
                pk=spec["pk"],
                foreign_keys=repos_fk,
            )

        if spec.get("index"):
            db.execute(spec["index"])
```

This:

  • Unifies vec vs non-vec logic in one loop
  • Removes repeated branching & foreign-key checks
  • Keeps full functionality
  • Makes adding/removing tables trivial (just edit _EMBED_TABLE_SPECS)

```python
extras_require={
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    "semantic_chunkers": [
        "semantic-chunkers @ https://github.com/aurelio-labs/semantic-chunkers/archive/refs/tags/v0.1.1.tar.gz"
```


issue (complexity): Consider defining the repeated semantic-chunkers dependency as a constant and reusing it in extras_require to avoid duplication.

You can collapse the duplicated "semantic-chunkers @ …" spec into a single constant and then reuse it in both extras. This keeps the semantics the same while removing the duplication from setup.py:

```python
# setup.py

SEMANTIC_CHUNKERS = [
    "semantic-chunkers @ "
    "https://github.com/aurelio-labs/semantic-chunkers/"
    "archive/refs/tags/v0.1.1.tar.gz"
]

extras_require = {
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    "semantic-chunkers": SEMANTIC_CHUNKERS,
    "semantic-transformers": SEMANTIC_CHUNKERS,
    "sentence-transformers": ["sentence-transformers[onnx]"],
    "gpu": ["sentence-transformers[onnx-gpu]"],
}
```

If you’d rather generate it in‐place, you can also do:

```python
# setup.py

SEMANTIC_URL = (
    "semantic-chunkers @ "
    "https://github.com/aurelio-labs/semantic-chunkers/"
    "archive/refs/tags/v0.1.1.tar.gz"
)

extras_require = {
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    **{k: [SEMANTIC_URL] for k in ("semantic-chunkers", "semantic-transformers")},
    "sentence-transformers": ["sentence-transformers[onnx]"],
    "gpu": ["sentence-transformers[onnx-gpu]"],
}
```

Either approach preserves all functionality but removes the duplicated literal.


```python
from .config import config

try:
```


issue (complexity): Consider replacing the Pydantic/dataclass fallback logic with plain dataclasses and minimal post-init hooks for clarity and simplicity.

You can collapse most of the Pydantic/dataclass‐fallback boilerplate into plain dataclasses with a small post-init hook and an optional Color helper.  This keeps all functionality, removes `Config`, `validator` and fallback stubs, and is much easier to read:

```python
# models.py
from typing import Any, List, Optional
from dataclasses import dataclass, field

# optional colorama
try:
    from colorama import Fore, Style
except ImportError:
    class _NoColor:
        RESET_ALL = ""

        def __getattr__(self, name):
            # Any colour attribute resolves to "" when colorama is missing
            return ""

    Fore = Style = _NoColor()

@dataclass
class Chunk:
    content: str
    token_count: int
    is_triggered: bool = False
    triggered_score: float = 0.0

@dataclass
class BaseSplitter:
    def __call__(self, doc: str) -> List[str]:
        raise NotImplementedError
# chunkers.py
from dataclasses import dataclass, field  # needed for the @dataclass BaseChunker below
from typing import List

from .models import Chunk, BaseSplitter, Fore, Style
from .config import config

# assume DenseEncoder is always available; if not, raise at import-time
from semantic_router.encoders.base import DenseEncoder

@dataclass
class BaseChunker:
    name: str
    splitter: BaseSplitter
    encoder: DenseEncoder = field(default_factory=lambda: DenseEncoder(name="default"))

    def __call__(self, docs: List[str]) -> List[List[Chunk]]:
        return [self._chunk(self.splitter(doc)) for doc in docs]

    def _chunk(self, splits: List[str]) -> List[Chunk]:
        raise NotImplementedError

    def print(self, document_splits: List[Chunk]) -> None:
        colors = [Fore.RED, Fore.GREEN, Fore.BLUE, Fore.MAGENTA]
        for i, split in enumerate(document_splits):
            c = colors[i % len(colors)]
            status = (
                f"{split.triggered_score:.2f}" if split.is_triggered
                else "final split" if i == len(document_splits)-1
                else "token limit"
            )
            print(f"Split {i+1}, tokens {split.token_count}, triggered by: {status}")
            print(f"{c}{split.content}{Style.RESET_ALL}")
            print("-"*88, "\n")
# simple_chunker.py
from nltk.tokenize import sent_tokenize
from dataclasses import dataclass
from typing import List  # List is used in the annotations below
from .models import Chunk, BaseSplitter
from .chunkers import BaseChunker
from .config import config
import nltk

@dataclass
class SimpleChunker(BaseChunker):
    target_length: int = config.max_length

    def __call__(self, docs: List[str]) -> List[List[Chunk]]:
        return super().__call__(docs)

    def _split(self, doc: str) -> List[str]:
        try:
            return sent_tokenize(doc)
        except LookupError:
            for res in ("punkt", "punkt_tab"):
                nltk.download(res)
                try:
                    return sent_tokenize(doc)
                except LookupError:
                    continue
            raise

    def _chunk(self, sentences: List[str]) -> List[Chunk]:
        chunks: List[Chunk] = []
        for i in range(0, len(sentences), self.target_length):
            piece = sentences[i : i + self.target_length]
            if len(piece) < self.target_length:
                break
            chunks.append(
                Chunk(content=" ".join(piece), token_count=len(piece))
            )
        return chunks
```

Steps:

  1. Drop all try/except Pydantic‐vs‐dataclass fallback; pick dataclasses for everything.
  2. Remove Config inner classes and @validator hacks.
  3. Use __post_init__ or field(default_factory=…) to inject defaults (e.g. encoder).
  4. Keep the colorama import small and mute on failure.

This preserves full behavior while cutting ~150 LOC of boilerplate.

Comment on lines +55 to +57
```python
if v is None:
    return DenseEncoder(name="default")
return v
```


suggestion (code-quality): We've found these issues:

Suggested change:

```diff
-if v is None:
-    return DenseEncoder(name="default")
-return v
+return DenseEncoder(name="default") if v is None else v
```
