Conversation
Pull Request Overview
This PR adds a new migrate CLI command to set up embedding-related tables, updates dependencies, and documents the migration workflow.
- Introduce `migrate` command to create embedding tables (`repo_embeddings`, `readme_chunk_embeddings`, `repo_build_files`, `repo_metadata`); see the usage sketch after this list
- Update `utils.ensure_embedding_tables` and wire it into `ensure_db_shape`
- Add tests for migration, embedding tables, and chunking utilities; update README and PLAN to document the new workflow
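For reference, a minimal sketch of driving the new command through click's test runner; it assumes `migrate` takes the database path as its only argument and that `cli` is the click group in `github_to_sqlite/cli.py`:

```python
from click.testing import CliRunner

from github_to_sqlite import cli


def test_migrate_creates_embedding_tables(tmp_path):
    runner = CliRunner()
    db_path = str(tmp_path / "github.db")
    # Invoke the new subcommand exactly as a user would on the CLI
    result = runner.invoke(cli.cli, ["migrate", db_path])
    assert result.exit_code == 0
```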
Reviewed Changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_starred.py | Expect new embedding-related tables in schema |
| tests/test_simple_chunker.py | Add unit tests for SimpleChunker |
| tests/test_sentencizer_chunker.py | Add tests for BasicSentencizerChunker |
| tests/test_repos.py | Expect new embedding tables in repo tests |
| tests/test_migrate_command.py | Test migrate CLI creates tables |
| tests/test_embedding_tables.py | Test ensure_db_shape creates embedding tables |
| tests/test_config.py | Test tokenizer config loading |
| tests/test_chunk_readme.py | Add tests for chunk_readme fallback and plugin |
| setup.py | Expand install_requires and extras_require |
| github_to_sqlite/utils.py | Implement ensure_embedding_tables and chunk_readme |
| github_to_sqlite/tokenization.py | Define load_tokenizer via functools.partial |
| github_to_sqlite/simple_chunker.py | Add SimpleChunker with sentence splitting |
| github_to_sqlite/sentencizer_chunker.py | Add BasicSentencizerChunker |
| github_to_sqlite/config.py | Introduce Config model for defaults |
| github_to_sqlite/cli.py | Add migrate command |
| README.md | Document running migrations |
| PLAN.md | Mark embedding-plan items as complete |
| AGENTS.md | Link to PLAN.md |
Comments suppressed due to low confidence (1)
github_to_sqlite/utils.py:940

The function uses `re.split` but `re` is not imported; add `import re` at the top of the file to avoid a NameError.

```python
return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]
```
```python
extras_require={
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    "semantic_chunkers": [
        "semantic-chunkers @ https://github.com/aurelio-labs/semantic-chunkers/archive/refs/tags/v0.1.1.tar.gz"
```
[nitpick] The same GitHub URL appears under both semantic_chunkers and semantic-transformers extras; consider consolidating or renaming one key to avoid duplication.
Reviewer's Guide

Introduces a migration CLI to initialize embedding storage tables via a new utility that conditionally loads the sqlite-vec extension and creates the necessary tables, adds functions to chunk README content, extends dependencies for embedding and chunking, documents the migration workflow, and supplies tests for all added features.

File-level changes:

| Change | Files |
|---|---|
| Introduce embedding table setup and README chunking utility | github_to_sqlite/utils.py |
| Add migrate CLI command | github_to_sqlite/cli.py |
| Update project dependencies to support embeddings and chunking | setup.py |
| Document migration workflow and planning | README.md, PLAN.md, AGENTS.md |
| Introduce chunker and tokenization modules | github_to_sqlite/simple_chunker.py, github_to_sqlite/sentencizer_chunker.py, github_to_sqlite/tokenization.py, github_to_sqlite/config.py |
| Add and update tests for new features | tests/test_migrate_command.py, tests/test_embedding_tables.py, tests/test_chunk_readme.py, tests/test_simple_chunker.py, tests/test_sentencizer_chunker.py, tests/test_config.py, tests/test_repos.py, tests/test_starred.py |
Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:
- Move heavy ML dependencies (sentence-transformers, sqlite-vec, nltk, onnx, pydantic, tokenizers) out of install_requires into an extras_require (e.g. "embeddings") so the base package stays lightweight.
- Have the `migrate` command call `ensure_db_shape` (instead of only `ensure_embedding_tables`) to apply foreign keys, FTS and indexes as well as embedding tables in one consistent migration step.
- Avoid broad `except Exception:` blocks in `_maybe_load_sqlite_vec` and `chunk_readme`; catch specific errors (ImportError, sqlite errors) so unrelated bugs aren't silently swallowed (see the sketch after this list).
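A minimal sketch of that narrowing, assuming `_maybe_load_sqlite_vec` loads the extension via the `sqlite-vec` package roughly as described (the body is illustrative, not the PR's exact code):

```python
import sqlite3


def _maybe_load_sqlite_vec(db):
    # A missing package is an expected, recoverable condition.
    try:
        import sqlite_vec
    except ImportError:
        return False
    # Loading can fail on SQLite builds without loadable-extension
    # support; treat only those failures as "no vec available".
    try:
        db.conn.enable_load_extension(True)
        sqlite_vec.load(db.conn)
        db.conn.enable_load_extension(False)
        return True
    except (AttributeError, sqlite3.OperationalError):
        return False
```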
Here's what I looked at during the review
- 🟡 General issues: 3 issues found
- 🟢 Security: all looks good
- 🟡 Testing: 6 issues found
- 🟡 Complexity: 3 issues found
- 🟢 Documentation: all looks good
```python
    ],
    "sentence-transformers": ["sentence-transformers[onnx]"],
    "gpu": ["sentence-transformers[onnx-gpu]"],
    "semantic-transformers": [
```
suggestion: Remove duplicate semantic-chunkers extras
Both extras reference the same package URL; consider merging them to avoid redundancy.
```python
offsets: tuple[int, int]
```

```python
load_tokenizer = partial(Tokenizer.from_pretrained, config.default_model)
```
suggestion (performance): Cache the tokenizer instance to improve performance
Repeatedly calling Tokenizer.from_pretrained is costly; load it once and reuse the instance to improve efficiency.
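One way to do that, sketched under the assumption that `config.default_model` and the `tokenizers` package are used as in this PR's `tokenization.py`, is to replace the `partial` with an `lru_cache`-wrapped loader:

```python
from functools import lru_cache

from tokenizers import Tokenizer

from .config import config


@lru_cache(maxsize=None)
def load_tokenizer(model: str = config.default_model) -> Tokenizer:
    # from_pretrained downloads and parses tokenizer.json; caching means
    # repeat callers share a single instance per model name.
    return Tokenizer.from_pretrained(model)
```

Callers keep the same `load_tokenizer()` call shape, but only the first call pays the load cost.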
```python
from github_to_sqlite import utils


def test_embedding_tables_created():
```
suggestion (testing): Enhance embedding table tests to cover sqlite-vec variations, schema differences, and foreign key scenarios.
Add tests for utils.ensure_embedding_tables to cover: (1) behavior with and without sqlite-vec (mocking _maybe_load_sqlite_vec), (2) correct handling of foreign keys depending on the presence of the repos table, and (3) idempotency when called multiple times.
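A sketch of the idempotency and fallback cases, assuming `_maybe_load_sqlite_vec` is a module-level name in `utils` that can be monkeypatched (the table name comes from the PR's list above):

```python
import sqlite_utils

from github_to_sqlite import utils


def test_embedding_tables_idempotent(tmp_path):
    db = sqlite_utils.Database(tmp_path / "test.db")
    utils.ensure_embedding_tables(db)
    first = set(db.table_names())
    utils.ensure_embedding_tables(db)  # second call must not error or duplicate
    assert set(db.table_names()) == first


def test_embedding_tables_without_sqlite_vec(tmp_path, monkeypatch):
    # Force the plain-table fallback path regardless of what is installed
    monkeypatch.setattr(utils, "_maybe_load_sqlite_vec", lambda db: False)
    db = sqlite_utils.Database(tmp_path / "test.db")
    utils.ensure_embedding_tables(db)
    assert "repo_embeddings" in db.table_names()
```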
```python
vecs = [np.array([1]), np.array([2]), np.array([3])]
chunker = BasicSentencizerChunker()
chunks = chunker(tokens, vecs)
assert len(chunks) == 1
```
suggestion (testing): Add tests for ValueError on mismatched lengths and edge cases like empty or no-delimiter inputs in BasicSentencizerChunker.
Please add tests for: (1) mismatched tokens and vectors lengths (should raise ValueError), (2) empty input lists (should return empty chunks), and (3) no period_token in tokens (should return empty chunks).
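A sketch of those cases, assuming the constructor and call shape shown in the quoted test, and that "empty chunks" means an empty list:

```python
import numpy as np
import pytest

from github_to_sqlite.sentencizer_chunker import BasicSentencizerChunker


def test_mismatched_lengths_raise():
    chunker = BasicSentencizerChunker()
    with pytest.raises(ValueError):
        chunker(["a", ".", "b"], [np.array([1])])  # 3 tokens, 1 vector


def test_empty_inputs_yield_no_chunks():
    assert BasicSentencizerChunker()([], []) == []


def test_no_period_token_yields_no_chunks():
    vecs = [np.array([1]), np.array([2])]
    assert BasicSentencizerChunker()(["a", "b"], vecs) == []
```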
```python
from github_to_sqlite import utils


def test_chunk_readme_fallback():
```
suggestion (testing): Add edge case tests for the chunk_readme fallback mechanism.
Please add tests for the following cases: (1) empty input string, (2) string with no blank lines, and (3) string with only blank lines or whitespace, to fully cover the fallback logic.
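A sketch of those cases, assuming the regex fallback quoted in the Copilot comment above is the path taken (i.e. no chunking plugin installed):

```python
from github_to_sqlite import utils


def test_chunk_readme_empty_string():
    assert utils.chunk_readme("") == []


def test_chunk_readme_no_blank_lines():
    # A single paragraph with no blank lines comes back as one chunk
    assert utils.chunk_readme("line one\nline two") == ["line one\nline two"]


def test_chunk_readme_only_whitespace():
    assert utils.chunk_readme("\n\n   \n\n") == []
```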
```python
def test_simple_chunker_drops_partial(tmp_path):
    text = "Sentence one. Sentence two. Sentence three. Sentence four. Sentence five. Sentence six. Extra."  # 7 sentences
    chunker = SimpleChunker(
```
issue: SimpleChunker's `_split` method appears to ignore the `splitter` argument provided during initialization.
The `_split` method uses `nltk.sent_tokenize` instead of the provided `splitter`, which can cause confusion in both the class interface and the test. Please clarify whether SimpleChunker should always use `sent_tokenize` (and remove the `splitter` parameter if so), or update `_split` to use `self.splitter` if custom splitters are intended to be supported. Adjust the test accordingly to match the intended design.
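If custom splitters are meant to be supported, a minimal sketch of the delegating variant (the import path for `BaseChunker` is hypothetical, and treating `splitter` as optional is an assumption, not the PR's current interface):

```python
from nltk.tokenize import sent_tokenize

from github_to_sqlite.chunkers import BaseChunker  # hypothetical import path


class SimpleChunker(BaseChunker):
    def _split(self, doc: str) -> list[str]:
        # Honour a custom splitter injected at construction time;
        # otherwise fall back to NLTK's sentence tokenizer.
        if self.splitter is not None:
            return self.splitter(doc)
        return sent_tokenize(doc)
```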
```python
    return _SQLITE_VEC_LOADED


def ensure_embedding_tables(db):
```
issue (complexity): Consider refactoring the repeated table creation logic into a single loop driven by a declarative table specification.
It’s hard to maintain four almost‐identical `if using_vec … else …` blocks. You can drive table creation from a small declarative spec and a single loop. For example:
```python
# at module top
_EMBED_TABLE_SPECS = [
    {
        "name": "repo_embeddings",
        "vec_cols": [
            "repo_id int primary key",
            "title_embedding float[768]",
            "description_embedding float[768]",
            "readme_embedding float[768]",
        ],
        "py_cols": {
            "repo_id": int,
            "title_embedding": bytes,
            "description_embedding": bytes,
            "readme_embedding": bytes,
        },
        "pk": "repo_id",
        "index": None,
    },
    {
        "name": "readme_chunk_embeddings",
        "vec_cols": [
            "repo_id int",
            "chunk_index int",
            "chunk_text text",
            "embedding float[768]",
        ],
        "py_cols": {
            "repo_id": int,
            "chunk_index": int,
            "chunk_text": str,
            "embedding": bytes,
        },
        "pk": ("repo_id", "chunk_index"),
        # one extra index after create
        "index": "CREATE INDEX IF NOT EXISTS readme_chunk_idx "
        "ON readme_chunk_embeddings(repo_id, chunk_index)",
    },
    # add the non-vec tables here too, e.g. repo_build_files, repo_metadata
]


def ensure_embedding_tables(db):
    using_vec = _maybe_load_sqlite_vec(db)
    existing = set(db.table_names())
    repos_fk = [("repo_id", "repos", "id")] if "repos" in existing else []
    for spec in _EMBED_TABLE_SPECS:
        name = spec["name"]
        if name in existing:
            continue
        if using_vec:
            cols_sql = ",\n    ".join(spec["vec_cols"])
            sql = f"CREATE VIRTUAL TABLE {name} USING vec0(\n    {cols_sql}\n)"
            db.execute(sql)
        else:
            db[name].create(
                spec["py_cols"],
                pk=spec["pk"],
                foreign_keys=repos_fk,
            )
        if spec.get("index"):
            db.execute(spec["index"])
```

This:
- Unifies vec vs non-vec logic in one loop
- Removes repeated branching and foreign-key checks
- Keeps full functionality
- Makes adding or removing tables trivial (just edit `_EMBED_TABLE_SPECS`)
```python
extras_require={
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    "semantic_chunkers": [
        "semantic-chunkers @ https://github.com/aurelio-labs/semantic-chunkers/archive/refs/tags/v0.1.1.tar.gz"
```
issue (complexity): Consider defining the repeated semantic-chunkers dependency as a constant and reusing it in extras_require to avoid duplication.
You can collapse the duplicated "semantic-chunkers @ …" spec into a single constant and then reuse it in both extras. This keeps the semantics the same, but DRYs up the setup.py:

```python
# setup.py
SEMANTIC_CHUNKERS = [
    "semantic-chunkers @ "
    "https://github.com/aurelio-labs/semantic-chunkers/"
    "archive/refs/tags/v0.1.1.tar.gz"
]

extras_require = {
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    "semantic-chunkers": SEMANTIC_CHUNKERS,
    "semantic-transformers": SEMANTIC_CHUNKERS,
    "sentence-transformers": ["sentence-transformers[onnx]"],
    "gpu": ["sentence-transformers[onnx-gpu]"],
}
```

If you'd rather generate it in place, you can also do:

```python
# setup.py
SEMANTIC_URL = (
    "semantic-chunkers @ "
    "https://github.com/aurelio-labs/semantic-chunkers/"
    "archive/refs/tags/v0.1.1.tar.gz"
)

extras_require = {
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    **{k: [SEMANTIC_URL] for k in ("semantic-chunkers", "semantic-transformers")},
    "sentence-transformers": ["sentence-transformers[onnx]"],
    "gpu": ["sentence-transformers[onnx-gpu]"],
}
```

Either approach preserves all functionality but removes the duplicated literal.
```python
from .config import config

try:
```
issue (complexity): Consider replacing the Pydantic/dataclass fallback logic with plain dataclasses and minimal post-init hooks for clarity and simplicity.
You can collapse most of the Pydantic/dataclass-fallback boilerplate into plain dataclasses with a small post-init hook and an optional Color helper. This keeps all functionality, removes `Config`, `validator` and fallback stubs, and is much easier to read:

```python
# models.py
from dataclasses import dataclass
from typing import List

# optional colorama
try:
    from colorama import Fore, Style
except ImportError:
    class _NoColor:
        # Any colour attribute (RED, RESET_ALL, ...) resolves to ""
        def __getattr__(self, name: str) -> str:
            return ""

    Fore = Style = _NoColor()


@dataclass
class Chunk:
    content: str
    token_count: int
    is_triggered: bool = False
    triggered_score: float = 0.0


@dataclass
class BaseSplitter:
    def __call__(self, doc: str) -> List[str]:
        raise NotImplementedError
```

```python
# chunkers.py
from dataclasses import dataclass, field
from typing import List

# assume DenseEncoder is always available; if not, raise at import time
from semantic_router.encoders.base import DenseEncoder

from .models import Chunk, BaseSplitter, Fore, Style


@dataclass
class BaseChunker:
    name: str
    splitter: BaseSplitter
    encoder: DenseEncoder = field(default_factory=lambda: DenseEncoder(name="default"))

    def __call__(self, docs: List[str]) -> List[List[Chunk]]:
        return [self._chunk(self.splitter(doc)) for doc in docs]

    def _chunk(self, splits: List[str]) -> List[Chunk]:
        raise NotImplementedError

    def print(self, document_splits: List[Chunk]) -> None:
        colors = [Fore.RED, Fore.GREEN, Fore.BLUE, Fore.MAGENTA]
        for i, split in enumerate(document_splits):
            c = colors[i % len(colors)]
            status = (
                f"{split.triggered_score:.2f}" if split.is_triggered
                else "final split" if i == len(document_splits) - 1
                else "token limit"
            )
            print(f"Split {i + 1}, tokens {split.token_count}, triggered by: {status}")
            print(f"{c}{split.content}{Style.RESET_ALL}")
            print("-" * 88, "\n")
```

```python
# simple_chunker.py
from dataclasses import dataclass
from typing import List

import nltk
from nltk.tokenize import sent_tokenize

from .chunkers import BaseChunker
from .config import config
from .models import Chunk


@dataclass
class SimpleChunker(BaseChunker):
    target_length: int = config.max_length

    def __call__(self, docs: List[str]) -> List[List[Chunk]]:
        return super().__call__(docs)

    def _split(self, doc: str) -> List[str]:
        try:
            return sent_tokenize(doc)
        except LookupError:
            for res in ("punkt", "punkt_tab"):
                nltk.download(res)
                try:
                    return sent_tokenize(doc)
                except LookupError:
                    continue
            raise

    def _chunk(self, sentences: List[str]) -> List[Chunk]:
        chunks: List[Chunk] = []
        for i in range(0, len(sentences), self.target_length):
            piece = sentences[i : i + self.target_length]
            if len(piece) < self.target_length:
                break
            chunks.append(Chunk(content=" ".join(piece), token_count=len(piece)))
        return chunks
```

Steps:
- Drop all the `try/except` Pydantic-vs-dataclass fallback; pick dataclasses for everything.
- Remove the `Config` inner classes and `@validator` hacks.
- Use `__post_init__` or `field(default_factory=…)` to inject defaults (e.g. `encoder`).
- Keep the colorama import small and mute on failure.
This preserves full behavior while cutting ~150 LOC of boilerplate.
```python
if v is None:
    return DenseEncoder(name="default")
return v
```
suggestion (code-quality): We've found these issues:
- Lift code into else after jump in control flow (`reintroduce-else`)
- Replace if statement with if expression (`assign-if-exp`)

Suggested change:

```diff
-if v is None:
-    return DenseEncoder(name="default")
-return v
+return DenseEncoder(name="default") if v is None else v
```
Summary
- `migrate` command to create embedding tables

Testing
- `pytest -q`
- `pytest --cov=github_to_sqlite -q`

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28