Conversation
Pull Request Overview
This PR adds a new migrate CLI command to set up embedding-related tables, updates dependencies, and documents the migration workflow.
- Introduce `migrate` command to create embedding tables (`repo_embeddings`, `readme_chunk_embeddings`, `repo_build_files`, `repo_metadata`); see the usage sketch after this list
- Update `utils.ensure_embedding_tables` and wire it into `ensure_db_shape`
- Add tests for migration, embedding tables, and chunking utilities; update README and PLAN to document the new workflow
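For reference, a minimal sketch of driving the new command through click's test runner; it assumes `migrate` takes the database path as its only argument and that `cli` is the click group in `github_to_sqlite/cli.py`:

```python
from click.testing import CliRunner

from github_to_sqlite import cli


def test_migrate_creates_embedding_tables(tmp_path):
    runner = CliRunner()
    db_path = str(tmp_path / "github.db")
    # Invoke the new subcommand exactly as a user would on the CLI
    result = runner.invoke(cli.cli, ["migrate", db_path])
    assert result.exit_code == 0
```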
Reviewed Changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_starred.py | Expect new embedding-related tables in schema |
| tests/test_simple_chunker.py | Add unit tests for SimpleChunker |
| tests/test_sentencizer_chunker.py | Add tests for BasicSentencizerChunker |
| tests/test_repos.py | Expect new embedding tables in repo tests |
| tests/test_migrate_command.py | Test migrate CLI creates tables |
| tests/test_embedding_tables.py | Test ensure_db_shape creates embedding tables |
| tests/test_config.py | Test tokenizer config loading |
| tests/test_chunk_readme.py | Add tests for chunk_readme fallback and plugin |
| setup.py | Expand install_requires and extras_require |
| github_to_sqlite/utils.py | Implement ensure_embedding_tables and chunk_readme |
| github_to_sqlite/tokenization.py | Define load_tokenizer via functools.partial |
| github_to_sqlite/simple_chunker.py | Add SimpleChunker with sentence splitting |
| github_to_sqlite/sentencizer_chunker.py | Add BasicSentencizerChunker |
| github_to_sqlite/config.py | Introduce Config model for defaults |
| github_to_sqlite/cli.py | Add migrate command |
| README.md | Document running migrations |
| PLAN.md | Mark embedding-plan items as complete |
| AGENTS.md | Link to PLAN.md |
Comments suppressed due to low confidence (1)
github_to_sqlite/utils.py:940

The function uses `re.split` but `re` is not imported; add `import re` at the top of the file to avoid a NameError.

```python
return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]
```
```python
extras_require={
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    "semantic_chunkers": [
        "semantic-chunkers @ https://github.com/aurelio-labs/semantic-chunkers/archive/refs/tags/v0.1.1.tar.gz"
```
[nitpick] The same GitHub URL appears under both semantic_chunkers and semantic-transformers extras; consider consolidating or renaming one key to avoid duplication.
Reviewer's Guide

Introduces a migration CLI to initialize embedding storage tables via a new utility that conditionally loads the sqlite-vec extension and creates the necessary tables, adds functions to chunk README content, extends dependencies for embedding and chunking, documents the migration workflow, and supplies tests for all added features.

File-level changes:

| Change | Files |
|---|---|
| Introduce embedding table setup and README chunking utility | github_to_sqlite/utils.py |
| Add migrate CLI command | github_to_sqlite/cli.py |
| Update project dependencies to support embeddings and chunking | setup.py |
| Document migration workflow and planning | README.md, PLAN.md, AGENTS.md |
| Introduce chunker and tokenization modules | github_to_sqlite/simple_chunker.py, github_to_sqlite/sentencizer_chunker.py, github_to_sqlite/tokenization.py, github_to_sqlite/config.py |
| Add and update tests for new features | tests/test_migrate_command.py, tests/test_embedding_tables.py, tests/test_chunk_readme.py, tests/test_simple_chunker.py, tests/test_sentencizer_chunker.py, tests/test_config.py, tests/test_repos.py, tests/test_starred.py |
Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:
- Move heavy ML dependencies (sentence-transformers, sqlite-vec, nltk, onnx, pydantic, tokenizers) out of install_requires into an extras_require (e.g. "embeddings") so the base package stays lightweight.
- Have the `migrate` command call `ensure_db_shape` (instead of only `ensure_embedding_tables`) to apply foreign keys, FTS and indexes as well as embedding tables in one consistent migration step.
- Avoid broad `except Exception:` blocks in `_maybe_load_sqlite_vec` and `chunk_readme`; catch specific errors (ImportError, sqlite errors) so unrelated bugs aren't silently swallowed (see the sketch after this list).
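A minimal sketch of that narrowing, assuming `_maybe_load_sqlite_vec` loads the extension via the `sqlite-vec` package roughly as described (the body is illustrative, not the PR's exact code):

```python
import sqlite3


def _maybe_load_sqlite_vec(db):
    # A missing package is an expected, recoverable condition.
    try:
        import sqlite_vec
    except ImportError:
        return False
    # Loading can fail on SQLite builds without loadable-extension
    # support; treat only those failures as "no vec available".
    try:
        db.conn.enable_load_extension(True)
        sqlite_vec.load(db.conn)
        db.conn.enable_load_extension(False)
        return True
    except (AttributeError, sqlite3.OperationalError):
        return False
```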
Here's what I looked at during the review
- 🟡 General issues: 3 issues found
- 🟢 Security: all looks good
- 🟡 Testing: 6 issues found
- 🟡 Complexity: 3 issues found
- 🟢 Documentation: all looks good
```python
    ],
    "sentence-transformers": ["sentence-transformers[onnx]"],
    "gpu": ["sentence-transformers[onnx-gpu]"],
    "semantic-transformers": [
```
suggestion: Remove duplicate semantic-chunkers extras
Both extras reference the same package URL; consider merging them to avoid redundancy.
```python
offsets: tuple[int, int]
```

```python
load_tokenizer = partial(Tokenizer.from_pretrained, config.default_model)
```
suggestion (performance): Cache the tokenizer instance to improve performance
Repeatedly calling Tokenizer.from_pretrained is costly; load it once and reuse the instance to improve efficiency.
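One way to do that, sketched under the assumption that `config.default_model` and the `tokenizers` package are used as in this PR's `tokenization.py`, is to replace the `partial` with an `lru_cache`-wrapped loader:

```python
from functools import lru_cache

from tokenizers import Tokenizer

from .config import config


@lru_cache(maxsize=None)
def load_tokenizer(model: str = config.default_model) -> Tokenizer:
    # from_pretrained downloads and parses tokenizer.json; caching means
    # repeat callers share a single instance per model name.
    return Tokenizer.from_pretrained(model)
```

Callers keep the same `load_tokenizer()` call shape, but only the first call pays the load cost.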
```python
from github_to_sqlite import utils


def test_embedding_tables_created():
```
suggestion (testing): Enhance embedding table tests to cover sqlite-vec variations, schema differences, and foreign key scenarios.
Add tests for utils.ensure_embedding_tables to cover: (1) behavior with and without sqlite-vec (mocking _maybe_load_sqlite_vec), (2) correct handling of foreign keys depending on the presence of the repos table, and (3) idempotency when called multiple times.
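A sketch of the idempotency and fallback cases, assuming `_maybe_load_sqlite_vec` is a module-level name in `utils` that can be monkeypatched (the table name comes from the PR's list above):

```python
import sqlite_utils

from github_to_sqlite import utils


def test_embedding_tables_idempotent(tmp_path):
    db = sqlite_utils.Database(tmp_path / "test.db")
    utils.ensure_embedding_tables(db)
    first = set(db.table_names())
    utils.ensure_embedding_tables(db)  # second call must not error or duplicate
    assert set(db.table_names()) == first


def test_embedding_tables_without_sqlite_vec(tmp_path, monkeypatch):
    # Force the plain-table fallback path regardless of what is installed
    monkeypatch.setattr(utils, "_maybe_load_sqlite_vec", lambda db: False)
    db = sqlite_utils.Database(tmp_path / "test.db")
    utils.ensure_embedding_tables(db)
    assert "repo_embeddings" in db.table_names()
```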
```python
vecs = [np.array([1]), np.array([2]), np.array([3])]
chunker = BasicSentencizerChunker()
chunks = chunker(tokens, vecs)
assert len(chunks) == 1
```
suggestion (testing): Add tests for ValueError on mismatched lengths and edge cases like empty or no-delimiter inputs in BasicSentencizerChunker.
Please add tests for: (1) mismatched tokens and vectors lengths (should raise ValueError), (2) empty input lists (should return empty chunks), and (3) no period_token in tokens (should return empty chunks).
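A sketch of those cases, assuming the constructor and call shape shown in the quoted test, and that "empty chunks" means an empty list:

```python
import numpy as np
import pytest

from github_to_sqlite.sentencizer_chunker import BasicSentencizerChunker


def test_mismatched_lengths_raise():
    chunker = BasicSentencizerChunker()
    with pytest.raises(ValueError):
        chunker(["a", ".", "b"], [np.array([1])])  # 3 tokens, 1 vector


def test_empty_inputs_yield_no_chunks():
    assert BasicSentencizerChunker()([], []) == []


def test_no_period_token_yields_no_chunks():
    vecs = [np.array([1]), np.array([2])]
    assert BasicSentencizerChunker()(["a", "b"], vecs) == []
```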
```python
from github_to_sqlite import utils


def test_chunk_readme_fallback():
```
suggestion (testing): Add edge case tests for the chunk_readme fallback mechanism.
Please add tests for the following cases: (1) empty input string, (2) string with no blank lines, and (3) string with only blank lines or whitespace, to fully cover the fallback logic.
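A sketch of those cases, assuming the regex fallback quoted in the Copilot comment above is the path taken (i.e. no chunking plugin installed):

```python
from github_to_sqlite import utils


def test_chunk_readme_empty_string():
    assert utils.chunk_readme("") == []


def test_chunk_readme_no_blank_lines():
    # A single paragraph with no blank lines comes back as one chunk
    assert utils.chunk_readme("line one\nline two") == ["line one\nline two"]


def test_chunk_readme_only_whitespace():
    assert utils.chunk_readme("\n\n   \n\n") == []
```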
```python
def test_simple_chunker_drops_partial(tmp_path):
    text = "Sentence one. Sentence two. Sentence three. Sentence four. Sentence five. Sentence six. Extra."  # 7 sentences
    chunker = SimpleChunker(
```
issue: SimpleChunker's `_split` method appears to ignore the `splitter` argument provided during initialization.
The `_split` method uses `nltk.sent_tokenize` instead of the provided `splitter`, which can cause confusion in both the class interface and the test. Please clarify whether SimpleChunker should always use `sent_tokenize` (and remove the `splitter` parameter if so), or update `_split` to use `self.splitter` if custom splitters are intended to be supported. Adjust the test accordingly to match the intended design.
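If custom splitters are meant to be supported, a minimal sketch of the delegating variant (the import path for `BaseChunker` is hypothetical, and treating `splitter` as optional is an assumption, not the PR's current interface):

```python
from nltk.tokenize import sent_tokenize

from github_to_sqlite.chunkers import BaseChunker  # hypothetical import path


class SimpleChunker(BaseChunker):
    def _split(self, doc: str) -> list[str]:
        # Honour a custom splitter injected at construction time;
        # otherwise fall back to NLTK's sentence tokenizer.
        if self.splitter is not None:
            return self.splitter(doc)
        return sent_tokenize(doc)
```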
```python
    return _SQLITE_VEC_LOADED


def ensure_embedding_tables(db):
```
issue (complexity): Consider refactoring the repeated table creation logic into a single loop driven by a declarative table specification.
It’s hard to maintain four almost‐identical `if using_vec … else …` blocks. You can drive table creation from a small declarative spec and a single loop. For example:
```python
# at module top
_EMBED_TABLE_SPECS = [
    {
        "name": "repo_embeddings",
        "vec_cols": [
            "repo_id int primary key",
            "title_embedding float[768]",
            "description_embedding float[768]",
            "readme_embedding float[768]",
        ],
        "py_cols": {
            "repo_id": int,
            "title_embedding": bytes,
            "description_embedding": bytes,
            "readme_embedding": bytes,
        },
        "pk": "repo_id",
        "index": None,
    },
    {
        "name": "readme_chunk_embeddings",
        "vec_cols": [
            "repo_id int",
            "chunk_index int",
            "chunk_text text",
            "embedding float[768]",
        ],
        "py_cols": {
            "repo_id": int,
            "chunk_index": int,
            "chunk_text": str,
            "embedding": bytes,
        },
        "pk": ("repo_id", "chunk_index"),
        # one extra index after create
        "index": "CREATE INDEX IF NOT EXISTS readme_chunk_idx "
        "ON readme_chunk_embeddings(repo_id, chunk_index)",
    },
    # add the non-vec tables here too, e.g. repo_build_files, repo_metadata
]


def ensure_embedding_tables(db):
    using_vec = _maybe_load_sqlite_vec(db)
    existing = set(db.table_names())
    repos_fk = [("repo_id", "repos", "id")] if "repos" in existing else []
    for spec in _EMBED_TABLE_SPECS:
        name = spec["name"]
        if name in existing:
            continue
        if using_vec:
            cols_sql = ",\n    ".join(spec["vec_cols"])
            sql = f"CREATE VIRTUAL TABLE {name} USING vec0(\n    {cols_sql}\n)"
            db.execute(sql)
        else:
            db[name].create(
                spec["py_cols"],
                pk=spec["pk"],
                foreign_keys=repos_fk,
            )
        if spec.get("index"):
            db.execute(spec["index"])
```

This:
- Unifies vec vs non-vec logic in one loop
- Removes repeated branching and foreign-key checks
- Keeps full functionality
- Makes adding or removing tables trivial (just edit `_EMBED_TABLE_SPECS`)
```python
extras_require={
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    "semantic_chunkers": [
        "semantic-chunkers @ https://github.com/aurelio-labs/semantic-chunkers/archive/refs/tags/v0.1.1.tar.gz"
```
issue (complexity): Consider defining the repeated semantic-chunkers dependency as a constant and reusing it in extras_require to avoid duplication.
You can collapse the duplicated "semantic-chunkers @ …" spec into a single constant and then reuse it in both extras. This keeps the semantics the same, but DRYs up the setup.py:

```python
# setup.py
SEMANTIC_CHUNKERS = [
    "semantic-chunkers @ "
    "https://github.com/aurelio-labs/semantic-chunkers/"
    "archive/refs/tags/v0.1.1.tar.gz"
]

extras_require = {
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    "semantic-chunkers": SEMANTIC_CHUNKERS,
    "semantic-transformers": SEMANTIC_CHUNKERS,
    "sentence-transformers": ["sentence-transformers[onnx]"],
    "gpu": ["sentence-transformers[onnx-gpu]"],
}
```

If you'd rather generate it in place, you can also do:

```python
# setup.py
SEMANTIC_URL = (
    "semantic-chunkers @ "
    "https://github.com/aurelio-labs/semantic-chunkers/"
    "archive/refs/tags/v0.1.1.tar.gz"
)

extras_require = {
    "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    **{k: [SEMANTIC_URL] for k in ("semantic-chunkers", "semantic-transformers")},
    "sentence-transformers": ["sentence-transformers[onnx]"],
    "gpu": ["sentence-transformers[onnx-gpu]"],
}
```

Either approach preserves all functionality but removes the duplicated literal.
```python
from .config import config

try:
```
issue (complexity): Consider replacing the Pydantic/dataclass fallback logic with plain dataclasses and minimal post-init hooks for clarity and simplicity.
You can collapse most of the Pydantic/dataclass-fallback boilerplate into plain dataclasses with a small post-init hook and an optional Color helper. This keeps all functionality, removes `Config`, `validator` and fallback stubs, and is much easier to read:

```python
# models.py
from dataclasses import dataclass
from typing import List

# optional colorama
try:
    from colorama import Fore, Style
except ImportError:
    class _NoColor:
        # Any colour attribute (RED, RESET_ALL, ...) resolves to ""
        def __getattr__(self, name: str) -> str:
            return ""

    Fore = Style = _NoColor()


@dataclass
class Chunk:
    content: str
    token_count: int
    is_triggered: bool = False
    triggered_score: float = 0.0


@dataclass
class BaseSplitter:
    def __call__(self, doc: str) -> List[str]:
        raise NotImplementedError
```

```python
# chunkers.py
from dataclasses import dataclass, field
from typing import List

# assume DenseEncoder is always available; if not, raise at import time
from semantic_router.encoders.base import DenseEncoder

from .models import Chunk, BaseSplitter, Fore, Style


@dataclass
class BaseChunker:
    name: str
    splitter: BaseSplitter
    encoder: DenseEncoder = field(default_factory=lambda: DenseEncoder(name="default"))

    def __call__(self, docs: List[str]) -> List[List[Chunk]]:
        return [self._chunk(self.splitter(doc)) for doc in docs]

    def _chunk(self, splits: List[str]) -> List[Chunk]:
        raise NotImplementedError

    def print(self, document_splits: List[Chunk]) -> None:
        colors = [Fore.RED, Fore.GREEN, Fore.BLUE, Fore.MAGENTA]
        for i, split in enumerate(document_splits):
            c = colors[i % len(colors)]
            status = (
                f"{split.triggered_score:.2f}" if split.is_triggered
                else "final split" if i == len(document_splits) - 1
                else "token limit"
            )
            print(f"Split {i + 1}, tokens {split.token_count}, triggered by: {status}")
            print(f"{c}{split.content}{Style.RESET_ALL}")
            print("-" * 88, "\n")
```

```python
# simple_chunker.py
from dataclasses import dataclass
from typing import List

import nltk
from nltk.tokenize import sent_tokenize

from .chunkers import BaseChunker
from .config import config
from .models import Chunk


@dataclass
class SimpleChunker(BaseChunker):
    target_length: int = config.max_length

    def __call__(self, docs: List[str]) -> List[List[Chunk]]:
        return super().__call__(docs)

    def _split(self, doc: str) -> List[str]:
        try:
            return sent_tokenize(doc)
        except LookupError:
            for res in ("punkt", "punkt_tab"):
                nltk.download(res)
                try:
                    return sent_tokenize(doc)
                except LookupError:
                    continue
            raise

    def _chunk(self, sentences: List[str]) -> List[Chunk]:
        chunks: List[Chunk] = []
        for i in range(0, len(sentences), self.target_length):
            piece = sentences[i : i + self.target_length]
            if len(piece) < self.target_length:
                break
            chunks.append(Chunk(content=" ".join(piece), token_count=len(piece)))
        return chunks
```

Steps:
- Drop all the `try/except` Pydantic-vs-dataclass fallback; pick dataclasses for everything.
- Remove the `Config` inner classes and `@validator` hacks.
- Use `__post_init__` or `field(default_factory=…)` to inject defaults (e.g. `encoder`).
- Keep the colorama import small and mute on failure.
This preserves full behavior while cutting ~150 LOC of boilerplate.
```python
if v is None:
    return DenseEncoder(name="default")
return v
```
suggestion (code-quality): We've found these issues:
- Lift code into else after jump in control flow (`reintroduce-else`)
- Replace if statement with if expression (`assign-if-exp`)

Suggested change:

```diff
-if v is None:
-    return DenseEncoder(name="default")
-return v
+return DenseEncoder(name="default") if v is None else v
```
Summary
- `migrate` command to create embedding tables

Testing
- `pytest -q`
- `pytest --cov=github_to_sqlite -q`

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28