Add comprehensive integration test for embeddings#6
Add comprehensive integration test for embeddings#6JeffCarpenter wants to merge 1 commit intomainfrom
Conversation
Reviewer's GuideThis PR expands the embeddings feature with full end-to-end support and testing by loading the sqlite-vec extension, creating new embedding and metadata tables, adding a Click command to generate and store starred-repo embeddings, introducing helper utilities for chunking and file parsing, updating dependencies, CI, and docs, and providing comprehensive unit and integration tests to verify stored vectors and metadata. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Pull Request Overview
This PR implements the end-to-end embeddings feature and adds corresponding tests, CLI commands, documentation updates, and dependency changes.
- Introduces embedding storage tables (
repo_embeddings,readme_chunk_embeddings), build-file metadata, andensure_embedding_tables - Adds new CLI command
starred-embeddingsplus updatedsetup.py, CI workflows, and docs - Expands unit tests for utilities (
find_build_files,chunk_readme,ensure_db_shape, etc.)
Reviewed Changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| github_to_sqlite/utils.py | Added embedding table creation, chunking & build-file helpers |
| github_to_sqlite/tokenization.py | Exposed load_tokenizer using HF Tokenizer |
| github_to_sqlite/simple_chunker.py | New simple chunker implementation |
| github_to_sqlite/sentencizer_chunker.py | Added BasicSentencizerChunker |
| github_to_sqlite/config.py | New Pydantic config defaults |
| github_to_sqlite/cli.py | New starred-embeddings command and type casting |
| setup.py | Updated dependencies and extras |
| mypy.ini | Added strict type‐checking flags |
| tests/ | New unit tests for all new utils and helpers |
| docs/ | Added RST docs for embeddings and migrations |
| PLAN.md, AGENTS.md | Marked embedding plan complete and notes |
| .github/workflows/test.yml | CI now runs ruff, mypy, docs build |
| click.echo("sqlite-vec extension not loaded; storing embeddings as BLOBs") | ||
|
|
||
| env_model = os.environ.get("GITHUB_TO_SQLITE_MODEL") | ||
| model_name = model or env_model or config.config.default_model |
There was a problem hiding this comment.
The reference config.config.default_model is incorrect—config is the instance imported from config.py, so it should be config.default_model.
| model_name = model or env_model or config.config.default_model | |
| model_name = model or env_model or config.default_model |
| calls.append(model) | ||
| return 'tok' | ||
|
|
||
| monkeypatch.setattr('tokenizers.Tokenizer.from_pretrained', fake_from_pretrained) |
There was a problem hiding this comment.
The monkeypatch.setattr target should be a module attribute, not a string—replace with e.g. monkeypatch.setattr(tokenization.Tokenizer, 'from_pretrained', fake_from_pretrained) or patch Tokenizer.from_pretrained on the actual imported class.
| monkeypatch.setattr('tokenizers.Tokenizer.from_pretrained', fake_from_pretrained) | |
| monkeypatch.setattr(Tokenizer, 'from_pretrained', fake_from_pretrained) |
There was a problem hiding this comment.
Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:
- In ensure_embedding_tables you have nearly identical logic for creating non-vec tables—consider extracting a small helper (e.g. create_table_if_missing) to reduce duplication and make future additions easier.
- find_build_files’ Python fallback could be simplified by using pathlib.Path(path).rglob(pattern) instead of manual os.walk and subprocess calls, making it more cross-platform and easier to maintain.
- The starred-embeddings command currently encodes each repo’s chunks in one shot; for users with many stars you might batch or stream encode calls to avoid blowing up memory or exceeding model input limits.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In ensure_embedding_tables you have nearly identical logic for creating non-vec tables—consider extracting a small helper (e.g. create_table_if_missing) to reduce duplication and make future additions easier.
- find_build_files’ Python fallback could be simplified by using pathlib.Path(path).rglob(pattern) instead of manual os.walk and subprocess calls, making it more cross-platform and easier to maintain.
- The starred-embeddings command currently encodes each repo’s chunks in one shot; for users with many stars you might batch or stream encode calls to avoid blowing up memory or exceeding model input limits.
## Individual Comments
### Comment 1
<location> `github_to_sqlite/utils.py:1053` </location>
<code_context>
+
+ if shutil.which("fd"):
+ for pattern in BUILD_PATTERNS:
+ result = subprocess.run(
+ ["fd", "-HI", "-t", "f", pattern, path],
+ capture_output=True,
</code_context>
<issue_to_address>
subprocess.run(check=True) can raise and crash
Wrap the call in try/except to handle CalledProcessError and ensure the program handles failures gracefully.
</issue_to_address>
### Comment 2
<location> `docs/migrations.rst:12` </location>
<code_context>
+
+This sets up the ``repo_embeddings``, ``readme_chunk_embeddings``, ``repo_build_files`` and ``repo_metadata`` tables. The helper prefers the ``sqlite-vec`` extension when available.
+
+Some commands look for standard build definitions such as ``pyproject.toml`` or ``package.json``. The ``find_build_files()`` helper uses the ``fd`` command if installed, otherwise falling back to ``find`` or a Python implementation.
</code_context>
<issue_to_address>
Consider adjusting the presentation of the build file detection paragraph.
Clarify the relationship between this paragraph and the earlier discussion of the `migrate` command, or consider making it a separate sub-topic for improved readability.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
|
||
| if shutil.which("fd"): | ||
| for pattern in BUILD_PATTERNS: | ||
| result = subprocess.run( |
There was a problem hiding this comment.
issue (bug_risk): subprocess.run(check=True) can raise and crash
Wrap the call in try/except to handle CalledProcessError and ensure the program handles failures gracefully.
|
|
||
| This sets up the ``repo_embeddings``, ``readme_chunk_embeddings``, ``repo_build_files`` and ``repo_metadata`` tables. The helper prefers the ``sqlite-vec`` extension when available. | ||
|
|
||
| Some commands look for standard build definitions such as ``pyproject.toml`` or ``package.json``. The ``find_build_files()`` helper uses the ``fd`` command if installed, otherwise falling back to ``find`` or a Python implementation. |
There was a problem hiding this comment.
suggestion: Consider adjusting the presentation of the build file detection paragraph.
Clarify the relationship between this paragraph and the earlier discussion of the migrate command, or consider making it a separate sub-topic for improved readability.
| if cmd == 'find': | ||
| return '/usr/bin/find' |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if "repo_embeddings" not in tables: | ||
| db["repo_embeddings"].create({"repo_id": int, "title_embedding": bytes, "description_embedding": bytes, "readme_embedding": bytes}, pk="repo_id") |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if "readme_chunk_embeddings" not in tables: | ||
| db["readme_chunk_embeddings"].create({"repo_id": int, "chunk_index": int, "chunk_text": str, "embedding": bytes}, pk=("repo_id", "chunk_index")) |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if "repo_metadata" not in tables: | ||
| db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id") |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| default="auth.json", | ||
| help="Path to auth.json token file", | ||
| ) | ||
| def starred_embeddings(db_path, model, force, verbose, auth): |
There was a problem hiding this comment.
issue (code-quality): Low code quality found in starred_embeddings - 17% (low-code-quality)
Explanation
The quality score for this function is below the quality threshold of 25%.This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines. - Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.
| from bs4 import BeautifulSoup | ||
|
|
||
| url = "https://github.com/{}/network/dependents".format(repo) | ||
| url: str | None = "https://github.com/{}/network/dependents".format(repo) |
There was a problem hiding this comment.
suggestion (code-quality): Replace call to format with f-string (use-fstring-for-formatting)
| url: str | None = "https://github.com/{}/network/dependents".format(repo) | |
| url: str | None = f"https://github.com/{repo}/network/dependents" |
| check=True, | ||
| ) | ||
| for line in result.stdout.splitlines(): | ||
| line = line.strip() |
There was a problem hiding this comment.
issue (code-quality): Use named expression to simplify assignment and conditional [×2] (use-named-expression)
| if cmd == 'find': | ||
| return '/usr/bin/find' | ||
| return None | ||
|
|
||
| def fake_run(args, capture_output, text, check): | ||
| pattern = args[3] | ||
| calls.append((args[0], pattern)) | ||
| class R: | ||
| def __init__(self): | ||
| self.stdout = 'repo/' + pattern + '\n' |
There was a problem hiding this comment.
issue (code-quality): We've found these issues:
- Lift code into else after jump in control flow (
reintroduce-else) - Replace if statement with if expression (
assign-if-exp) - Use f-string instead of string concatenation (
use-fstring-for-concatenation)
Summary
test_starred_embeddings_integrationto verify stored dataTesting
ruff check --fix .pytest -qpytest --cov=github_to_sqlite --cov-branch -qmypy github_to_sqlite(failed: command produced no output)https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28
Summary by Sourcery
Add a new embeddings feature for starred repositories, including database schema changes, CLI commands, utility functions, documentation, and tests to support generating, storing, and migrating vector embeddings for repo titles, descriptions, README chunks, and build metadata.
New Features:
starred-embeddingsCLI command to compute and store embeddings for user-starred repositoriesrepo_embeddings,readme_chunk_embeddings,repo_build_files,repo_metadata) with optional sqlite-vec supportmigrateCLI command to initialize embedding tables and other optional schema elementsEnhancements:
ensure_db_shapeBuild:
setup.pyto include sentence-transformers, sqlite-vec, semantic-chunkers, docs and testing toolsCI:
Documentation:
PLAN.mdand Sphinx documentation for embeddings, migrations, and build file detectionTests: