Conversation
Reviewer's GuideThis PR implements a complete embeddings pipeline for user-starred repositories: it extends the database layer to conditionally load and manage vector tables, adds new utility modules for chunking and tokenization, introduces a starred-embeddings CLI command with model selection and force/verbose flags, updates dependencies and CI to support the feature, and delivers thorough unit and integration tests alongside updated documentation and a phased plan. Sequence Diagram for
|
| Change | Details | Files |
|---|---|---|
| Load sqlite-vec extension and define embedding tables |
|
github_to_sqlite/utils.py |
| Introduce new modules for chunking, tokenization, and configuration |
|
github_to_sqlite/simple_chunker.pygithub_to_sqlite/sentencizer_chunker.pygithub_to_sqlite/tokenization.pygithub_to_sqlite/config.py |
| Add starred-embeddings CLI command |
|
github_to_sqlite/cli.py |
| Update project setup and CI for embedding support |
|
setup.pyREADME.md.github/workflows/test.yml |
| Add unit and integration tests for new features |
|
tests/test_repos.pytests/test_starred.pytests/test_starred_embeddings_command.pytests/test_starred_embeddings_integration.pytests/test_utils_functions.pytests/test_find_build_files.pytests/test_simple_chunker.pytests/test_sentencizer_chunker.pytests/test_chunk_readme.pytests/test_config.pytests/test_embedding_tables.pytests/test_migrate_command.py |
| Provide detailed plan and documentation for the embeddings feature |
|
PLAN.mddocs/conf.pydocs/index.rstdocs/embeddings.rstdocs/migrations.rstAGENTS.md |
Tips and commands
Interacting with Sourcery
- Trigger a new review: Comment
@sourcery-ai reviewon the pull request. - Continue discussions: Reply directly to Sourcery's review comments.
- Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with@sourcery-ai issueto create an issue from it. - Generate a pull request title: Write
@sourcery-aianywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai titleon the pull request to (re-)generate the title at any time. - Generate a pull request summary: Write
@sourcery-ai summaryanywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment@sourcery-ai summaryon the pull request to
(re-)generate the summary at any time. - Generate reviewer's guide: Comment
@sourcery-ai guideon the pull
request to (re-)generate the reviewer's guide at any time. - Resolve all Sourcery comments: Comment
@sourcery-ai resolveon the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore. - Dismiss all Sourcery reviews: Comment
@sourcery-ai dismisson the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai reviewto trigger a new review!
Customizing Your Experience
Access your dashboard to:
- Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others. - Change the review language.
- Add, remove or edit custom review instructions.
- Adjust other review settings.
Getting Help
- Contact our support team for questions or feedback.
- Visit our documentation for detailed guides and information.
- Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.
There was a problem hiding this comment.
Pull Request Overview
Implements the starred-embeddings feature end-to-end and adds supporting unit tests, CI updates, and documentation.
- Introduce utility functions for embedding tables, build-file parsing, README chunking, and vector serialization.
- Add a new CLI command
starred-embeddingswith DB migrations, schema updates, and optional sqlite-vec support. - Update dependencies, tests, CI workflow, and docs (RST and README) to cover the new embedding functionality.
Reviewed Changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_find_build_files.py | Unit tests for the new find_build_files helper. |
| tests/test_embedding_tables.py | Verifies creation of embedding tables via ensure_db_shape. |
| tests/test_config.py | Tests tokenizer loading with env var and HF from_pretrained. |
| tests/test_chunk_readme.py | Tests semantic-chunking fallback and blank-line split. |
| setup.py | Adds sentence-transformers, sqlite-vec, semantic-chunkers, etc. |
| github_to_sqlite/utils.py | Adds _maybe_load_sqlite_vec, ensure_embedding_tables, parsing, etc. |
| github_to_sqlite/cli.py | New starred-embeddings command, type casts, and imports. |
| docs/embeddings.rst, migrations.rst | Document the new embeddings command and migrations. |
| .github/workflows/test.yml | CI now runs pytest‐cov, mypy, ruff, and builds docs. |
Comments suppressed due to low confidence (1)
.github/workflows/test.yml:22
- [nitpick] You already include
mypyandruffin the[test]extras, so this manual install is redundant. You can remove it to simplify the CI step.
pip install mypy ruff
| ) | ||
|
|
||
| # Build file metadata | ||
| for build_path in utils.find_build_files(repo["full_name"]): |
There was a problem hiding this comment.
This attempts to scan local filesystem paths named by repo["full_name"], but the repo directory isn’t cloned locally. You’ll need to clone or download the repo contents before calling find_build_files, or switch to using the GitHub contents API.
| return arr.tobytes() | ||
|
|
||
|
|
||
| def parse_build_file(path: str) -> dict: |
There was a problem hiding this comment.
On Python 3.10, import tomllib will fail (it’s only in 3.11+), causing .toml files to always return {}. Consider importing tomli as a fallback when tomllib isn’t available.
There was a problem hiding this comment.
Hey @JeffCarpenter - I've reviewed your changes and they look great!
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| if cmd == 'find': | ||
| return '/usr/bin/find' |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if "repo_embeddings" not in tables: | ||
| db["repo_embeddings"].create({"repo_id": int, "title_embedding": bytes, "description_embedding": bytes, "readme_embedding": bytes}, pk="repo_id") |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if "readme_chunk_embeddings" not in tables: | ||
| db["readme_chunk_embeddings"].create({"repo_id": int, "chunk_index": int, "chunk_text": str, "embedding": bytes}, pk=("repo_id", "chunk_index")) |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if "repo_build_files" not in tables: | ||
| db["repo_build_files"].create({"repo_id": int, "file_path": str, "metadata": str}, pk=("repo_id", "file_path")) |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if "repo_metadata" not in tables: | ||
| db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id") |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| default="auth.json", | ||
| help="Path to auth.json token file", | ||
| ) | ||
| def starred_embeddings(db_path, model, force, verbose, auth): |
There was a problem hiding this comment.
issue (code-quality): Low code quality found in starred_embeddings - 17% (low-code-quality)
Explanation
The quality score for this function is below the quality threshold of 25%.This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines. - Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.
| from bs4 import BeautifulSoup | ||
|
|
||
| url = "https://github.com/{}/network/dependents".format(repo) | ||
| url: str | None = "https://github.com/{}/network/dependents".format(repo) |
There was a problem hiding this comment.
suggestion (code-quality): Replace call to format with f-string (use-fstring-for-formatting)
| url: str | None = "https://github.com/{}/network/dependents".format(repo) | |
| url: str | None = f"https://github.com/{repo}/network/dependents" |
| check=True, | ||
| ) | ||
| for line in result.stdout.splitlines(): | ||
| line = line.strip() |
There was a problem hiding this comment.
issue (code-quality): Use named expression to simplify assignment and conditional [×2] (use-named-expression)
| if cmd == 'find': | ||
| return '/usr/bin/find' | ||
| return None | ||
|
|
||
| def fake_run(args, capture_output, text, check): | ||
| pattern = args[3] | ||
| calls.append((args[0], pattern)) | ||
| class R: | ||
| def __init__(self): | ||
| self.stdout = 'repo/' + pattern + '\n' |
There was a problem hiding this comment.
issue (code-quality): We've found these issues:
- Lift code into else after jump in control flow (
reintroduce-else) - Replace if statement with if expression (
assign-if-exp) - Use f-string instead of string concatenation (
use-fstring-for-concatenation)
Summary
starred-embeddingsTesting
ruff check --fix .pytest -qpytest --cov=github_to_sqlite --cov-branch -qmypy github_to_sqlitehttps://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28
Summary by Sourcery
Add a new end-to-end starred embeddings feature and supporting infrastructure: introduce the
starred-embeddingsandmigratecommands, create embedding and metadata tables with sqlite-vec support, add utilities for chunking and build file discovery, update dependencies, CI, documentation, and comprehensive tests.New Features:
starred-embeddingsCLI command to compute and store embeddings for user-starred repositoriesmigratecommand and embedding tables (repo_embeddings,readme_chunk_embeddings,repo_build_files,repo_metadata) with optional sqlite-vec supportEnhancements:
Build:
sentence-transformers,sqlite-vec,nltk,onnx,pydantic,tokenizersand updateextras_requirefor tests, docs, and GPUCI:
Documentation:
Tests:
starred-embeddingsend-to-end