Conversation
Reviewer's Guide

This PR extends the github_to_sqlite package with new utility helpers for vector embedding storage, file parsing and chunking; integrates a new CLI command for generating embeddings on starred repositories (plus a migration command); updates documentation, setup dependencies and CI; and provides comprehensive tests covering all additions.

(Sequence diagram not captured in this export.)
| Change | Details | Files |
|---|---|---|
| Add utility helpers for embedding storage, chunking and file parsing | | `github_to_sqlite/utils.py` |
| Introduce starred-embeddings and migration CLI commands | | `github_to_sqlite/cli.py` |
| Update setup, documentation and CI to support new features | | `setup.py`, `README.md`, `.github/workflows/test.yml`, `PLAN.md` |
| Add tests covering new helpers, chunkers and CLI commands | | `tests/test_utils_functions.py`, `tests/test_chunk_readme.py`, `tests/test_find_build_files.py`, `github_to_sqlite/simple_chunker.py`, `tests/test_simple_chunker.py`, `github_to_sqlite/sentencizer_chunker.py`, `tests/test_sentencizer_chunker.py`, `tests/test_starred_embeddings_command.py`, `tests/test_migrate_command.py` |
Tips and commands
Interacting with Sourcery
- Trigger a new review: Comment `@sourcery-ai review` on the pull request.
- Continue discussions: Reply directly to Sourcery's review comments.
- Generate a GitHub issue from a review comment: Ask Sourcery to create an issue from a review comment by replying to it. You can also reply to a review comment with `@sourcery-ai issue` to create an issue from it.
- Generate a pull request title: Write `@sourcery-ai` anywhere in the pull request title to generate a title at any time. You can also comment `@sourcery-ai title` on the pull request to (re-)generate the title at any time.
- Generate a pull request summary: Write `@sourcery-ai summary` anywhere in the pull request body to generate a PR summary at any time, exactly where you want it. You can also comment `@sourcery-ai summary` on the pull request to (re-)generate the summary at any time.
- Generate reviewer's guide: Comment `@sourcery-ai guide` on the pull request to (re-)generate the reviewer's guide at any time.
- Resolve all Sourcery comments: Comment `@sourcery-ai resolve` on the pull request to resolve all Sourcery comments. Useful if you've already addressed all the comments and don't want to see them anymore.
- Dismiss all Sourcery reviews: Comment `@sourcery-ai dismiss` on the pull request to dismiss all existing Sourcery reviews. Especially useful if you want to start fresh with a new review - don't forget to comment `@sourcery-ai review` to trigger a new review!
Customizing Your Experience
Access your dashboard to:
- Enable or disable review features such as the Sourcery-generated pull request summary, the reviewer's guide, and others.
- Change the review language.
- Add, remove or edit custom review instructions.
- Adjust other review settings.
Getting Help
- Contact our support team for questions or feedback.
- Visit our documentation for detailed guides and information.
- Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.
Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:
- Add `import re` at the top of `utils.py` so `chunk_readme`'s fallback doesn't fail due to a missing `re` reference.
- Instead of silently swallowing all exceptions in `parse_build_file`, log or differentiate parsing errors so malformed build files don't hide failures.
- Consider moving heavy libraries (`sentence-transformers`, `sqlite-vec`, `nltk`, `onnx`, `tokenizers`) from core `install_requires` into optional extras to reduce the base install footprint.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Add `import re` at the top of `utils.py` so `chunk_readme`'s fallback doesn’t fail due to a missing `re` reference.
- Instead of silently swallowing all exceptions in `parse_build_file`, log or differentiate parsing errors so malformed build files don't hide failures.
- Consider moving heavy libraries (`sentence-transformers`, `sqlite-vec`, `nltk`, `onnx`, `tokenizers`) from core `install_requires` into optional extras to reduce the base install footprint.
## Individual Comments
### Comment 1
<location> `github_to_sqlite/utils.py:1053` </location>
<code_context>
+ ["fd", "-HI", "-t", "f", pattern, path],
+ capture_output=True,
+ text=True,
+ check=True,
+ )
+ for line in result.stdout.splitlines():
</code_context>
<issue_to_address>
Using check=True will propagate errors on no matches
Since `fd` returns a non-zero exit code when no files match, this will raise an exception. Use `check=False` and handle the exit code to support empty results.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
result = subprocess.run(
    ["fd", "-HI", "-t", "f", pattern, path],
    capture_output=True,
    text=True,
    check=True,
)
for line in result.stdout.splitlines():
    line = line.strip()
    if line:
        found.append(os.path.normpath(line))
=======
result = subprocess.run(
    ["fd", "-HI", "-t", "f", pattern, path],
    capture_output=True,
    text=True,
    check=False,
)
# fd returns exit code 1 if no files match; treat this as empty result, not error
if result.returncode not in (0, 1):
    result.check_returncode()
for line in result.stdout.splitlines():
    line = line.strip()
    if line:
        found.append(os.path.normpath(line))
>>>>>>> REPLACE
</suggested_fix>
### Comment 2
<location> `github_to_sqlite/cli.py:730` </location>
<code_context>
+ )
+
+ # Build file metadata
+ for build_path in utils.find_build_files(repo["full_name"]):
+ metadata = utils.parse_build_file(os.path.join(repo["full_name"], build_path))
+ db["repo_build_files"].upsert(
</code_context>
<issue_to_address>
find_build_files invoked on GitHub repo name, not a local path
`find_build_files` requires a local directory, but `repo["full_name"]` is a GitHub identifier. This will fail unless the repo is cloned. Please ensure the repo is available locally or adjust the file retrieval method.
</issue_to_address>
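A defensive shape for that call site (a sketch only; `clones_dir` and the guard are illustrative, not code from the PR) would skip repositories that have no local checkout:

```python
import os
from github_to_sqlite import utils

clones_dir = "/tmp/github-clones"  # hypothetical directory of local checkouts
repo = {"full_name": "owner/repo"}  # illustrative record shape

# Only scan for build files when a local clone of the repo actually exists
local_path = os.path.join(clones_dir, repo["full_name"])
if os.path.isdir(local_path):
    for build_path in utils.find_build_files(local_path):
        metadata = utils.parse_build_file(os.path.join(local_path, build_path))
```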
### Comment 3
<location> `setup.py:29` </location>
<code_context>
""",
- install_requires=["sqlite-utils>=2.7.2", "requests", "PyYAML"],
- extras_require={"test": ["pytest", "requests-mock", "bs4"]},
+ install_requires=[
+ "sqlite-utils>=2.7.2",
+ "requests",
</code_context>
<issue_to_address>
Heavy dependencies in default install_requires
Consider moving large dependencies to `extras_require` to reduce the default installation footprint.
</issue_to_address>
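A minimal sketch of that split (the extra name and exact package list are illustrative, not taken from the PR):

```python
# setup.py sketch: keep the base install light, gate embedding deps behind an extra
from setuptools import setup

setup(
    name="github-to-sqlite",
    install_requires=["sqlite-utils>=2.7.2", "requests", "PyYAML"],
    extras_require={
        # hypothetical "embeddings" extra covering the heavy dependencies named above
        "embeddings": ["sentence-transformers", "sqlite-vec", "nltk", "onnx", "tokenizers"],
        "test": ["pytest", "requests-mock", "bs4"],
    },
)
```

Users who want embeddings would then opt in with `pip install github-to-sqlite[embeddings]`.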
### Comment 4
<location> `github_to_sqlite/simple_chunker.py:7` </location>
<code_context>
+from .config import config
+
+try:
+ from pydantic.v1 import BaseModel, Extra, validator
+ from colorama import Fore, Style
+ from semantic_router.encoders.base import DenseEncoder
</code_context>
<issue_to_address>
Importing from pydantic.v1 may break under Pydantic v2
Since `pydantic.v1` is not present in v2, update your imports to use `pydantic` directly or restrict the version requirement to <2.0.
</issue_to_address>
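Whichever way `pydantic.v1` availability shakes out across versions, an import guard (a sketch, not the PR's code) avoids hard-coding one location:

```python
# Sketch: tolerate either import location rather than pinning pydantic
try:
    from pydantic.v1 import BaseModel, Extra, validator
except ImportError:
    from pydantic import BaseModel, Extra, validator
```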
### Comment 5
<location> `tests/test_utils_functions.py:20` </location>
<code_context>
+ assert arr.tolist() == [1.0, 2.0, 3.5]
+
+
+def test_parse_build_file_json_and_toml(tmp_path):
+ json_file = tmp_path / "package.json"
+ json_file.write_text('{"name": "pkg", "author": "me"}')
</code_context>
<issue_to_address>
Test error handling for `parse_build_file`.
Add tests for cases like non-existent file paths and malformed data to ensure the function returns an empty dictionary as intended.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def test_parse_build_file_json_and_toml(tmp_path):
    json_file = tmp_path / "package.json"
    json_file.write_text('{"name": "pkg", "author": "me"}')
    toml_file = tmp_path / "pyproject.toml"
    toml_file.write_text('name = "pkg"\nauthor = "you"')
    assert utils.parse_build_file(str(json_file)) == {"name": "pkg", "author": "me"}
    result = utils.parse_build_file(str(toml_file))
    assert result["name"] == "pkg"
    assert result["author"] == "you"
=======
def test_parse_build_file_json_and_toml(tmp_path):
    json_file = tmp_path / "package.json"
    json_file.write_text('{"name": "pkg", "author": "me"}')
    toml_file = tmp_path / "pyproject.toml"
    toml_file.write_text('name = "pkg"\nauthor = "you"')
    assert utils.parse_build_file(str(json_file)) == {"name": "pkg", "author": "me"}
    result = utils.parse_build_file(str(toml_file))
    assert result["name"] == "pkg"
    assert result["author"] == "you"


def test_parse_build_file_error_handling(tmp_path):
    # Non-existent file
    non_existent_file = tmp_path / "does_not_exist.json"
    assert utils.parse_build_file(str(non_existent_file)) == {}
    # Malformed JSON
    bad_json_file = tmp_path / "bad.json"
    bad_json_file.write_text('{"name": "pkg", "author": }')
    assert utils.parse_build_file(str(bad_json_file)) == {}
    # Malformed TOML
    bad_toml_file = tmp_path / "bad.toml"
    bad_toml_file.write_text('name = "pkg"\nauthor = ')
    assert utils.parse_build_file(str(bad_toml_file)) == {}
>>>>>>> REPLACE
</suggested_fix>
### Comment 6
<location> `tests/test_utils_functions.py:42` </location>
<code_context>
+ assert tree["sub"] == {"dirs": [], "files": ["file.txt"]}
+
+
+def test_maybe_load_sqlite_vec(monkeypatch):
+ db = sqlite_utils.Database(memory=True)
+
</code_context>
<issue_to_address>
Cover the case where `sqlite_vec.load()` fails.
Add a test that mocks `sqlite_vec.load` to raise an exception and verifies that `_maybe_load_sqlite_vec` returns `False`.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def test_maybe_load_sqlite_vec(monkeypatch):
    db = sqlite_utils.Database(memory=True)
=======
def test_maybe_load_sqlite_vec(monkeypatch):
    db = sqlite_utils.Database(memory=True)


def test_maybe_load_sqlite_vec_load_failure(monkeypatch):
    # Arrange: mock sqlite_vec.load to raise an exception
    import github_to_sqlite.utils as utils_mod

    db = sqlite_utils.Database(memory=True)

    def mock_load(db):
        raise RuntimeError("load failed")

    monkeypatch.setattr(utils_mod.sqlite_vec, "load", mock_load)
    # Act & Assert: _maybe_load_sqlite_vec should return False
    assert utils_mod._maybe_load_sqlite_vec(db) is False
>>>>>>> REPLACE
</suggested_fix>
| if "repo_metadata" not in tables: | ||
| db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id") |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
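For instance, a parametrized version (a sketch; the `db` fixture and the exact table list are hypothetical) removes the branch entirely:

```python
import pytest

# Sketch: assert each expected table directly instead of branching in the test body
@pytest.mark.parametrize("table", ["repo_metadata", "repo_build_files"])
def test_expected_tables_exist(db, table):  # "db" is a hypothetical fixture
    assert table in db.table_names()
```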
```python
    default="auth.json",
    help="Path to auth.json token file",
)
def starred_embeddings(db_path, model, force, verbose, auth):
```

issue (code-quality): Low code quality found in starred_embeddings - 17% (`low-code-quality`)

Explanation: The quality score for this function is below the quality threshold of 25%. This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this? It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into their own functions. This is the most important thing you can do - ideally a function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts sits together within the function rather than being scattered.
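A shape along those lines (a sketch only; all helper names are hypothetical, not functions that exist in this PR):

```python
# Sketch: the command body delegates to small, focused helpers
def starred_embeddings(db_path, model, force, verbose, auth):
    db = _open_db(db_path)  # hypothetical helper
    if not utils._maybe_load_sqlite_vec(db):
        click.echo("sqlite-vec extension not loaded; storing embeddings as BLOBs")
    for repo in _fetch_starred_repos(auth):  # hypothetical helper
        _embed_repo(db, repo, model, force=force, verbose=verbose)  # hypothetical helper
```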
```python
if v is None:
    return DenseEncoder(name="default")
return v
```

suggestion (code-quality): We've found these issues:

- Lift code into else after jump in control flow (`reintroduce-else`)
- Replace if statement with if expression (`assign-if-exp`)

Suggested change:

```python
return DenseEncoder(name="default") if v is None else v
```
```python
    check=True,
)
for line in result.stdout.splitlines():
    line = line.strip()
```

issue (code-quality): Use named expression to simplify assignment and conditional [×2] (`use-named-expression`)
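Applied to the loop above, the named-expression form would look roughly like this (a sketch of the linter's suggestion, not code from the PR):

```python
# Sketch: the walrus operator folds the strip-and-test into one line
for raw in result.stdout.splitlines():
    if line := raw.strip():
        found.append(os.path.normpath(line))
```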
```python
if cmd == 'find':
    return '/usr/bin/find'
return None


def fake_run(args, capture_output, text, check):
    pattern = args[3]
    calls.append((args[0], pattern))

    class R:
        def __init__(self):
            self.stdout = 'repo/' + pattern + '\n'
```

issue (code-quality): We've found these issues:

- Lift code into else after jump in control flow (`reintroduce-else`)
- Replace if statement with if expression (`assign-if-exp`)
- Use f-string instead of string concatenation (`use-fstring-for-concatenation`)
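Applying those suggestions to the test fakes gives roughly the following (a sketch; `fake_which` is a guessed name for the surrounding fake):

```python
calls = []

def fake_which(cmd):  # hypothetical name for the fake shown above
    return '/usr/bin/find' if cmd == 'find' else None

def fake_run(args, capture_output, text, check):
    pattern = args[3]
    calls.append((args[0], pattern))

    class R:
        def __init__(self):
            # f-string replaces 'repo/' + pattern + '\n'
            self.stdout = f"repo/{pattern}\n"

    return R()
```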
Pull Request Overview
Adds missing tests for core utilities, expands setup dependencies, and implements embedding support with a new CLI command and documentation updates.
- Add CLI command `starred-embeddings` with optional sqlite-vec support and directory metadata.
- Introduce utility helpers: `vector_to_blob`, `parse_build_file`, `directory_tree`, and optional sqlite-vec loading.
- Expand tests for migration, build-file discovery, database shape, config loading, and readme chunking.
Reviewed Changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_migrate_command.py | Test that migrate CLI creates expected embedding tables |
| tests/test_find_build_files.py | Tests for find_build_files() using fd, find, and walk |
| tests/test_embedding_tables.py | Verify embedding tables are created by ensure_db_shape() |
| tests/test_config.py | Test loading tokenizer via env var and tokenizers monkeypatch |
| tests/test_chunk_readme.py | Tests for chunk_readme() fallback and optional chunker import |
| setup.py | Add runtime and extra dependencies for embeddings and docs |
| github_to_sqlite/utils.py | Add embedding table creation, sqlite-vec loader, helpers |
| github_to_sqlite/tokenization.py | Tokenizer loading with env/CLI fallback |
| github_to_sqlite/simple_chunker.py | Implement a Pydantic-based chunker with NLTK fallback |
| github_to_sqlite/sentencizer_chunker.py | Basic chunker splitting on period tokens |
| github_to_sqlite/config.py | Add pydantic-based default config values |
| github_to_sqlite/cli.py | Add starred-embeddings command with vector storage logic |
| docs/*.rst | Documentation updates for migrations, embeddings, and index |
| .github/workflows/test.yml | CI updated to run coverage and build docs |
Comments suppressed due to low confidence (2)
`github_to_sqlite/utils.py:1087`

- There are no unit tests covering `vector_to_blob`, `parse_build_file`, `directory_tree`, or `_maybe_load_sqlite_vec`. Consider adding tests for these helpers to ensure reliable behavior.

```python
def vector_to_blob(vec) -> bytes:
```

`github_to_sqlite/utils.py:751`

- [nitpick] Embedding dimension is hard-coded to 768; if a different model is used, this may cause a mismatch. Consider making the vector size configurable or detecting it at runtime.

```sql
title_embedding float[768],
```
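One way to avoid the hard-coded width (a sketch using the sentence-transformers API; the model name is only an example):

```python
from sentence_transformers import SentenceTransformer

# Sketch: derive the embedding width from the loaded model at runtime
model = SentenceTransformer("all-mpnet-base-v2")
dim = model.get_sentence_embedding_dimension()  # 768 for this particular model
column_ddl = f"title_embedding float[{dim}]"    # sketch of the DDL fragment above
```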
```python
except Exception:
    _SQLITE_VEC_LOADED = False
except Exception:
```

Catching all exceptions here can mask unrelated errors; consider catching ImportError and any specific exceptions raised by `sqlite_vec.load` instead.

Suggested change:

```python
except sqlite_vec.LoadError:
    _SQLITE_VEC_LOADED = False
except ImportError:
```
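If `sqlite_vec` turns out not to expose a dedicated error class, a narrowing that relies only on standard exceptions (a sketch, assuming the usual `sqlite_vec.load(conn)` entry point) could be:

```python
import sqlite3

try:
    import sqlite_vec
    sqlite_vec.load(db.conn)
    _SQLITE_VEC_LOADED = True
except ImportError:
    _SQLITE_VEC_LOADED = False  # sqlite-vec not installed
except sqlite3.OperationalError:
    _SQLITE_VEC_LOADED = False  # extension present but failed to load
```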
```python
            return json.load(fp)
        with open(path, "rb") as fp:
            return tomllib.load(fp)
    except Exception:
```

This broad except silently ignores all errors when parsing build files; narrowing to `json.JSONDecodeError`, `tomllib.TOMLDecodeError`, and `FileNotFoundError` would avoid masking other issues.

Suggested change:

```python
    except (json.JSONDecodeError, tomllib.TOMLDecodeError, FileNotFoundError) as e:
        print(f"Error parsing build file {path}: {e}")
```
```python
click.echo("sqlite-vec extension not loaded; storing embeddings as BLOBs")

env_model = os.environ.get("GITHUB_TO_SQLITE_MODEL")
model_name = model or env_model or config.config.default_model
```

[nitpick] Referencing `config.config.default_model` is confusing because both the module and instance are named `config`. Import the instance directly (`from github_to_sqlite.config import config`) to use `config.default_model`.

Suggested change:

```python
model_name = model or env_model or config.default_model
```
```python
import os
import subprocess
import shutil
```

[nitpick] Imports for `os`, `subprocess`, and `shutil` are mid-file; consider grouping all imports at the top of the module for clarity. (The suggested change deletes these lines in favor of top-of-module imports.)
Summary
- `starred-embeddings` tests to fake the sqlite-vec module

Testing

- `pytest -q`
- `pytest --cov=github_to_sqlite --cov-branch -q`

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28
Summary by Sourcery
Add coverage for utility helpers and embedding features, introduce new CLI commands for database migrations and starred repository embeddings, integrate sqlite-vec for vector storage, and update dependencies, documentation, and CI to support these changes.
New Features:
Enhancements:
CI:
Documentation:
Tests: