Conversation
Reviewer's Guide

This PR extends the github_to_sqlite package with new utility helpers for vector embedding storage, file parsing and chunking; integrates a new CLI command for generating embeddings on starred repositories (plus a migration command); updates documentation, setup dependencies and CI; and provides comprehensive tests covering all additions.

(Sequence diagram not captured in this export.)
| Change | Details | Files |
|---|---|---|
| Add utility helpers for embedding storage, chunking and file parsing | | `github_to_sqlite/utils.py` |
| Introduce starred-embeddings and migration CLI commands | | `github_to_sqlite/cli.py` |
| Update setup, documentation and CI to support new features | | `setup.py`, `README.md`, `.github/workflows/test.yml`, `PLAN.md` |
| Add tests covering new helpers, chunkers and CLI commands | | `tests/test_utils_functions.py`, `tests/test_chunk_readme.py`, `tests/test_find_build_files.py`, `github_to_sqlite/simple_chunker.py`, `tests/test_simple_chunker.py`, `github_to_sqlite/sentencizer_chunker.py`, `tests/test_sentencizer_chunker.py`, `tests/test_starred_embeddings_command.py`, `tests/test_migrate_command.py` |
Tips and commands
Interacting with Sourcery
- Trigger a new review: Comment `@sourcery-ai review` on the pull request.
- Continue discussions: Reply directly to Sourcery's review comments.
- Generate a GitHub issue from a review comment: Ask Sourcery to create an issue from a review comment by replying to it. You can also reply to a review comment with `@sourcery-ai issue` to create an issue from it.
- Generate a pull request title: Write `@sourcery-ai` anywhere in the pull request title to generate a title at any time. You can also comment `@sourcery-ai title` on the pull request to (re-)generate the title at any time.
- Generate a pull request summary: Write `@sourcery-ai summary` anywhere in the pull request body to generate a PR summary at any time, exactly where you want it. You can also comment `@sourcery-ai summary` on the pull request to (re-)generate the summary at any time.
- Generate reviewer's guide: Comment `@sourcery-ai guide` on the pull request to (re-)generate the reviewer's guide at any time.
- Resolve all Sourcery comments: Comment `@sourcery-ai resolve` on the pull request to resolve all Sourcery comments. Useful if you've already addressed all the comments and don't want to see them anymore.
- Dismiss all Sourcery reviews: Comment `@sourcery-ai dismiss` on the pull request to dismiss all existing Sourcery reviews. Especially useful if you want to start fresh with a new review - don't forget to comment `@sourcery-ai review` to trigger a new review!
Customizing Your Experience
Access your dashboard to:
- Enable or disable review features such as the Sourcery-generated pull request summary, the reviewer's guide, and others.
- Change the review language.
- Add, remove or edit custom review instructions.
- Adjust other review settings.
Getting Help
- Contact our support team for questions or feedback.
- Visit our documentation for detailed guides and information.
- Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.
Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:
- Add `import re` at the top of `utils.py` so `chunk_readme`'s fallback doesn't fail due to a missing `re` reference.
- Instead of silently swallowing all exceptions in `parse_build_file`, log or differentiate parsing errors so malformed build files don't hide failures.
- Consider moving heavy libraries (`sentence-transformers`, `sqlite-vec`, `nltk`, `onnx`, `tokenizers`) from core `install_requires` into optional extras to reduce the base install footprint.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Add `import re` at the top of `utils.py` so `chunk_readme`'s fallback doesn’t fail due to a missing `re` reference.
- Instead of silently swallowing all exceptions in `parse_build_file`, log or differentiate parsing errors so malformed build files don't hide failures.
- Consider moving heavy libraries (`sentence-transformers`, `sqlite-vec`, `nltk`, `onnx`, `tokenizers`) from core `install_requires` into optional extras to reduce the base install footprint.
## Individual Comments
### Comment 1
<location> `github_to_sqlite/utils.py:1053` </location>
<code_context>
+ ["fd", "-HI", "-t", "f", pattern, path],
+ capture_output=True,
+ text=True,
+ check=True,
+ )
+ for line in result.stdout.splitlines():
</code_context>
<issue_to_address>
Using check=True will propagate errors on no matches
Since `fd` returns a non-zero exit code when no files match, this will raise an exception. Use `check=False` and handle the exit code to support empty results.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
result = subprocess.run(
    ["fd", "-HI", "-t", "f", pattern, path],
    capture_output=True,
    text=True,
    check=True,
)
for line in result.stdout.splitlines():
    line = line.strip()
    if line:
        found.append(os.path.normpath(line))
=======
result = subprocess.run(
    ["fd", "-HI", "-t", "f", pattern, path],
    capture_output=True,
    text=True,
    check=False,
)
# fd returns exit code 1 if no files match; treat this as empty result, not error
if result.returncode not in (0, 1):
    result.check_returncode()
for line in result.stdout.splitlines():
    line = line.strip()
    if line:
        found.append(os.path.normpath(line))
>>>>>>> REPLACE
</suggested_fix>
### Comment 2
<location> `github_to_sqlite/cli.py:730` </location>
<code_context>
+ )
+
+ # Build file metadata
+ for build_path in utils.find_build_files(repo["full_name"]):
+ metadata = utils.parse_build_file(os.path.join(repo["full_name"], build_path))
+ db["repo_build_files"].upsert(
</code_context>
<issue_to_address>
find_build_files invoked on GitHub repo name, not a local path
`find_build_files` requires a local directory, but `repo["full_name"]` is a GitHub identifier. This will fail unless the repo is cloned. Please ensure the repo is available locally or adjust the file retrieval method.
</issue_to_address>
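A defensive shape for that call site (a sketch only; `clones_dir` and the guard are illustrative, not code from the PR) would skip repositories that have no local checkout:

```python
import os
from github_to_sqlite import utils

clones_dir = "/tmp/github-clones"  # hypothetical directory of local checkouts
repo = {"full_name": "owner/repo"}  # illustrative record shape

# Only scan for build files when a local clone of the repo actually exists
local_path = os.path.join(clones_dir, repo["full_name"])
if os.path.isdir(local_path):
    for build_path in utils.find_build_files(local_path):
        metadata = utils.parse_build_file(os.path.join(local_path, build_path))
```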
### Comment 3
<location> `setup.py:29` </location>
<code_context>
""",
- install_requires=["sqlite-utils>=2.7.2", "requests", "PyYAML"],
- extras_require={"test": ["pytest", "requests-mock", "bs4"]},
+ install_requires=[
+ "sqlite-utils>=2.7.2",
+ "requests",
</code_context>
<issue_to_address>
Heavy dependencies in default install_requires
Consider moving large dependencies to `extras_require` to reduce the default installation footprint.
</issue_to_address>
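A minimal sketch of that split (the extra name and exact package list are illustrative, not taken from the PR):

```python
# setup.py sketch: keep the base install light, gate embedding deps behind an extra
from setuptools import setup

setup(
    name="github-to-sqlite",
    install_requires=["sqlite-utils>=2.7.2", "requests", "PyYAML"],
    extras_require={
        # hypothetical "embeddings" extra covering the heavy dependencies named above
        "embeddings": ["sentence-transformers", "sqlite-vec", "nltk", "onnx", "tokenizers"],
        "test": ["pytest", "requests-mock", "bs4"],
    },
)
```

Users who want embeddings would then opt in with `pip install github-to-sqlite[embeddings]`.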
### Comment 4
<location> `github_to_sqlite/simple_chunker.py:7` </location>
<code_context>
+from .config import config
+
+try:
+ from pydantic.v1 import BaseModel, Extra, validator
+ from colorama import Fore, Style
+ from semantic_router.encoders.base import DenseEncoder
</code_context>
<issue_to_address>
Importing from pydantic.v1 may break under Pydantic v2
Since `pydantic.v1` is not present in v2, update your imports to use `pydantic` directly or restrict the version requirement to <2.0.
</issue_to_address>
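Whichever way `pydantic.v1` availability shakes out across versions, an import guard (a sketch, not the PR's code) avoids hard-coding one location:

```python
# Sketch: tolerate either import location rather than pinning pydantic
try:
    from pydantic.v1 import BaseModel, Extra, validator
except ImportError:
    from pydantic import BaseModel, Extra, validator
```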
### Comment 5
<location> `tests/test_utils_functions.py:20` </location>
<code_context>
+ assert arr.tolist() == [1.0, 2.0, 3.5]
+
+
+def test_parse_build_file_json_and_toml(tmp_path):
+ json_file = tmp_path / "package.json"
+ json_file.write_text('{"name": "pkg", "author": "me"}')
</code_context>
<issue_to_address>
Test error handling for `parse_build_file`.
Add tests for cases like non-existent file paths and malformed data to ensure the function returns an empty dictionary as intended.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def test_parse_build_file_json_and_toml(tmp_path):
    json_file = tmp_path / "package.json"
    json_file.write_text('{"name": "pkg", "author": "me"}')
    toml_file = tmp_path / "pyproject.toml"
    toml_file.write_text('name = "pkg"\nauthor = "you"')
    assert utils.parse_build_file(str(json_file)) == {"name": "pkg", "author": "me"}
    result = utils.parse_build_file(str(toml_file))
    assert result["name"] == "pkg"
    assert result["author"] == "you"
=======
def test_parse_build_file_json_and_toml(tmp_path):
    json_file = tmp_path / "package.json"
    json_file.write_text('{"name": "pkg", "author": "me"}')
    toml_file = tmp_path / "pyproject.toml"
    toml_file.write_text('name = "pkg"\nauthor = "you"')
    assert utils.parse_build_file(str(json_file)) == {"name": "pkg", "author": "me"}
    result = utils.parse_build_file(str(toml_file))
    assert result["name"] == "pkg"
    assert result["author"] == "you"


def test_parse_build_file_error_handling(tmp_path):
    # Non-existent file
    non_existent_file = tmp_path / "does_not_exist.json"
    assert utils.parse_build_file(str(non_existent_file)) == {}
    # Malformed JSON
    bad_json_file = tmp_path / "bad.json"
    bad_json_file.write_text('{"name": "pkg", "author": }')
    assert utils.parse_build_file(str(bad_json_file)) == {}
    # Malformed TOML
    bad_toml_file = tmp_path / "bad.toml"
    bad_toml_file.write_text('name = "pkg"\nauthor = ')
    assert utils.parse_build_file(str(bad_toml_file)) == {}
>>>>>>> REPLACE
</suggested_fix>
### Comment 6
<location> `tests/test_utils_functions.py:42` </location>
<code_context>
+ assert tree["sub"] == {"dirs": [], "files": ["file.txt"]}
+
+
+def test_maybe_load_sqlite_vec(monkeypatch):
+ db = sqlite_utils.Database(memory=True)
+
</code_context>
<issue_to_address>
Cover the case where `sqlite_vec.load()` fails.
Add a test that mocks `sqlite_vec.load` to raise an exception and verifies that `_maybe_load_sqlite_vec` returns `False`.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def test_maybe_load_sqlite_vec(monkeypatch):
    db = sqlite_utils.Database(memory=True)
=======
def test_maybe_load_sqlite_vec(monkeypatch):
    db = sqlite_utils.Database(memory=True)


def test_maybe_load_sqlite_vec_load_failure(monkeypatch):
    # Arrange: mock sqlite_vec.load to raise an exception
    import github_to_sqlite.utils as utils_mod

    db = sqlite_utils.Database(memory=True)

    def mock_load(db):
        raise RuntimeError("load failed")

    monkeypatch.setattr(utils_mod.sqlite_vec, "load", mock_load)
    # Act & Assert: _maybe_load_sqlite_vec should return False
    assert utils_mod._maybe_load_sqlite_vec(db) is False
>>>>>>> REPLACE
</suggested_fix>
| if "repo_metadata" not in tables: | ||
| db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id") |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
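For instance, a parametrized version (a sketch; the `db` fixture and the exact table list are hypothetical) removes the branch entirely:

```python
import pytest

# Sketch: assert each expected table directly instead of branching in the test body
@pytest.mark.parametrize("table", ["repo_metadata", "repo_build_files"])
def test_expected_tables_exist(db, table):  # "db" is a hypothetical fixture
    assert table in db.table_names()
```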
```python
    default="auth.json",
    help="Path to auth.json token file",
)
def starred_embeddings(db_path, model, force, verbose, auth):
```

issue (code-quality): Low code quality found in starred_embeddings - 17% (`low-code-quality`)

Explanation: The quality score for this function is below the quality threshold of 25%. This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this? It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into their own functions. This is the most important thing you can do - ideally a function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts sits together within the function rather than being scattered.
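A shape along those lines (a sketch only; all helper names are hypothetical, not functions that exist in this PR):

```python
# Sketch: the command body delegates to small, focused helpers
def starred_embeddings(db_path, model, force, verbose, auth):
    db = _open_db(db_path)  # hypothetical helper
    if not utils._maybe_load_sqlite_vec(db):
        click.echo("sqlite-vec extension not loaded; storing embeddings as BLOBs")
    for repo in _fetch_starred_repos(auth):  # hypothetical helper
        _embed_repo(db, repo, model, force=force, verbose=verbose)  # hypothetical helper
```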
```python
if v is None:
    return DenseEncoder(name="default")
return v
```

suggestion (code-quality): We've found these issues:

- Lift code into else after jump in control flow (`reintroduce-else`)
- Replace if statement with if expression (`assign-if-exp`)

Suggested change:

```python
return DenseEncoder(name="default") if v is None else v
```
```python
    check=True,
)
for line in result.stdout.splitlines():
    line = line.strip()
```

issue (code-quality): Use named expression to simplify assignment and conditional [×2] (`use-named-expression`)
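Applied to the loop above, the named-expression form would look roughly like this (a sketch of the linter's suggestion, not code from the PR):

```python
# Sketch: the walrus operator folds the strip-and-test into one line
for raw in result.stdout.splitlines():
    if line := raw.strip():
        found.append(os.path.normpath(line))
```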
```python
if cmd == 'find':
    return '/usr/bin/find'
return None


def fake_run(args, capture_output, text, check):
    pattern = args[3]
    calls.append((args[0], pattern))

    class R:
        def __init__(self):
            self.stdout = 'repo/' + pattern + '\n'
```

issue (code-quality): We've found these issues:

- Lift code into else after jump in control flow (`reintroduce-else`)
- Replace if statement with if expression (`assign-if-exp`)
- Use f-string instead of string concatenation (`use-fstring-for-concatenation`)
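Applying those suggestions to the test fakes gives roughly the following (a sketch; `fake_which` is a guessed name for the surrounding fake):

```python
calls = []

def fake_which(cmd):  # hypothetical name for the fake shown above
    return '/usr/bin/find' if cmd == 'find' else None

def fake_run(args, capture_output, text, check):
    pattern = args[3]
    calls.append((args[0], pattern))

    class R:
        def __init__(self):
            # f-string replaces 'repo/' + pattern + '\n'
            self.stdout = f"repo/{pattern}\n"

    return R()
```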
Pull Request Overview
Adds missing tests for core utilities, expands setup dependencies, and implements embedding support with a new CLI command and documentation updates.
- Add CLI command `starred-embeddings` with optional sqlite-vec support and directory metadata.
- Introduce utility helpers: `vector_to_blob`, `parse_build_file`, `directory_tree`, and optional sqlite-vec loading.
- Expand tests for migration, build-file discovery, database shape, config loading, and readme chunking.
Reviewed Changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_migrate_command.py | Test that migrate CLI creates expected embedding tables |
| tests/test_find_build_files.py | Tests for find_build_files() using fd, find, and walk |
| tests/test_embedding_tables.py | Verify embedding tables are created by ensure_db_shape() |
| tests/test_config.py | Test loading tokenizer via env var and tokenizers monkeypatch |
| tests/test_chunk_readme.py | Tests for chunk_readme() fallback and optional chunker import |
| setup.py | Add runtime and extra dependencies for embeddings and docs |
| github_to_sqlite/utils.py | Add embedding table creation, sqlite-vec loader, helpers |
| github_to_sqlite/tokenization.py | Tokenizer loading with env/CLI fallback |
| github_to_sqlite/simple_chunker.py | Implement a Pydantic-based chunker with NLTK fallback |
| github_to_sqlite/sentencizer_chunker.py | Basic chunker splitting on period tokens |
| github_to_sqlite/config.py | Add pydantic-based default config values |
| github_to_sqlite/cli.py | Add starred-embeddings command with vector storage logic |
| docs/*.rst | Documentation updates for migrations, embeddings, and index |
| .github/workflows/test.yml | CI updated to run coverage and build docs |
Comments suppressed due to low confidence (2)
`github_to_sqlite/utils.py:1087`

- There are no unit tests covering `vector_to_blob`, `parse_build_file`, `directory_tree`, or `_maybe_load_sqlite_vec`. Consider adding tests for these helpers to ensure reliable behavior.

```python
def vector_to_blob(vec) -> bytes:
```

`github_to_sqlite/utils.py:751`

- [nitpick] Embedding dimension is hard-coded to 768; if a different model is used, this may cause a mismatch. Consider making the vector size configurable or detecting it at runtime.

```sql
title_embedding float[768],
```
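One way to avoid the hard-coded width (a sketch using the sentence-transformers API; the model name is only an example):

```python
from sentence_transformers import SentenceTransformer

# Sketch: derive the embedding width from the loaded model at runtime
model = SentenceTransformer("all-mpnet-base-v2")
dim = model.get_sentence_embedding_dimension()  # 768 for this particular model
column_ddl = f"title_embedding float[{dim}]"    # sketch of the DDL fragment above
```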
```python
except Exception:
    _SQLITE_VEC_LOADED = False
except Exception:
```

Catching all exceptions here can mask unrelated errors; consider catching ImportError and any specific exceptions raised by `sqlite_vec.load` instead.

Suggested change:

```python
except sqlite_vec.LoadError:
    _SQLITE_VEC_LOADED = False
except ImportError:
```
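If `sqlite_vec` turns out not to expose a dedicated error class, a narrowing that relies only on standard exceptions (a sketch, assuming the usual `sqlite_vec.load(conn)` entry point) could be:

```python
import sqlite3

try:
    import sqlite_vec
    sqlite_vec.load(db.conn)
    _SQLITE_VEC_LOADED = True
except ImportError:
    _SQLITE_VEC_LOADED = False  # sqlite-vec not installed
except sqlite3.OperationalError:
    _SQLITE_VEC_LOADED = False  # extension present but failed to load
```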
```python
            return json.load(fp)
        with open(path, "rb") as fp:
            return tomllib.load(fp)
    except Exception:
```

This broad except silently ignores all errors when parsing build files; narrowing to `json.JSONDecodeError`, `tomllib.TOMLDecodeError`, and `FileNotFoundError` would avoid masking other issues.

Suggested change:

```python
    except (json.JSONDecodeError, tomllib.TOMLDecodeError, FileNotFoundError) as e:
        print(f"Error parsing build file {path}: {e}")
```
```python
click.echo("sqlite-vec extension not loaded; storing embeddings as BLOBs")

env_model = os.environ.get("GITHUB_TO_SQLITE_MODEL")
model_name = model or env_model or config.config.default_model
```

[nitpick] Referencing `config.config.default_model` is confusing because both the module and instance are named `config`. Import the instance directly (`from github_to_sqlite.config import config`) to use `config.default_model`.

Suggested change:

```python
model_name = model or env_model or config.default_model
```
```python
import os
import subprocess
import shutil
```

[nitpick] Imports for `os`, `subprocess`, and `shutil` are mid-file; consider grouping all imports at the top of the module for clarity. (The suggested change deletes these lines in favor of top-of-module imports.)
Summary
- `starred-embeddings` tests to fake the sqlite-vec module

Testing

- `pytest -q`
- `pytest --cov=github_to_sqlite --cov-branch -q`

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28
Summary by Sourcery
Add coverage for utility helpers and embedding features, introduce new CLI commands for database migrations and starred repository embeddings, integrate sqlite-vec for vector storage, and update dependencies, documentation, and CI to support these changes.
New Features:
Enhancements:
CI:
Documentation:
Tests: