Add coverage for utility helpers #3

Open
JeffCarpenter wants to merge 1 commit into main from oanqia-codex/plan-feature-to-infer-embeddings-for-starred-repos

Conversation

@JeffCarpenter (Owner) commented Jun 12, 2025

Summary

  • add missing tests for vector_to_blob, parse_build_file, directory_tree, and sqlite-vec loading
  • patch starred-embeddings tests to fake the sqlite-vec module
  • note the new tests in PLAN

Testing

  • pytest -q
  • pytest --cov=github_to_sqlite --cov-branch -q

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28

Summary by Sourcery

Add coverage for utility helpers and embedding features, introduce new CLI commands for database migrations and starred repository embeddings, integrate sqlite-vec for vector storage, and update dependencies, documentation, and CI to support these changes.

New Features:

  • Add starred-embeddings CLI command to compute and store embeddings for starred repositories
  • Add migrate CLI command to create optional tables, views, and indexes
  • Introduce utility functions for build file detection, vector conversion, build file parsing, directory tree generation, and README chunking
  • Add support for loading a Hugging Face tokenizer via load_tokenizer

Enhancements:

  • Integrate optional sqlite-vec extension for efficient vector storage and search
  • Ensure embedding tables are created automatically during database setup
  • Extend configuration with default model and ONNX provider settings
  • Implement BasicSentencizerChunker and SimpleChunker for text chunking

CI:

  • Add new runtime and test dependencies in setup.py and extras_require
  • Update CI workflow to install docs dependencies, run tests with coverage, and build Sphinx docs

Documentation:

  • Update README with migration and starred-embeddings usage
  • Add Sphinx configuration and initial docs for embeddings and migrations

Tests:

  • Add unit tests for utility helpers (vector_to_blob, find_build_files, parse_build_file, directory_tree, chunk_readme, _maybe_load_sqlite_vec)
  • Add tests for new CLI commands (starred-embeddings, migrate) and chunker classes
  • Add pytest-cov for coverage reporting

Copilot AI review requested due to automatic review settings June 12, 2025 22:21

sourcery-ai bot commented Jun 12, 2025

Reviewer's Guide

This PR extends the github_to_sqlite package with new utility helpers for vector embedding storage, file parsing and chunking; integrates a new CLI command for generating embeddings on starred repositories (plus a migration command); updates documentation, setup dependencies and CI; and provides comprehensive tests covering all additions.

Sequence Diagram for starred-embeddings Command

sequenceDiagram
    actor User
    participant CLI as github-to-sqlite
    participant GitHub
    participant SentenceTransformer
    participant DB as SQLite Database

    User->>CLI: executes `starred-embeddings`
    CLI->>GitHub: fetch_all_starred(token)
    GitHub-->>CLI: returns starred repos
    loop for each repo
        CLI->>GitHub: fetch_readme(repo)
        GitHub-->>CLI: returns README content
        CLI->>CLI: chunk_readme(readme)
        CLI->>SentenceTransformer: encode(texts)
        SentenceTransformer-->>CLI: returns vectors
        CLI->>DB: upsert repo_embeddings
        CLI->>DB: upsert readme_chunk_embeddings
        CLI->>CLI: find_build_files(repo_path)
        CLI->>DB: upsert repo_build_files
        CLI->>DB: upsert repo_metadata
    end

Entity Relationship Diagram for New Embedding Tables

erDiagram
    repos {
        int id PK
        string full_name
    }
    repo_embeddings {
        int repo_id PK, FK
        bytes title_embedding
        bytes description_embedding
        bytes readme_embedding
    }
    readme_chunk_embeddings {
        int repo_id PK, FK
        int chunk_index PK
        string chunk_text
        bytes embedding
    }
    repo_build_files {
        int repo_id PK, FK
        string file_path PK
        string metadata
    }
    repo_metadata {
        int repo_id PK, FK
        string language
        string directory_tree
    }

    repos ||--o{ repo_embeddings : "has one"
    repos ||--|{ readme_chunk_embeddings : "has many"
    repos ||--|{ repo_build_files : "has many"
    repos ||--o{ repo_metadata : "has one"

Class Diagram for Chunker Implementations

classDiagram
    direction LR

    class BaseChunker {
        <<Abstract>>
        +name: str
        +splitter: BaseSplitter
        +__call__(docs: List[str]) List[List[Chunk]]
    }

    class SimpleChunker {
        +target_length: int
        +__call__(docs: List[str]) List[List[Chunk]]
        +_split(doc: str) List[str]
        +_chunk(sentences: List[str]) List[Chunk]
    }

    class BasicSentencizerChunker {
        +period_token: str
        +chunk(tokens, vectors) List[List[Any]]
    }

    class Chunk {
        <<Dataclass>>
        +content: str
        +token_count: int
        +is_triggered: bool
        +triggered_score: float
    }

    BaseChunker <|-- SimpleChunker
    SimpleChunker ..> Chunk : creates

File-Level Changes

Change: Add utility helpers for embedding storage, chunking and file parsing
  • Implement _maybe_load_sqlite_vec to detect and load the sqlite-vec extension
  • Add ensure_embedding_tables to create embedding and metadata tables
  • Introduce chunk_readme with optional semantic-chunkers fallback
  • Add find_build_files using fd/find/os.walk in preference order
  • Implement vector_to_blob, parse_build_file, and directory_tree helpers
Files: github_to_sqlite/utils.py

Change: Introduce starred-embeddings and migration CLI commands
  • Define 'starred-embeddings' command to fetch stars, chunk READMEs, run embeddings and upsert results
  • Implement '--model', '--force' and '--verbose' options in CLI
  • Add 'migrate' command to bootstrap optional embedding tables and indexes
Files: github_to_sqlite/cli.py

Change: Update setup, documentation and CI to support new features
  • Extend setup.py with embedding, chunker and tokenizer dependencies and extras
  • Add README sections for migrations, build file detection and the new embeddings command
  • Update GitHub Actions workflow to install docs deps, run coverage and build docs
  • Add PLAN.md to document the embedding feature rollout
Files: setup.py, README.md, .github/workflows/test.yml, PLAN.md

Change: Add tests covering new helpers, chunkers and CLI commands
  • Unit tests for vector_to_blob, parse_build_file, directory_tree and _maybe_load_sqlite_vec
  • Tests for chunk_readme fallback and semantic-chunkers integration
  • Tests for find_build_files under fd/find/walk scenarios
  • Tests for SimpleChunker and BasicSentencizerChunker behavior
  • Integration tests for starred-embeddings and migrate commands
Files: tests/test_utils_functions.py, tests/test_chunk_readme.py, tests/test_find_build_files.py, github_to_sqlite/simple_chunker.py, tests/test_simple_chunker.py, github_to_sqlite/sentencizer_chunker.py, tests/test_sentencizer_chunker.py, tests/test_starred_embeddings_command.py, tests/test_migrate_command.py



@sourcery-ai sourcery-ai bot left a comment


Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:

  • Add import re at the top of utils.py so chunk_readme's fallback doesn’t fail due to a missing re reference.
  • Instead of silently swallowing all exceptions in parse_build_file, log or differentiate parsing errors so malformed build files don't hide failures.
  • Consider moving heavy libraries (sentence-transformers, sqlite-vec, nltk, onnx, tokenizers) from core install_requires into optional extras to reduce the base install footprint.

## Individual Comments

### Comment 1
<location> `github_to_sqlite/utils.py:1053` </location>
<code_context>
+                ["fd", "-HI", "-t", "f", pattern, path],
+                capture_output=True,
+                text=True,
+                check=True,
+            )
+            for line in result.stdout.splitlines():
</code_context>

<issue_to_address>
Using check=True will propagate errors on no matches

Since `fd` returns a non-zero exit code when no files match, this will raise an exception. Use `check=False` and handle the exit code to support empty results.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
            result = subprocess.run(
                ["fd", "-HI", "-t", "f", pattern, path],
                capture_output=True,
                text=True,
                check=True,
            )
            for line in result.stdout.splitlines():
                line = line.strip()
                if line:
                    found.append(os.path.normpath(line))
=======
            result = subprocess.run(
                ["fd", "-HI", "-t", "f", pattern, path],
                capture_output=True,
                text=True,
                check=False,
            )
            # fd returns exit code 1 if no files match; treat this as empty result, not error
            if result.returncode not in (0, 1):
                result.check_returncode()
            for line in result.stdout.splitlines():
                line = line.strip()
                if line:
                    found.append(os.path.normpath(line))
>>>>>>> REPLACE

</suggested_fix>

### Comment 2
<location> `github_to_sqlite/cli.py:730` </location>
<code_context>
+            )
+
+        # Build file metadata
+        for build_path in utils.find_build_files(repo["full_name"]):
+            metadata = utils.parse_build_file(os.path.join(repo["full_name"], build_path))
+            db["repo_build_files"].upsert(
</code_context>

<issue_to_address>
find_build_files invoked on GitHub repo name, not a local path

`find_build_files` requires a local directory, but `repo["full_name"]` is a GitHub identifier. This will fail unless the repo is cloned. Please ensure the repo is available locally or adjust the file retrieval method.
</issue_to_address>
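
One way to adjust the retrieval method, sketched purely as an illustration: list the repository root through the GitHub contents API instead of scanning a local checkout. The helper name, the BUILD_FILE_NAMES set, and the choice to inspect only the root directory are illustrative assumptions, not part of this PR.

import requests

# Hypothetical alternative to a local scan: ask the GitHub contents API
# which files sit in the repository root and keep the known build files.
BUILD_FILE_NAMES = {"package.json", "pyproject.toml", "setup.py", "Cargo.toml"}

def find_build_files_remote(full_name, token=None):
    headers = {"Authorization": f"token {token}"} if token else {}
    response = requests.get(
        f"https://api.github.com/repos/{full_name}/contents/", headers=headers
    )
    response.raise_for_status()
    return [
        item["path"]
        for item in response.json()
        if item["type"] == "file" and item["name"] in BUILD_FILE_NAMES
    ]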

### Comment 3
<location> `setup.py:29` </location>
<code_context>
     """,
-    install_requires=["sqlite-utils>=2.7.2", "requests", "PyYAML"],
-    extras_require={"test": ["pytest", "requests-mock", "bs4"]},
+    install_requires=[
+        "sqlite-utils>=2.7.2",
+        "requests",
</code_context>

<issue_to_address>
Heavy dependencies in default install_requires

Consider moving large dependencies to `extras_require` to reduce the default installation footprint.
</issue_to_address>
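
For concreteness, a minimal sketch of the split being suggested; the extra name and the exact package lists are illustrative, not this PR's actual dependency set.

from setuptools import setup

setup(
    name="github-to-sqlite",
    # Keep the default install light...
    install_requires=["sqlite-utils>=2.7.2", "requests", "PyYAML"],
    # ...and put the heavy embedding stack behind an extra, installed
    # with: pip install github-to-sqlite[embeddings]
    extras_require={
        "embeddings": [
            "sentence-transformers",
            "sqlite-vec",
            "nltk",
            "onnx",
            "tokenizers",
        ],
        "test": ["pytest", "pytest-cov", "requests-mock", "bs4"],
    },
)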

### Comment 4
<location> `github_to_sqlite/simple_chunker.py:7` </location>
<code_context>
+from .config import config
+
+try:
+    from pydantic.v1 import BaseModel, Extra, validator
+    from colorama import Fore, Style
+    from semantic_router.encoders.base import DenseEncoder
</code_context>

<issue_to_address>
Importing from pydantic.v1 may break under Pydantic v2

Since `pydantic.v1` is not present in v2, update your imports to use `pydantic` directly or restrict the version requirement to <2.0.
</issue_to_address>
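
One defensive pattern, shown only as a sketch and assuming the module needs nothing beyond these three names: try the v1 compatibility namespace first and fall back to the plain pydantic namespace.

# Tolerates either Pydantic major version; the rest of the module is unchanged.
try:
    from pydantic.v1 import BaseModel, Extra, validator
except ImportError:
    from pydantic import BaseModel, Extra, validator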

### Comment 5
<location> `tests/test_utils_functions.py:20` </location>
<code_context>
+    assert arr.tolist() == [1.0, 2.0, 3.5]
+
+
+def test_parse_build_file_json_and_toml(tmp_path):
+    json_file = tmp_path / "package.json"
+    json_file.write_text('{"name": "pkg", "author": "me"}')
</code_context>

<issue_to_address>
Test error handling for `parse_build_file`.

Add tests for cases like non-existent file paths and malformed data to ensure the function returns an empty dictionary as intended.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_parse_build_file_json_and_toml(tmp_path):
    json_file = tmp_path / "package.json"
    json_file.write_text('{"name": "pkg", "author": "me"}')
    toml_file = tmp_path / "pyproject.toml"
    toml_file.write_text('name = "pkg"\nauthor = "you"')

    assert utils.parse_build_file(str(json_file)) == {"name": "pkg", "author": "me"}
    result = utils.parse_build_file(str(toml_file))
    assert result["name"] == "pkg"
    assert result["author"] == "you"
=======
def test_parse_build_file_json_and_toml(tmp_path):
    json_file = tmp_path / "package.json"
    json_file.write_text('{"name": "pkg", "author": "me"}')
    toml_file = tmp_path / "pyproject.toml"
    toml_file.write_text('name = "pkg"\nauthor = "you"')

    assert utils.parse_build_file(str(json_file)) == {"name": "pkg", "author": "me"}
    result = utils.parse_build_file(str(toml_file))
    assert result["name"] == "pkg"
    assert result["author"] == "you"

def test_parse_build_file_error_handling(tmp_path):
    # Non-existent file
    non_existent_file = tmp_path / "does_not_exist.json"
    assert utils.parse_build_file(str(non_existent_file)) == {}

    # Malformed JSON
    bad_json_file = tmp_path / "bad.json"
    bad_json_file.write_text('{"name": "pkg", "author": }')
    assert utils.parse_build_file(str(bad_json_file)) == {}

    # Malformed TOML
    bad_toml_file = tmp_path / "bad.toml"
    bad_toml_file.write_text('name = "pkg"\nauthor = ')
    assert utils.parse_build_file(str(bad_toml_file)) == {}
>>>>>>> REPLACE

</suggested_fix>

### Comment 6
<location> `tests/test_utils_functions.py:42` </location>
<code_context>
+    assert tree["sub"] == {"dirs": [], "files": ["file.txt"]}
+
+
+def test_maybe_load_sqlite_vec(monkeypatch):
+    db = sqlite_utils.Database(memory=True)
+
</code_context>

<issue_to_address>
Cover the case where `sqlite_vec.load()` fails.

Add a test that mocks `sqlite_vec.load` to raise an exception and verifies that `_maybe_load_sqlite_vec` returns `False`.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_maybe_load_sqlite_vec(monkeypatch):
    db = sqlite_utils.Database(memory=True)

=======
def test_maybe_load_sqlite_vec(monkeypatch):
    db = sqlite_utils.Database(memory=True)

def test_maybe_load_sqlite_vec_load_failure(monkeypatch):
    # Arrange: mock sqlite_vec.load to raise an exception
    import github_to_sqlite.utils as utils_mod

    def mock_load(db):
        raise RuntimeError("load failed")

    monkeypatch.setattr(utils_mod.sqlite_vec, "load", mock_load)
    # Act & Assert: _maybe_load_sqlite_vec should return False
    assert utils_mod._maybe_load_sqlite_vec(db) is False
>>>>>>> REPLACE

</suggested_fix>


Comment on lines +1049 to +1058
result = subprocess.run(
["fd", "-HI", "-t", "f", pattern, path],
capture_output=True,
text=True,
check=True,
)
for line in result.stdout.splitlines():
line = line.strip()
if line:
found.append(os.path.normpath(line))

suggestion (bug_risk): Using check=True will propagate errors on no matches

Since fd returns a non-zero exit code when no files match, this will raise an exception. Use check=False and handle the exit code to support empty results.

Suggested change (the block above becomes):

            result = subprocess.run(
                ["fd", "-HI", "-t", "f", pattern, path],
                capture_output=True,
                text=True,
                check=False,
            )
            # fd returns exit code 1 if no files match; treat this as empty result, not error
            if result.returncode not in (0, 1):
                result.check_returncode()
            for line in result.stdout.splitlines():
                line = line.strip()
                if line:
                    found.append(os.path.normpath(line))

)

# Build file metadata
for build_path in utils.find_build_files(repo["full_name"]):

issue (bug_risk): find_build_files invoked on GitHub repo name, not a local path

find_build_files requires a local directory, but repo["full_name"] is a GitHub identifier. This will fail unless the repo is cloned. Please ensure the repo is available locally or adjust the file retrieval method.

""",
install_requires=["sqlite-utils>=2.7.2", "requests", "PyYAML"],
extras_require={"test": ["pytest", "requests-mock", "bs4"]},
install_requires=[

suggestion: Heavy dependencies in default install_requires

Consider moving large dependencies to extras_require to reduce the default installation footprint.

from .config import config

try:
from pydantic.v1 import BaseModel, Extra, validator

issue (bug_risk): Importing from pydantic.v1 may break under Pydantic v2

Since pydantic.v1 is not present in v2, update your imports to use pydantic directly or restrict the version requirement to <2.0.

Comment on lines +20 to +29
def test_parse_build_file_json_and_toml(tmp_path):
json_file = tmp_path / "package.json"
json_file.write_text('{"name": "pkg", "author": "me"}')
toml_file = tmp_path / "pyproject.toml"
toml_file.write_text('name = "pkg"\nauthor = "you"')

assert utils.parse_build_file(str(json_file)) == {"name": "pkg", "author": "me"}
result = utils.parse_build_file(str(toml_file))
assert result["name"] == "pkg"
assert result["author"] == "you"

suggestion (testing): Test error handling for parse_build_file.

Add tests for cases like non-existent file paths and malformed data to ensure the function returns an empty dictionary as intended.

Suggested change (the test above becomes, with an added error-handling test):

def test_parse_build_file_json_and_toml(tmp_path):
    json_file = tmp_path / "package.json"
    json_file.write_text('{"name": "pkg", "author": "me"}')
    toml_file = tmp_path / "pyproject.toml"
    toml_file.write_text('name = "pkg"\nauthor = "you"')

    assert utils.parse_build_file(str(json_file)) == {"name": "pkg", "author": "me"}
    result = utils.parse_build_file(str(toml_file))
    assert result["name"] == "pkg"
    assert result["author"] == "you"


def test_parse_build_file_error_handling(tmp_path):
    # Non-existent file
    non_existent_file = tmp_path / "does_not_exist.json"
    assert utils.parse_build_file(str(non_existent_file)) == {}

    # Malformed JSON
    bad_json_file = tmp_path / "bad.json"
    bad_json_file.write_text('{"name": "pkg", "author": }')
    assert utils.parse_build_file(str(bad_json_file)) == {}

    # Malformed TOML
    bad_toml_file = tmp_path / "bad.toml"
    bad_toml_file.write_text('name = "pkg"\nauthor = ')
    assert utils.parse_build_file(str(bad_toml_file)) == {}

Comment on lines +77 to +78
if "repo_metadata" not in tables:
db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id")

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation: Avoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests
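
As a hedged illustration of the parametrization route (the migrated_db fixture is hypothetical; the table names come from this PR's ER diagram):

import pytest

# One test case per expected table, instead of an if/else inside a single test body.
@pytest.mark.parametrize(
    "table",
    ["repo_embeddings", "readme_chunk_embeddings", "repo_build_files", "repo_metadata"],
)
def test_migrate_creates_table(migrated_db, table):
    assert table in migrated_db.table_names()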

default="auth.json",
help="Path to auth.json token file",
)
def starred_embeddings(db_path, model, force, verbose, auth):

issue (code-quality): Low code quality found in starred_embeddings - 17% (low-code-quality)


Explanation: The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
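
A rough sketch of the guard-clause shape being recommended; every name below is hypothetical and does not reflect the command's real structure.

def embed_repo(db, repo, model, force=False):
    # Guard clauses return early so the happy path reads top to bottom.
    if not force and repo_already_embedded(db, repo):
        return
    readme = fetch_readme(repo)
    if not readme:
        return
    vectors = model.encode(chunk_readme(readme))
    store_embeddings(db, repo, vectors)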

Comment on lines +54 to +56
if v is None:
return DenseEncoder(name="default")
return v

suggestion (code-quality): We've found these issues:

Suggested change (the three lines above become):

return DenseEncoder(name="default") if v is None else v

check=True,
)
for line in result.stdout.splitlines():
line = line.strip()

issue (code-quality): Use named expression to simplify assignment and conditional [×2] (use-named-expression)
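
Concretely, the strip-then-test pair in the loop above can collapse into one assignment expression, roughly:

for raw in result.stdout.splitlines():
    # Strip and test in a single step with the walrus operator.
    if line := raw.strip():
        found.append(os.path.normpath(line))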

Comment on lines +41 to +50
if cmd == 'find':
return '/usr/bin/find'
return None

def fake_run(args, capture_output, text, check):
pattern = args[3]
calls.append((args[0], pattern))
class R:
def __init__(self):
self.stdout = 'repo/' + pattern + '\n'

issue (code-quality): We've found these issues:


Copilot AI left a comment


Pull Request Overview

Adds missing tests for core utilities, expands setup dependencies, and implements embedding support with a new CLI command and documentation updates.

  • Add CLI command starred-embeddings with optional sqlite-vec support and directory metadata.
  • Introduce utility helpers: vector_to_blob, parse_build_file, directory_tree, and optional sqlite-vec loading.
  • Expand tests for migration, build-file discovery, database shape, config loading, and readme chunking.
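
None of those helpers are shown on this page; purely for orientation, here is a sketch of what vector_to_blob plausibly does, consistent with the float32 round-trip the new tests assert (an assumption, not the PR's actual code).

import numpy as np

def vector_to_blob(vec) -> bytes:
    # Pack the embedding as contiguous float32 bytes so it can be stored
    # in a SQLite BLOB column (or handed to sqlite-vec).
    return np.asarray(vec, dtype=np.float32).tobytes()

# Round trip mirrors the test: frombuffer recovers the original values.
blob = vector_to_blob([1.0, 2.0, 3.5])
assert np.frombuffer(blob, dtype=np.float32).tolist() == [1.0, 2.0, 3.5]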

Reviewed Changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.

Reviewed files:
  • tests/test_migrate_command.py: Test that migrate CLI creates expected embedding tables
  • tests/test_find_build_files.py: Tests for find_build_files() using fd, find, and walk
  • tests/test_embedding_tables.py: Verify embedding tables are created by ensure_db_shape()
  • tests/test_config.py: Test loading tokenizer via env var and tokenizers monkeypatch
  • tests/test_chunk_readme.py: Tests for chunk_readme() fallback and optional chunker import
  • setup.py: Add runtime and extra dependencies for embeddings and docs
  • github_to_sqlite/utils.py: Add embedding table creation, sqlite-vec loader, helpers
  • github_to_sqlite/tokenization.py: Tokenizer loading with env/CLI fallback
  • github_to_sqlite/simple_chunker.py: Implement a Pydantic-based chunker with NLTK fallback
  • github_to_sqlite/sentencizer_chunker.py: Basic chunker splitting on period tokens
  • github_to_sqlite/config.py: Add pydantic-based default config values
  • github_to_sqlite/cli.py: Add starred-embeddings command with vector storage logic
  • docs/*.rst: Documentation updates for migrations, embeddings, and index
  • .github/workflows/test.yml: CI updated to run coverage and build docs
Comments suppressed due to low confidence (2)

github_to_sqlite/utils.py:1087

  • There are no unit tests covering vector_to_blob, parse_build_file, directory_tree, or _maybe_load_sqlite_vec. Consider adding tests for these helpers to ensure reliable behavior.
def vector_to_blob(vec) -> bytes:

github_to_sqlite/utils.py:751

  • [nitpick] Embedding dimension is hard-coded to 768; if a different model is used, this may cause a mismatch. Consider making the vector size configurable or detecting it at runtime.
                    title_embedding float[768],
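
A sketch of the runtime-detection idea, assuming a sentence-transformers model object and a sqlite-utils db handle are in scope, and that the target is the vec0 virtual table this snippet comes from; the DDL below is illustrative, not the PR's exact statement.

# SentenceTransformer reports its output width, so the column size
# does not have to be hard-coded to 768.
dim = model.get_sentence_embedding_dimension()
columns = ", ".join(
    f"{name} float[{dim}]"
    for name in ("title_embedding", "description_embedding", "readme_embedding")
)
db.execute(f"CREATE VIRTUAL TABLE IF NOT EXISTS repo_embeddings USING vec0({columns})")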

Comment on lines +732 to +734
except Exception:
_SQLITE_VEC_LOADED = False
except Exception:

Copilot AI Jun 12, 2025


Catching all exceptions here can mask unrelated errors; consider catching ImportError and any specific exceptions raised by sqlite_vec.load instead.

Suggested change (the handlers above become):

except sqlite_vec.LoadError:
    _SQLITE_VEC_LOADED = False
except ImportError:

return json.load(fp)
with open(path, "rb") as fp:
return tomllib.load(fp)
except Exception:

Copilot AI Jun 12, 2025


This broad except silently ignores all errors when parsing build files; narrowing to json.JSONDecodeError, tomllib.TOMLDecodeError, and FileNotFoundError would avoid masking other issues.

Suggested change (the bare except above becomes):

except (json.JSONDecodeError, tomllib.TOMLDecodeError, FileNotFoundError) as e:
    print(f"Error parsing build file {path}: {e}")

click.echo("sqlite-vec extension not loaded; storing embeddings as BLOBs")

env_model = os.environ.get("GITHUB_TO_SQLITE_MODEL")
model_name = model or env_model or config.config.default_model

Copilot AI Jun 12, 2025


[nitpick] Referencing config.config.default_model is confusing because both the module and instance are named config. Import the instance directly (from github_to_sqlite.config import config) to use config.default_model.

Suggested change:

model_name = model or env_model or config.default_model

Comment on lines +1031 to +1035
import os
import subprocess
import shutil



Copilot AI Jun 12, 2025


[nitpick] Imports for os, subprocess, and shutil are mid-file; consider grouping all imports at the top of the module for clarity.

Suggested change: remove these mid-file import lines (they would move to the module's top-level imports instead).
