Refactor utils and improve tokenizer handling by JeffCarpenter · Pull Request #13 · JeffCarpenter/github-to-sqlite

JeffCarpenter · 2025-08-04T11:27:52Z

Summary

split build file and embedding helpers into new modules
use AutoTokenizer for broader compatibility and update tests
handle fd/find errors with warnings
cache sqlite-vec loading per-connection
fix CLI missing import

Testing

ruff check --fix .
pytest --cov=github_to_sqlite --cov-branch -q
mypy github_to_sqlite

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28

Summary by Sourcery

Refactor utility code into separate modules and introduce comprehensive support for generating and storing embeddings for starred repositories, along with improved tokenizer and build-file handling, enhanced database setup, updated CI, and documentation.

New Features:

Add starred-embeddings CLI command to compute and store title, description, README chunk embeddings, build metadata, and repo metadata for starred repositories
Add migrate CLI command to initialize optional embedding tables, FTS, views, and foreign keys without running main workflows
Introduce tokenization.load_tokenizer using Hugging Face AutoTokenizer for consistent tokenizer loading
Implement find_build_files helper to locate and parse common build definition files using fd, find, or os.walk with warning handling
Add chunk_readme utility to split README text via semantic_chunkers or blank-line fallback

Bug Fixes:

Catch and warn on fd/find subprocess failures instead of erroring out
Fix missing imports and type annotations in CLI commands to correctly cast sqlite-utils tables

Enhancements:

Split monolithic utils.py into dedicated modules: embedding_utils, build_files, tokenization, config, simple_chunker, and sentencizer_chunker
Cache sqlite-vec extension loading per database connection to avoid repeated extension loads
Extend ensure_db_shape to automatically create and configure embedding tables alongside FTS, views, and foreign keys

Build:

Update setup.py to include new optional dependencies for semantic_chunkers, sentence-transformers, sqlite-vec, NLP tooling, and documentation

CI:

Enhance CI workflow to install docs extras, run ruff lint checks, mypy type checks, and build Sphinx documentation

Documentation:

Revise README to document migrations and starred-embeddings usage
Add Sphinx conf and initial rst pages for migrations and embeddings
Include PLAN.md outlining the embedding feature plan

Tests:

Add extensive tests for build file discovery, embedding table utilities, chunk_readme, tokenization loader, simple and sentencizer chunkers, starred-embeddings command, migration command, and embedding integration tests

sourcery-ai · 2025-08-04T11:27:57Z

Reviewer's Guide

Refactors utility code into dedicated modules and implements a new embeddings pipeline with CLI commands for starred repositories and migrations, switches to AutoTokenizer, enhances build file discovery and chunking, caches sqlite-vec loading, and updates dependencies, CI, documentation, and tests.

Class diagram for new and refactored utility modules

classDiagram
    class Config {
      +str default_model
      +str onnx_provider
      +int max_length
      +int embedding_dim
      +list[str] build_patterns
    }
    class Token {
      +int id
      +str value
      +tuple[int, int] offsets
    }
    class Chunk {
      +str content
      +int token_count
      +bool is_triggered
      +float triggered_score
    }
    class BaseSplitter {
      +__call__(doc: str) List[str]
    }
    class BaseChunker {
      +str name
      +BaseSplitter splitter
      +DenseEncoder encoder
      +__call__(docs: List[str]) List[List[Chunk]]
      +_split(doc: str) List[str]
      +_chunk(splits: List[Any]) List[Chunk]
      +print(document_splits: List[Chunk])
    }
    class SimpleChunker {
      +int target_length
      +__call__(docs: List[str]) List[List[Chunk]]
      +_split(doc: str) List[str]
      +_chunk(sentences: List[str]) List[Chunk]
    }
    class BasicSentencizerChunker {
      +str period_token
      +chunk(tokens: Sequence[str], vectors: Sequence[Any]) List[List[Any]]
      +__call__ = chunk
    }
    class Tokenizer
    class AutoTokenizer
    class DenseEncoder

    SimpleChunker --|> BaseChunker
    BaseChunker o-- BaseSplitter
    BaseChunker o-- DenseEncoder
    Tokenization "1" --o "*" Token
    AutoTokenizer <|-- Tokenizer

File-Level Changes

Change	Details	Files
Refactor utility functions into dedicated modules	Extract build file discovery and parsing into build_files.py Move sqlite-vec loading and embedding table management into embedding_utils.py Create tokenization.py for AutoTokenizer loading logic Introduce simple_chunker.py, sentencizer_chunker.py, and config.py for chunking and configuration	`github_to_sqlite/utils.py` `github_to_sqlite/build_files.py` `github_to_sqlite/embedding_utils.py` `github_to_sqlite/tokenization.py` `github_to_sqlite/simple_chunker.py` `github_to_sqlite/sentencizer_chunker.py` `github_to_sqlite/config.py`
Implement 'starred-embeddings' command and embeddings pipeline	Add generate_starred_embeddings in utils to fetch stars, chunk readmes, infer vectors, and upsert embedding tables Introduce starred-embeddings CLI command and migrate command in cli.py Update ensure_db_shape to call ensure_embedding_tables for migration support	`github_to_sqlite/utils.py` `github_to_sqlite/cli.py` `github_to_sqlite/embedding_utils.py`
Improve build file search with error warnings and pattern fallback	find_build_files now uses fd/find with warnings on errors and falls back to os.walk Support custom glob patterns and normalize paths in _post_process_build_files Deduplicate and filter junk paths before returning results	`github_to_sqlite/build_files.py`
Switch to AutoTokenizer for loading tokenizers	Add load_tokenizer using AutoTokenizer.from_pretrained with environment fallback Raise clear RuntimeError on tokenizer load failures Add tests for tokenizer loading and error scenarios	`github_to_sqlite/tokenization.py` `tests/test_config.py`
Cache sqlite-vec extension loading per connection	Implement _maybe_load_sqlite_vec with per-connection caching Use cached result in ensure_embedding_tables to create virtual or regular tables Add tests for sqlite-vec load success and failure	`github_to_sqlite/embedding_utils.py` `tests/test_utils_functions.py` `tests/test_embedding_tables.py`
Enhance README chunking with semantic and blank-line fallback	Add chunk_readme to utils to use StatisticalChunker when available Fallback to splitting on blank lines without optional dependency Add tests for chunk_readme behavior	`github_to_sqlite/utils.py` `tests/test_chunk_readme.py`
Extend dependencies and CI workflows for testing, linting, and docs	Update setup.py with extras_require for test, docs, semantic_chunkers, vector-search, and GPU Modify GitHub Actions to install docs deps, run ruff, mypy, and build Sphinx docs Adjust CI commands to include coverage flags and linting	`setup.py` `.github/workflows/test.yml`
Expand test suite to cover new modules and commands	Add unit and integration tests for starred-embeddings, build file discovery, migration, and chunkers Update existing tests to include embedding and metadata tables Ensure coverage runs for new code paths	`tests/`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

Copilot

Pull Request Overview

This PR introduces a comprehensive embeddings feature for starred repositories while improving tokenizer handling and refactoring utility code. The changes enable generation and storage of sentence-transformer embeddings for repository metadata, with proper chunking of README content and build file metadata extraction.

Add starred-embeddings CLI command for computing repository embeddings using sentence-transformers
Refactor utilities into focused modules (build files, embeddings, tokenization, chunking)
Switch to AutoTokenizer for broader model compatibility and add better error handling

Reviewed Changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
tests/test_workflows.py	Remove unused import for ForeignKey
tests/test_stargazers.py	Remove unused import for ForeignKey
tests/test_repos.py	Remove unused import and duplicate pytest import
tests/*.py (new files)	Add comprehensive test coverage for new modules and CLI commands
setup.py	Add new dependency groups for testing, vector search, chunking, and documentation
mypy.ini	Add mypy configuration for type checking
github_to_sqlite/utils.py	Add embedding generation function and refactor imports from new modules
github_to_sqlite/*.py (new modules)	Add modular code for tokenization, chunking, build files, and embedding management
docs/*.rst	Add Sphinx documentation for embeddings and migrations
.github/workflows/test.yml	Update CI to run additional tools and build documentation
README.md	Add documentation for new migration and embedding features

Comments suppressed due to low confidence (1)

Copilot · 2025-08-04T11:30:50Z

tests/test_repos.py

@@ -2,11 +2,9 @@
 import pytest


Duplicate import statement for pytest. Remove one of the pytest imports as it's imported on line 2.

Copilot · 2025-08-04T11:30:51Z

github_to_sqlite/cli.py

                repos_seen.add(pr_repo_url)
            utils.save_pull_requests(db, [pull_request], pr_repo)
    else:
+        from typing import Iterable, Any


[nitpick] Import statement inside function body. Consider moving this import to the top of the file with other imports for better code organization.

Suggested change

from typing import Iterable, Any

Copilot · 2025-08-04T11:30:52Z

github_to_sqlite/cli.py


-    def stop_when(commit):
+    from typing import Callable
+


[nitpick] Import statement inside function body. Consider moving this import to the top of the file with other imports for better code organization.

Suggested change

Copilot · 2025-08-04T11:30:52Z

github_to_sqlite/simple_chunker.py

+    Style = _Style
+
+from typing import TYPE_CHECKING
+


[nitpick] TYPE_CHECKING import should be grouped with other typing imports at the top of the file for better organization.

Suggested change

sourcery-ai

Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:

There are still many print statements sprinkled throughout the core functions—consider switching to the standard logging module so verbosity and error reporting can be controlled more flexibly.
The utils.py file has grown very large again; think about breaking out remaining HTTP/repo helpers into their own modules to keep each file’s responsibilities focused.
CLI code is repetitively casting tables via cast(sqlite_utils.db.Table, …); you could introduce a small helper to fetch a typed table instance and DRY up those calls.

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- There are still many print statements sprinkled throughout the core functions—consider switching to the standard logging module so verbosity and error reporting can be controlled more flexibly.
- The utils.py file has grown very large again; think about breaking out remaining HTTP/repo helpers into their own modules to keep each file’s responsibilities focused.
- CLI code is repetitively casting tables via cast(sqlite_utils.db.Table, …); you could introduce a small helper to fetch a typed table instance and DRY up those calls.

## Individual Comments

### Comment 1
<location> `github_to_sqlite/simple_chunker.py:135` </location>
<code_context>
+
+    def _chunk(self, sentences: List[str]) -> List[Chunk]:
+        chunks: List[Chunk] = []
+        for i in range(0, len(sentences), self.target_length):
+            piece = sentences[i : i + self.target_length]
+            if len(piece) < self.target_length:
</code_context>

<issue_to_address>
The chunking logic skips the final chunk if it is shorter than target_length.

Yield the final chunk even if it's shorter than target_length to avoid missing content.
</issue_to_address>

### Comment 2
<location> `github_to_sqlite/utils.py:1053` </location>
<code_context>
+                pk=("repo_id", "chunk_index"),
+            )
+
+        for build_path in find_build_files(repo["full_name"], patterns=patterns):
+            metadata = parse_build_file(os.path.join(repo["full_name"], build_path))
+            cast(sqlite_utils.db.Table, db["repo_build_files"]).upsert(
</code_context>

<issue_to_address>
find_build_files is called with repo["full_name"], which may not be a local path.

If repo["full_name"] is a GitHub identifier like 'owner/repo', it may not map to a local directory, which could cause find_build_files to fail. Please verify that a valid local path is used.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2025-08-04T11:32:38Z

github_to_sqlite/simple_chunker.py

+
+    def _chunk(self, sentences: List[str]) -> List[Chunk]:
+        chunks: List[Chunk] = []
+        for i in range(0, len(sentences), self.target_length):


suggestion (bug_risk): The chunking logic skips the final chunk if it is shorter than target_length.

Yield the final chunk even if it's shorter than target_length to avoid missing content.

sourcery-ai · 2025-08-04T11:32:38Z

github_to_sqlite/utils.py

+                pk=("repo_id", "chunk_index"),
+            )
+
+        for build_path in find_build_files(repo["full_name"], patterns=patterns):


issue (bug_risk): find_build_files is called with repo["full_name"], which may not be a local path.

If repo["full_name"] is a GitHub identifier like 'owner/repo', it may not map to a local directory, which could cause find_build_files to fail. Please verify that a valid local path is used.

sourcery-ai · 2025-08-04T11:32:38Z

tests/test_find_build_files.py

+        if pattern == 'package.json':
+            raise subprocess.CalledProcessError(1, args)


issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation
Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

sourcery-ai · 2025-08-04T11:32:38Z

tests/test_find_build_files.py

+        if cmd == 'find':
+            return '/usr/bin/find'


issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation
Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

sourcery-ai · 2025-08-04T11:32:39Z

tests/test_find_build_files.py

+        if pattern == 'package.json':
+            raise subprocess.CalledProcessError(1, args)


issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation
Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

sourcery-ai · 2025-08-04T11:32:39Z

tests/test_starred_embeddings_command.py

+        if "repo_metadata" not in tables:
+            db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id")


issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation
Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

sourcery-ai · 2025-08-04T11:32:39Z

github_to_sqlite/build_files.py

+            for full in pathlib.Path(path).rglob(pattern):
+                if full.is_file():
+                    found.append(str(full))


suggestion (code-quality): Replace a for append loop with list extend (for-append-to-extend)

Suggested change

for full in pathlib.Path(path).rglob(pattern):

if full.is_file():

found.append(str(full))

found.extend(

str(full)

for full in pathlib.Path(path).rglob(pattern)

if full.is_file()

)

sourcery-ai · 2025-08-04T11:32:39Z

github_to_sqlite/utils.py

    from bs4 import BeautifulSoup

-    url = "https://github.com/{}/network/dependents".format(repo)
+    url: str | None = "https://github.com/{}/network/dependents".format(repo)


suggestion (code-quality): Replace call to format with f-string (use-fstring-for-formatting)

Suggested change

url: str | None = "https://github.com/{}/network/dependents".format(repo)

url: str | None = f"https://github.com/{repo}/network/dependents"

sourcery-ai · 2025-08-04T11:32:39Z

github_to_sqlite/utils.py

        )
+
+
+def generate_starred_embeddings(


issue (code-quality): Low code quality found in generate_starred_embeddings - 15% (low-code-quality)

Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.

Reduce nesting, perhaps by introducing guard clauses to return early.

Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.

sourcery-ai · 2025-08-04T11:32:39Z

tests/test_find_build_files.py

+    calls = []
+
+    def fake_which(cmd):
+        if cmd == 'find':


issue (code-quality): We've found these issues:

Lift code into else after jump in control flow (reintroduce-else)

Replace if statement with if expression (assign-if-exp)

Use f-string instead of string concatenation (use-fstring-for-concatenation)

Co-authored-by: JeffCarpenter <178424+JeffCarpenter@users.noreply.github.com>

Refactor utils into modules and improve tokenizer

63e2b69

JeffCarpenter added the codex label Aug 4, 2025 — with ChatGPT Codex Connector

JeffCarpenter marked this pull request as ready for review August 4, 2025 11:29

Copilot AI review requested due to automatic review settings August 4, 2025 11:29

Copilot AI reviewed Aug 4, 2025

View reviewed changes

sourcery-ai bot reviewed Aug 4, 2025

View reviewed changes

Copilot AI mentioned this pull request Aug 10, 2025

Fix critical bugs and import organization issues from bot reviewer feedback #15

Merged

Copilot AI and others added 3 commits September 28, 2025 17:35

Fix critical bugs and import organization issues per bot review

719f77d

Co-authored-by: JeffCarpenter <178424+JeffCarpenter@users.noreply.github.com>

Complete bot reviewer comments refactoring with verification

67b0edb

Co-authored-by: JeffCarpenter <178424+JeffCarpenter@users.noreply.github.com>

Fix ImportError: update test import to use build_files module directly

b1f1805

Co-authored-by: JeffCarpenter <178424+JeffCarpenter@users.noreply.github.com>

		if pattern == 'package.json':
		raise subprocess.CalledProcessError(1, args)

		if "repo_metadata" not in tables:
		db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id")

	url: str \| None = "https://github.com/{}/network/dependents".format(repo)
	url: str \| None = f"https://github.com/{repo}/network/dependents"

Conversation

JeffCarpenter commented Aug 4, 2025 • edited by sourcery-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Summary by Sourcery

Uh oh!

sourcery-ai bot commented Aug 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Class diagram for new and refactored utility modules

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

JeffCarpenter commented Aug 4, 2025 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Aug 4, 2025 •

edited

Loading