Skip to content

Refactor utils and improve tokenizer handling#13

Open
JeffCarpenter wants to merge 4 commits intomainfrom
hefgje-codex/plan-feature-to-infer-embeddings-for-starred-repos
Open

Refactor utils and improve tokenizer handling#13
JeffCarpenter wants to merge 4 commits intomainfrom
hefgje-codex/plan-feature-to-infer-embeddings-for-starred-repos

Conversation

@JeffCarpenter
Copy link
Owner

@JeffCarpenter JeffCarpenter commented Aug 4, 2025

Summary

  • split build file and embedding helpers into new modules
  • use AutoTokenizer for broader compatibility and update tests
  • handle fd/find errors with warnings
  • cache sqlite-vec loading per-connection
  • fix CLI missing import

Testing

  • ruff check --fix .
  • pytest --cov=github_to_sqlite --cov-branch -q
  • mypy github_to_sqlite

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28

Summary by Sourcery

Refactor utility code into separate modules and introduce comprehensive support for generating and storing embeddings for starred repositories, along with improved tokenizer and build-file handling, enhanced database setup, updated CI, and documentation.

New Features:

  • Add starred-embeddings CLI command to compute and store title, description, README chunk embeddings, build metadata, and repo metadata for starred repositories
  • Add migrate CLI command to initialize optional embedding tables, FTS, views, and foreign keys without running main workflows
  • Introduce tokenization.load_tokenizer using Hugging Face AutoTokenizer for consistent tokenizer loading
  • Implement find_build_files helper to locate and parse common build definition files using fd, find, or os.walk with warning handling
  • Add chunk_readme utility to split README text via semantic_chunkers or blank-line fallback

Bug Fixes:

  • Catch and warn on fd/find subprocess failures instead of erroring out
  • Fix missing imports and type annotations in CLI commands to correctly cast sqlite-utils tables

Enhancements:

  • Split monolithic utils.py into dedicated modules: embedding_utils, build_files, tokenization, config, simple_chunker, and sentencizer_chunker
  • Cache sqlite-vec extension loading per database connection to avoid repeated extension loads
  • Extend ensure_db_shape to automatically create and configure embedding tables alongside FTS, views, and foreign keys

Build:

  • Update setup.py to include new optional dependencies for semantic_chunkers, sentence-transformers, sqlite-vec, NLP tooling, and documentation

CI:

  • Enhance CI workflow to install docs extras, run ruff lint checks, mypy type checks, and build Sphinx documentation

Documentation:

  • Revise README to document migrations and starred-embeddings usage
  • Add Sphinx conf and initial rst pages for migrations and embeddings
  • Include PLAN.md outlining the embedding feature plan

Tests:

  • Add extensive tests for build file discovery, embedding table utilities, chunk_readme, tokenization loader, simple and sentencizer chunkers, starred-embeddings command, migration command, and embedding integration tests

@sourcery-ai
Copy link

sourcery-ai bot commented Aug 4, 2025

Reviewer's Guide

Refactors utility code into dedicated modules and implements a new embeddings pipeline with CLI commands for starred repositories and migrations, switches to AutoTokenizer, enhances build file discovery and chunking, caches sqlite-vec loading, and updates dependencies, CI, documentation, and tests.

Class diagram for new and refactored utility modules

classDiagram
    class Config {
      +str default_model
      +str onnx_provider
      +int max_length
      +int embedding_dim
      +list[str] build_patterns
    }
    class Token {
      +int id
      +str value
      +tuple[int, int] offsets
    }
    class Chunk {
      +str content
      +int token_count
      +bool is_triggered
      +float triggered_score
    }
    class BaseSplitter {
      +__call__(doc: str) List[str]
    }
    class BaseChunker {
      +str name
      +BaseSplitter splitter
      +DenseEncoder encoder
      +__call__(docs: List[str]) List[List[Chunk]]
      +_split(doc: str) List[str]
      +_chunk(splits: List[Any]) List[Chunk]
      +print(document_splits: List[Chunk])
    }
    class SimpleChunker {
      +int target_length
      +__call__(docs: List[str]) List[List[Chunk]]
      +_split(doc: str) List[str]
      +_chunk(sentences: List[str]) List[Chunk]
    }
    class BasicSentencizerChunker {
      +str period_token
      +chunk(tokens: Sequence[str], vectors: Sequence[Any]) List[List[Any]]
      +__call__ = chunk
    }
    class Tokenizer
    class AutoTokenizer
    class DenseEncoder

    SimpleChunker --|> BaseChunker
    BaseChunker o-- BaseSplitter
    BaseChunker o-- DenseEncoder
    Tokenization "1" --o "*" Token
    AutoTokenizer <|-- Tokenizer
Loading

File-Level Changes

Change Details Files
Refactor utility functions into dedicated modules
  • Extract build file discovery and parsing into build_files.py
  • Move sqlite-vec loading and embedding table management into embedding_utils.py
  • Create tokenization.py for AutoTokenizer loading logic
  • Introduce simple_chunker.py, sentencizer_chunker.py, and config.py for chunking and configuration
github_to_sqlite/utils.py
github_to_sqlite/build_files.py
github_to_sqlite/embedding_utils.py
github_to_sqlite/tokenization.py
github_to_sqlite/simple_chunker.py
github_to_sqlite/sentencizer_chunker.py
github_to_sqlite/config.py
Implement 'starred-embeddings' command and embeddings pipeline
  • Add generate_starred_embeddings in utils to fetch stars, chunk readmes, infer vectors, and upsert embedding tables
  • Introduce starred-embeddings CLI command and migrate command in cli.py
  • Update ensure_db_shape to call ensure_embedding_tables for migration support
github_to_sqlite/utils.py
github_to_sqlite/cli.py
github_to_sqlite/embedding_utils.py
Improve build file search with error warnings and pattern fallback
  • find_build_files now uses fd/find with warnings on errors and falls back to os.walk
  • Support custom glob patterns and normalize paths in _post_process_build_files
  • Deduplicate and filter junk paths before returning results
github_to_sqlite/build_files.py
Switch to AutoTokenizer for loading tokenizers
  • Add load_tokenizer using AutoTokenizer.from_pretrained with environment fallback
  • Raise clear RuntimeError on tokenizer load failures
  • Add tests for tokenizer loading and error scenarios
github_to_sqlite/tokenization.py
tests/test_config.py
Cache sqlite-vec extension loading per connection
  • Implement _maybe_load_sqlite_vec with per-connection caching
  • Use cached result in ensure_embedding_tables to create virtual or regular tables
  • Add tests for sqlite-vec load success and failure
github_to_sqlite/embedding_utils.py
tests/test_utils_functions.py
tests/test_embedding_tables.py
Enhance README chunking with semantic and blank-line fallback
  • Add chunk_readme to utils to use StatisticalChunker when available
  • Fallback to splitting on blank lines without optional dependency
  • Add tests for chunk_readme behavior
github_to_sqlite/utils.py
tests/test_chunk_readme.py
Extend dependencies and CI workflows for testing, linting, and docs
  • Update setup.py with extras_require for test, docs, semantic_chunkers, vector-search, and GPU
  • Modify GitHub Actions to install docs deps, run ruff, mypy, and build Sphinx docs
  • Adjust CI commands to include coverage flags and linting
setup.py
.github/workflows/test.yml
Expand test suite to cover new modules and commands
  • Add unit and integration tests for starred-embeddings, build file discovery, migration, and chunkers
  • Update existing tests to include embedding and metadata tables
  • Ensure coverage runs for new code paths
tests/

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@JeffCarpenter JeffCarpenter marked this pull request as ready for review August 4, 2025 11:29
Copilot AI review requested due to automatic review settings August 4, 2025 11:29
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a comprehensive embeddings feature for starred repositories while improving tokenizer handling and refactoring utility code. The changes enable generation and storage of sentence-transformer embeddings for repository metadata, with proper chunking of README content and build file metadata extraction.

  • Add starred-embeddings CLI command for computing repository embeddings using sentence-transformers
  • Refactor utilities into focused modules (build files, embeddings, tokenization, chunking)
  • Switch to AutoTokenizer for broader model compatibility and add better error handling

Reviewed Changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/test_workflows.py Remove unused import for ForeignKey
tests/test_stargazers.py Remove unused import for ForeignKey
tests/test_repos.py Remove unused import and duplicate pytest import
tests/*.py (new files) Add comprehensive test coverage for new modules and CLI commands
setup.py Add new dependency groups for testing, vector search, chunking, and documentation
mypy.ini Add mypy configuration for type checking
github_to_sqlite/utils.py Add embedding generation function and refactor imports from new modules
github_to_sqlite/*.py (new modules) Add modular code for tokenization, chunking, build files, and embedding management
docs/*.rst Add Sphinx documentation for embeddings and migrations
.github/workflows/test.yml Update CI to run additional tools and build documentation
README.md Add documentation for new migration and embedding features
Comments suppressed due to low confidence (1)

@@ -2,11 +2,9 @@
import pytest
Copy link

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate import statement for pytest. Remove one of the pytest imports as it's imported on line 2.

Copilot uses AI. Check for mistakes.
repos_seen.add(pr_repo_url)
utils.save_pull_requests(db, [pull_request], pr_repo)
else:
from typing import Iterable, Any
Copy link

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Import statement inside function body. Consider moving this import to the top of the file with other imports for better code organization.

Suggested change
from typing import Iterable, Any

Copilot uses AI. Check for mistakes.

def stop_when(commit):
from typing import Callable

Copy link

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Import statement inside function body. Consider moving this import to the top of the file with other imports for better code organization.

Suggested change

Copilot uses AI. Check for mistakes.
Style = _Style

from typing import TYPE_CHECKING

Copy link

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] TYPE_CHECKING import should be grouped with other typing imports at the top of the file for better organization.

Suggested change

Copilot uses AI. Check for mistakes.
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:

  • There are still many print statements sprinkled throughout the core functions—consider switching to the standard logging module so verbosity and error reporting can be controlled more flexibly.
  • The utils.py file has grown very large again; think about breaking out remaining HTTP/repo helpers into their own modules to keep each file’s responsibilities focused.
  • CLI code is repetitively casting tables via cast(sqlite_utils.db.Table, …); you could introduce a small helper to fetch a typed table instance and DRY up those calls.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- There are still many print statements sprinkled throughout the core functions—consider switching to the standard logging module so verbosity and error reporting can be controlled more flexibly.
- The utils.py file has grown very large again; think about breaking out remaining HTTP/repo helpers into their own modules to keep each file’s responsibilities focused.
- CLI code is repetitively casting tables via cast(sqlite_utils.db.Table, …); you could introduce a small helper to fetch a typed table instance and DRY up those calls.

## Individual Comments

### Comment 1
<location> `github_to_sqlite/simple_chunker.py:135` </location>
<code_context>
+
+    def _chunk(self, sentences: List[str]) -> List[Chunk]:
+        chunks: List[Chunk] = []
+        for i in range(0, len(sentences), self.target_length):
+            piece = sentences[i : i + self.target_length]
+            if len(piece) < self.target_length:
</code_context>

<issue_to_address>
The chunking logic skips the final chunk if it is shorter than target_length.

Yield the final chunk even if it's shorter than target_length to avoid missing content.
</issue_to_address>

### Comment 2
<location> `github_to_sqlite/utils.py:1053` </location>
<code_context>
+                pk=("repo_id", "chunk_index"),
+            )
+
+        for build_path in find_build_files(repo["full_name"], patterns=patterns):
+            metadata = parse_build_file(os.path.join(repo["full_name"], build_path))
+            cast(sqlite_utils.db.Table, db["repo_build_files"]).upsert(
</code_context>

<issue_to_address>
find_build_files is called with repo["full_name"], which may not be a local path.

If repo["full_name"] is a GitHub identifier like 'owner/repo', it may not map to a local directory, which could cause find_build_files to fail. Please verify that a valid local path is used.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.


def _chunk(self, sentences: List[str]) -> List[Chunk]:
chunks: List[Chunk] = []
for i in range(0, len(sentences), self.target_length):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): The chunking logic skips the final chunk if it is shorter than target_length.

Yield the final chunk even if it's shorter than target_length to avoid missing content.

pk=("repo_id", "chunk_index"),
)

for build_path in find_build_files(repo["full_name"], patterns=patterns):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): find_build_files is called with repo["full_name"], which may not be a local path.

If repo["full_name"] is a GitHub identifier like 'owner/repo', it may not map to a local directory, which could cause find_build_files to fail. Please verify that a valid local path is used.

Comment on lines +17 to +18
if pattern == 'package.json':
raise subprocess.CalledProcessError(1, args)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

ExplanationAvoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +41 to +42
if cmd == 'find':
return '/usr/bin/find'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

ExplanationAvoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +48 to +49
if pattern == 'package.json':
raise subprocess.CalledProcessError(1, args)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

ExplanationAvoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +81 to +82
if "repo_metadata" not in tables:
db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

ExplanationAvoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines 72 to 74
for full in pathlib.Path(path).rglob(pattern):
if full.is_file():
found.append(str(full))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): Replace a for append loop with list extend (for-append-to-extend)

Suggested change
for full in pathlib.Path(path).rglob(pattern):
if full.is_file():
found.append(str(full))
found.extend(
str(full)
for full in pathlib.Path(path).rglob(pattern)
if full.is_file()
)

from bs4 import BeautifulSoup

url = "https://github.com/{}/network/dependents".format(repo)
url: str | None = "https://github.com/{}/network/dependents".format(repo)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): Replace call to format with f-string (use-fstring-for-formatting)

Suggested change
url: str | None = "https://github.com/{}/network/dependents".format(repo)
url: str | None = f"https://github.com/{repo}/network/dependents"

)


def generate_starred_embeddings(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Low code quality found in generate_starred_embeddings - 15% (low-code-quality)


ExplanationThe quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

calls = []

def fake_which(cmd):
if cmd == 'find':
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): We've found these issues:

Copilot AI and others added 3 commits September 28, 2025 17:35
Co-authored-by: JeffCarpenter <178424+JeffCarpenter@users.noreply.github.com>
Co-authored-by: JeffCarpenter <178424+JeffCarpenter@users.noreply.github.com>
Co-authored-by: JeffCarpenter <178424+JeffCarpenter@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments