pure-tokenizers

CGo-free tokenizers for Go with automatic library management and HuggingFace Hub integration.

✅ No CGo required - Go bindings call the native Rust library through purego FFI
✅ HuggingFace Hub integration - Load tokenizers directly from HuggingFace models
✅ Automatic downloads - Platform-specific libraries fetched on demand
✅ Cross-platform - Windows, macOS, Linux (including ARM)
✅ Production ready - Checksum verification and ABI compatibility checks

Quick Start

Load directly from HuggingFace Hub

package main

import (
    "fmt"
    "log"

    "github.com/amikos-tech/pure-tokenizers"
)

func main() {
    // Load tokenizer directly from HuggingFace model
    tokenizer, err := tokenizers.FromHuggingFace("bert-base-uncased")
    if err != nil {
        log.Fatal(err)
    }
    defer tokenizer.Close()

    // Tokenize text
    encoding, err := tokenizer.Encode("Hello, world!", tokenizers.WithAddSpecialTokens())
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Tokens:", encoding.Tokens)
    fmt.Println("Token IDs:", encoding.IDs)
}

Or load from a local file

// Load tokenizer from file
tokenizer, err := tokenizers.FromFile("tokenizer.json")
if err != nil {
    log.Fatal(err)
}
defer tokenizer.Close()

That's it! The library automatically downloads the correct binary for your platform on first use.

Installation

go get github.com/amikos-tech/pure-tokenizers

Features

🚀 Zero Configuration

The library automatically manages platform-specific binaries. No manual downloads, no build steps, no CGo.

🔐 Secure by Default

SHA256 checksum verification for all downloads
ABI version compatibility checking
Secure HTTPS-only downloads
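
As an illustration of the checksum step, here is a minimal sketch of SHA256 verification using only the Go standard library. The helper name `verifySHA256` and the way the expected digest is supplied are assumptions made for this example; they are not taken from the pure-tokenizers API, and the library's own verification may be structured differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// verifySHA256 is a hypothetical helper (not part of the pure-tokenizers API).
// It hashes data and compares the digest with the expected hex string.
func verifySHA256(data []byte, expectedHex string) error {
	sum := sha256.Sum256(data)
	got := hex.EncodeToString(sum[:])
	want := strings.ToLower(strings.TrimSpace(expectedHex))
	if got != want {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, want)
	}
	return nil
}

func main() {
	// Stand-in for the bytes of a downloaded library file.
	payload := []byte("hello")
	// Stand-in for a digest published alongside a release asset.
	const published = "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"

	if err := verifySHA256(payload, published); err != nil {
		fmt.Println("rejecting download:", err)
		return
	}
	fmt.Println("checksum ok")
}
```

The point of the pattern is that a file is only used after its digest matches a value obtained separately from the file itself.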

🎯 Platform Native

Optimized binaries for each platform and architecture:

macOS (Intel & Apple Silicon)
Linux (x86_64, ARM64, including musl)
Windows (x86_64)

⚡ High Performance

Native Rust performance without CGo overhead. Direct FFI calls using purego.

Performance Benchmarks

The following benchmarks compare pure-tokenizers (CGo-free) with CGo-based implementations. Results show competitive performance while maintaining the benefits of a CGo-free approach.

Benchmark Comparison

Test Environment:

pure-tokenizers: Apple M3 Max, macOS (CGo-free implementation)
CGo baseline: Apple M1 Pro, macOS (daulet/tokenizers)

Note: Different hardware affects absolute timings. Focus on relative performance patterns and memory characteristics rather than exact microsecond differences.

Text Characteristics:

Short: <50 characters (typical word or phrase)
Medium: 100-500 characters (typical sentence or paragraph)
Long: >1000 characters (multiple paragraphs)

Operation	Implementation	Time/op	Memory/op	Allocs/op	Notes
Encode (Short Text)	pure-tokenizers	7.80μs	920 B	16	CGo-free
Encode (Short Text)	CGo baseline	10.50μs	256 B	12	HuggingFace tokenizer
Encode (Medium Text)	pure-tokenizers	30.50μs	1,552 B	35	CGo-free
Encode (Long Text)	pure-tokenizers	267.00μs	6,864 B	165	CGo-free
Decode Operations	pure-tokenizers	13.40μs	740 B	10	CGo-free
Decode Operations	CGo baseline	1.50μs	64 B	2	HuggingFace tokenizer
Encode/Decode Cycle	pure-tokenizers	52.50μs	2,296 B	45	Medium text, CGo-free

Key Performance Characteristics

✅ Advantages of CGo-free approach:

No CGo overhead: Eliminates C-Go boundary crossing costs
Cross-compilation friendly: No CGo dependencies simplify building
Memory safety: Pure Go memory management
Deployment simplicity: Single binary with automatic library management

📊 Performance Analysis:

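Encoding performance: Faster than the CGo baseline for short texts in the table above (7.80μs vs 10.50μs), though the two runs used different hardware
Decoding performance: Slower than the CGo baseline in the table above (13.40μs vs 1.50μs)
Memory usage: Higher memory and allocation count per short-text encode (920 B and 16 allocs vs 256 B and 12 allocs) due to the FFI boundary, but predictable patterns
Batch processing: Efficient handling of multiple text inputs (356.00μs for a batch of 5 texts, see Advanced Benchmarks)
Platform consistency: Consistent performance across all supported platforms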
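
To make the relative numbers concrete, the following sketch computes speedup ratios from the timings in the comparison table. The `speedup` helper is a name chosen for this example. Because the two implementations were measured on different machines (M3 Max vs M1 Pro), the ratios are only a rough indication.

```go
package main

import "fmt"

// speedup returns how many times faster the candidate is than the baseline.
// Values above 1 favor the candidate; values below 1 favor the baseline.
func speedup(baselineMicros, candidateMicros float64) float64 {
	return baselineMicros / candidateMicros
}

func main() {
	// Timings copied from the comparison table (CGo baseline first, pure-tokenizers second).
	fmt.Printf("encode (short text): %.2fx\n", speedup(10.50, 7.80))
	fmt.Printf("decode:              %.2fx\n", speedup(1.50, 13.40))
}
```

A ratio of about 1.35 for short-text encoding and about 0.11 for decoding matches the pattern described above: encoding is ahead of the baseline, decoding is behind it.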

Advanced Benchmarks

Feature	Time/op	Memory/op	Allocs/op	Notes
Batch Processing (5 texts)	356.00μs	11,568 B	261	Parallel encoding
With Options (all attributes)	34.30μs	2,160 B	41	Full feature set
Truncation (128 tokens)	258.00μs	5,632 B	127	Max length enforcement
Padding (256 tokens)	84.90μs	16,272 B	535	Fixed length output
HuggingFace Loading (cached)	26.20ms	6.45 MB	92,188	Model initialization

Benchmark Environment

# Run benchmarks locally
make build && go test -bench=. -benchmem

# Run only the encode or decode benchmarks
go test -bench=BenchmarkEncode -benchmem
go test -bench=BenchmarkDecode -benchmem

Platform-specific results: Benchmarks run continuously in CI across Linux, macOS, and Windows. See benchmark workflow for automated performance tracking.

Usage Examples

HuggingFace Hub Integration

// Load tokenizer from any public HuggingFace model (each call below is an alternative)
tokenizer, err := tokenizers.FromHuggingFace("bert-base-uncased")
tokenizer, err = tokenizers.FromHuggingFace("gpt2")
tokenizer, err = tokenizers.FromHuggingFace("sentence-transformers/all-MiniLM-L6-v2")

// Load from private/gated models with authentication
tokenizer, err = tokenizers.FromHuggingFace("meta-llama/Llama-2-7b-hf",
    tokenizers.WithHFToken(os.Getenv("HF_TOKEN")))

// Configure HuggingFace options
token := os.Getenv("HF_TOKEN")
tokenizer, err = tokenizers.FromHuggingFace("bert-base-uncased",
    tokenizers.WithHFToken(token),              // Authentication token
    tokenizers.WithHFRevision("main"),          // Specific revision/branch
    tokenizers.WithHFCacheDir("/custom/cache"), // Custom cache directory
    tokenizers.WithHFTimeout(30*time.Second),   // Download timeout
    tokenizers.WithHFOfflineMode(true),         // Use cached version only
)

// The tokenizer is automatically cached for offline use
// Cache location: ~/.cache/tokenizers/huggingface/ (Linux/macOS)
//                 %APPDATA%/tokenizers/huggingface/ (Windows)
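
The cache locations listed above can be expressed as a small path-resolution function. This is a sketch only: `hfCacheDir` and its parameters are hypothetical names, and the library may resolve its cache directory differently (for example by honoring `WithHFCacheDir` first).

```go
package main

import (
	"fmt"
	"path/filepath"
)

// hfCacheDir mirrors the documented defaults:
//   - Windows:     %APPDATA%/tokenizers/huggingface
//   - Linux/macOS: ~/.cache/tokenizers/huggingface
//
// goos, home and appData are passed in so the function stays easy to test.
func hfCacheDir(goos, home, appData string) string {
	if goos == "windows" {
		return filepath.Join(appData, "tokenizers", "huggingface")
	}
	return filepath.Join(home, ".cache", "tokenizers", "huggingface")
}

func main() {
	// Example inputs; real code would read these from the environment.
	fmt.Println(hfCacheDir("linux", "/home/alice", ""))
	fmt.Println(hfCacheDir("darwin", "/Users/alice", ""))
}
```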

📚 See also:

HuggingFace Integration Guide - Comprehensive documentation
Example: Basic Usage - Loading various models
Example: Cache Management - Working with cache
Example: Private Models - Authentication and gated models

Basic Tokenization

// Load a tokenizer from file
tokenizer, err := tokenizers.FromFile("path/to/tokenizer.json")
if err != nil {
    log.Fatal(err)
}
defer tokenizer.Close()

// Simple encoding
encoding, err := tokenizer.Encode("Hello, world!")
if err != nil {
    log.Fatal(err)
}

// With special tokens
encoding, err = tokenizer.Encode("Hello, world!", tokenizers.WithAddSpecialTokens())

Advanced Options

// Encoding with custom options
encoding, err := tokenizer.Encode("Your text here",
    tokenizers.WithAddSpecialTokens(),
    tokenizers.WithReturnTokens(),
    tokenizers.WithReturnAttentionMask(),
    tokenizers.WithReturnTypeIDs(),
)

// Create tokenizer with truncation and padding
tokenizer, err := tokenizers.FromFile("tokenizer.json",
    tokenizers.WithTruncation(512, tokenizers.TruncationDirectionRight, tokenizers.TruncationStrategyLongestFirst),
    tokenizers.WithPadding(true, tokenizers.PaddingStrategy{Tag: tokenizers.PaddingStrategyFixed, FixedSize: 512}),
)

// Access different parts of the encoding result
if encoding.Tokens != nil {
    fmt.Println("Tokens:", encoding.Tokens)
}
if encoding.IDs != nil {
    fmt.Println("Token IDs:", encoding.IDs)
}
if encoding.AttentionMask != nil {
    fmt.Println("Attention mask:", encoding.AttentionMask)
}
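
For readers unfamiliar with fixed-size padding and right-side truncation, the sketch below shows the effect on a slice of token IDs and the matching attention mask. It is a conceptual illustration, not the library's implementation: `fitToLength` is a hypothetical helper, pad ID 0 is an assumption, and real tokenizers typically also preserve special tokens such as the closing separator when truncating.

```go
package main

import "fmt"

// fitToLength returns exactly size IDs: extra IDs are dropped from the right,
// missing positions are filled with padID on the right.
// The mask holds 1 for real tokens and 0 for padding. size must be >= 0.
func fitToLength(ids []uint32, size int, padID uint32) ([]uint32, []uint32) {
	out := make([]uint32, size)
	mask := make([]uint32, size)
	n := copy(out, ids) // copies at most size elements
	for i := 0; i < n; i++ {
		mask[i] = 1
	}
	for i := n; i < size; i++ {
		out[i] = padID
	}
	return out, mask
}

func main() {
	ids := []uint32{101, 7592, 1010, 2088, 999, 102}

	padded, mask := fitToLength(ids, 8, 0) // shorter than 8: padded
	fmt.Println(padded, mask)

	cut, mask := fitToLength(ids, 4, 0) // longer than 4: truncated
	fmt.Println(cut, mask)
}
```

With a fixed size, every output has the same length, which is what batched model inputs usually require.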

Decoding Tokens

// Decode token IDs back to text
ids := []uint32{101, 7592, 1010, 2088, 999, 102}
text, err := tokenizer.Decode(ids, true)
if err != nil {
    log.Fatal(err)
}
fmt.Println(text)  // "hello, world!"

Loading from Configuration Files

// Load tokenizer from a downloaded tokenizer.json file
tokenizer, err := tokenizers.FromFile("path/to/tokenizer.json")

// Load from byte configuration
configBytes, err := os.ReadFile("tokenizer.json")
if err != nil {
    log.Fatal(err)
}
tokenizer, err = tokenizers.FromBytes(configBytes)

// Use with custom library path
tokenizer, err = tokenizers.FromFile("tokenizer.json",
    tokenizers.WithLibraryPath("/custom/path/to/libtokenizers.so"))

Configuration

Environment Variables

Variable	Description	Default
`TOKENIZERS_LIB_PATH`	Custom library path	Auto-detect
`TOKENIZERS_GITHUB_REPO`	GitHub repo for downloads	`amikos-tech/pure-tokenizers`
`TOKENIZERS_VERSION`	Library version to download	`latest`
`GITHUB_TOKEN`	GitHub API token (for rate limits)	None
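
The table above can be read as "use the variable if set, otherwise fall back to the default". A minimal sketch of that pattern follows. The `downloadConfig` struct and the `envOr` and `loadConfig` helpers are hypothetical names for this example; only the variable names and defaults come from the table.

```go
package main

import (
	"fmt"
	"os"
)

// downloadConfig groups the settings from the environment variable table.
type downloadConfig struct {
	LibPath     string // empty means auto-detect
	Repo        string
	Version     string
	GitHubToken string // empty means unauthenticated
}

// envOr returns the value for key if it is set and non-empty, else fallback.
// lookup has the same shape as os.LookupEnv so tests can inject a fake.
func envOr(lookup func(string) (string, bool), key, fallback string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return fallback
}

func loadConfig(lookup func(string) (string, bool)) downloadConfig {
	return downloadConfig{
		LibPath:     envOr(lookup, "TOKENIZERS_LIB_PATH", ""),
		Repo:        envOr(lookup, "TOKENIZERS_GITHUB_REPO", "amikos-tech/pure-tokenizers"),
		Version:     envOr(lookup, "TOKENIZERS_VERSION", "latest"),
		GitHubToken: envOr(lookup, "GITHUB_TOKEN", ""),
	}
}

func main() {
	cfg := loadConfig(os.LookupEnv)
	fmt.Printf("repo=%s version=%s custom path set=%t\n",
		cfg.Repo, cfg.Version, cfg.LibPath != "")
}
```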

Library Loading Options

// Use a specific library path
tokenizer, err := tokenizers.FromFile("tokenizer.json",
    tokenizers.WithLibraryPath("/custom/path/to/libtokenizers.so"))

// The library loading priority:
// 1. User-provided path via WithLibraryPath()
// 2. TOKENIZERS_LIB_PATH environment variable
// 3. Cached library in platform directory
// 4. Automatic download from GitHub releases
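
The priority order above can be sketched as a plain function that checks each source in turn. This is an illustration under assumptions: `resolveLibrary` is a hypothetical name, and whether the library validates explicit paths before using them is not stated here, so the sketch only checks existence for the cached path.

```go
package main

import "fmt"

// resolveLibrary walks the four documented steps in order and reports
// which one matched. exists is injected so the sketch runs without
// touching the filesystem.
func resolveLibrary(userPath, envPath, cachedPath string, exists func(string) bool) (path, source string) {
	if userPath != "" {
		return userPath, "WithLibraryPath" // 1. explicit option wins
	}
	if envPath != "" {
		return envPath, "TOKENIZERS_LIB_PATH" // 2. environment variable
	}
	if cachedPath != "" && exists(cachedPath) {
		return cachedPath, "cache" // 3. previously downloaded library
	}
	return "", "download" // 4. nothing local: caller must download
}

func main() {
	// Pretend only the cached file is present on disk.
	onDisk := map[string]bool{"/cache/libtokenizers.so": true}
	exists := func(p string) bool { return onDisk[p] }

	path, source := resolveLibrary("", "", "/cache/libtokenizers.so", exists)
	fmt.Println(source, path)

	path, source = resolveLibrary("/custom/path/to/libtokenizers.so", "", "/cache/libtokenizers.so", exists)
	fmt.Println(source, path)
}
```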

Cache Management

For comprehensive cache management documentation, see Cache Management Guide.

Library Cache

// Get the library cache directory
cachePath := tokenizers.GetCachedLibraryPath()

// Clear the library cache
err := tokenizers.ClearLibraryCache()

// Download and cache a specific version
err = tokenizers.DownloadAndCacheLibraryWithVersion("v0.1.0")

HuggingFace Cache

// Get HuggingFace cache information
info, err := tokenizers.GetHFCacheInfo("bert-base-uncased")

// Clear cache for a specific model
err = tokenizers.ClearHFModelCache("bert-base-uncased")

// Clear entire HuggingFace cache
err = tokenizers.ClearHFCache()

// Use offline mode (only use cached tokenizers)
tokenizer, err := tokenizers.FromHuggingFace("bert-base-uncased",
    tokenizers.WithHFOfflineMode(true))
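
Offline mode boils down to a cache-first lookup that refuses to touch the network. The sketch below models that behavior with an in-memory map. All names here (`tokenizerCache`, `load`, `errNotCached`) are hypothetical and the real cache is file-based, so treat this as a picture of the control flow only.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotCached = errors.New("model not in cache and offline mode is enabled")

// tokenizerCache maps a model ID to its tokenizer.json bytes.
type tokenizerCache map[string][]byte

// load returns cached data when present. On a miss it fails in offline
// mode, otherwise it calls fetch and stores the result for next time.
func (c tokenizerCache) load(model string, offline bool, fetch func(string) ([]byte, error)) ([]byte, error) {
	if data, ok := c[model]; ok {
		return data, nil
	}
	if offline {
		return nil, errNotCached
	}
	data, err := fetch(model)
	if err != nil {
		return nil, err
	}
	c[model] = data
	return data, nil
}

func main() {
	cache := tokenizerCache{}
	// Stand-in for an HTTP download from the Hub.
	fetch := func(model string) ([]byte, error) {
		return []byte(`{"model":"` + model + `"}`), nil
	}

	if _, err := cache.load("bert-base-uncased", true, fetch); err != nil {
		fmt.Println("offline, cold cache:", err)
	}
	if _, err := cache.load("bert-base-uncased", false, fetch); err == nil {
		fmt.Println("online: fetched and cached")
	}
	if _, err := cache.load("bert-base-uncased", true, fetch); err == nil {
		fmt.Println("offline, warm cache: served from cache")
	}
}
```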

Platform Support

Platform	Architecture	Binary	Status
macOS	x86_64	`.dylib`	✅
macOS	aarch64 (Apple Silicon)	`.dylib`	✅
Linux	x86_64	`.so`	✅
Linux	aarch64	`.so`	✅
Linux (musl)	x86_64	`.so`	✅
Linux (musl)	aarch64	`.so`	✅
Windows	x86_64	`.dll`	✅
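
The Binary column above maps directly onto Go's `runtime.GOOS` values. The following sketch shows that mapping; `sharedLibExt` is a hypothetical helper for illustration, and the actual file names the library downloads are not assumed here.

```go
package main

import (
	"fmt"
	"runtime"
)

// sharedLibExt returns the shared-library extension from the platform
// table for a given GOOS value.
func sharedLibExt(goos string) (string, error) {
	switch goos {
	case "darwin":
		return ".dylib", nil
	case "linux":
		return ".so", nil
	case "windows":
		return ".dll", nil
	default:
		return "", fmt.Errorf("unsupported platform: %s", goos)
	}
}

func main() {
	ext, err := sharedLibExt(runtime.GOOS)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%s/%s uses a %s library\n", runtime.GOOS, runtime.GOARCH, ext)
}
```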

Development

Building from Source

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone the repository
git clone https://github.com/amikos-tech/pure-tokenizers
cd pure-tokenizers

# Build the Rust library
make build

# Run tests
make test

# Run linting
make lint-fix      # Go linting
make lint-rust     # Rust linting

Testing

Unit Tests

# Run all unit tests
make test

# Run with specific library path
make test-lib-path

Integration Tests

Integration tests verify real-world functionality with HuggingFace models:

# Setup for local testing
cp .env.example .env
# Edit .env and add your HF_TOKEN (get from https://huggingface.co/settings/tokens)

# Run all integration tests (requires HF_TOKEN for private models)
make test-integration

# Run only HuggingFace integration tests
make test-integration-hf

The integration tests cover:

Public model downloads (BERT, GPT2, DistilBERT)
Private model access (with HF_TOKEN)
Caching behavior verification
Rate limiting handling
Offline mode functionality

Note: Integration tests run automatically in CI on the main branch and on PRs that carry the integration label.

Project Structure

pure-tokenizers/
├── src/           # Rust FFI implementation
├── *.go           # Go bindings
├── download.go    # Auto-download functionality
├── library.go     # Platform-specific FFI loading
└── Makefile       # Build automation

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'feat: add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built on top of the excellent Hugging Face Tokenizers library.
