
BitNeural32: 1.58-Bit Ternary Neural Network Compiler for ESP32


A Python library for BitNet-inspired quantization-aware training and for compiling neural networks to an ultra-efficient 1.58-bit (ternary) format for deployment on ESP32 microcontrollers.

See also: BitNeural32 Inference Library

Features

1.58-Bit Quantization: Extreme compression—weights packed as 2-bit values (4 weights per byte) using ternary {-1, 0, 1}

Quantization-Aware Training (QAT): Custom Keras layers that apply quantization during training for better post-export accuracy

Production-Ready Compiler: Convert Keras models to optimized C bytecode with automatic weight flattening, packing, and metadata generation

Inference Metrics: Estimate inference time, RAM usage, and Flash size for different ESP32 variants (ESP32, ESP32-S3, ESP32-C3)

15+ Layer Types: Dense, Conv1D, Conv2D, LSTM, GRU, ReLU, LeakyReLU, Softmax, Sigmoid, Tanh, MaxPooling1D, Flatten, Dropout, and more

Type Safe: Full Python 3.9+ support with comprehensive type hints

Installation

From PyPI (recommended)

pip install bitneural32

Requirements

  • Python: 3.9 or higher
  • Keras: 3.0+
  • TensorFlow: 2.16+ (or standalone Keras 3.x)
  • NumPy: 1.21+

Quick Start

1. Train with Quantization-Aware Training (Recommended)

import numpy as np
import keras
from bitneural32.qat import TernaryDense, TernaryConv1D

# Build a QAT model
model = keras.Sequential([
    TernaryConv1D(filters=32, kernel_size=5, padding='same', input_shape=(100, 1)),
    keras.layers.ReLU(),
    keras.layers.MaxPooling1D(2),
    keras.layers.Flatten(),
    TernaryDense(64),
    keras.layers.ReLU(),
    TernaryDense(10, activation='softmax')
])

# Train normally—quantization happens automatically
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
X_train = np.random.randn(1000, 100, 1).astype('float32')
Y_train = keras.utils.to_categorical(np.random.randint(0, 10, 1000), 10)
model.fit(X_train, Y_train, epochs=10, batch_size=32, verbose=1)

# Save for export
model.save('qat_model.keras')

2. Compile to ESP32 Bytecode

import keras
from bitneural32.compiler import BitNeuralCompiler

# Load the QAT model and compile it for the target board
compiler = BitNeuralCompiler(board_type='ESP32-S3')
qat_model = keras.models.load_model('qat_model.keras')
compiler.compile_model(qat_model, input_data=X_train, allow_metrics=True)
compiler.save_c_header('model_data.h', include_metrics=True)

# View metrics
report = compiler.get_compilation_report()
print(report)

Output example:

{
  "board_type": "ESP32-S3",
  "total_size_bytes": 24576,
  "num_layers": 8,
  "inference_time_ms": 12.5,
  "ram_usage_bytes": 1024,
  "total_macs": 2500000,
  "layers": [...]
}

3. Run on ESP32

Include the exported model_data.h in your ESP32 firmware; see the Deployment Guide for details.

API Reference

QAT Layers

All custom QAT layers support standard Keras layer interfaces and compile seamlessly:

TernaryDense(units, **kwargs)

Fully-connected layer with ternary quantization.

layer = TernaryDense(64, activation='relu')

TernaryConv1D(filters, kernel_size, strides=1, padding='same', **kwargs)

1D convolution optimized for single-channel inputs (e.g., time-series).

layer = TernaryConv1D(32, kernel_size=5, padding='same')

TernaryConv2D(filters, kernel_size, strides=1, padding='same', **kwargs)

2D convolution supporting multi-channel inputs and outputs.

layer = TernaryConv2D(16, kernel_size=3, padding='same')

TernaryLSTM(units, return_sequences=False, **kwargs)

LSTM recurrent layer with quantized weights and float32 biases.

layer = TernaryLSTM(32, return_sequences=True)

TernaryGRU(units, return_sequences=False, **kwargs)

GRU recurrent layer with quantized weights and float32 biases.

layer = TernaryGRU(32, return_sequences=False)
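
For example, the recurrent QAT layers stack like their standard Keras counterparts. The input shape and layer sizes below are illustrative, not taken from the library's documentation:

import keras
from bitneural32.qat import TernaryLSTM, TernaryDense

# Sketch: a small sequence classifier built from the QAT layers above
model = keras.Sequential([
    keras.Input(shape=(50, 4)),              # 50 timesteps, 4 features
    TernaryLSTM(32, return_sequences=True),
    TernaryLSTM(16),
    TernaryDense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])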

Compiler API

BitNeuralCompiler(model=None, board_type='ESP32')

Parameters:

  • model (keras.Model, optional): Model to compile immediately; it can also be passed later to compile_model()
  • board_type (str): Target ESP32 variant ('ESP32', 'ESP32-S3', 'ESP32-C3')
Methods:

  • compile_model(model, input_data=None, allow_metrics=False): Compile a Keras model
  • save_c_header(filepath, include_metrics=False): Export to C header file
  • get_compilation_report(): Get human-readable report (dict)
  • export_model(filepath, allow_metrics=False): Convenience export function

Example:

compiler = BitNeuralCompiler(board_type='ESP32-S3')
compiler.compile_model(model, input_data=X_train, allow_metrics=True)
compiler.save_c_header('model.h', include_metrics=True)

Quantization Utilities

quantize_weights_ternary(weights)

Quantize float32 weights to {-1, 0, 1} using median-based thresholding.

import numpy as np
from bitneural32.quantize import quantize_weights_ternary

quantized = quantize_weights_ternary(np.random.randn(100, 100))

pack_weights_2bit(quantized_weights)

Pack ternary weights into 2-bit format (4 weights per byte).

from bitneural32.quantize import pack_weights_2bit
packed = pack_weights_2bit(quantized)

Architecture Overview

Quantization Strategy

BitNeural32 uses ternary quantization (a minimal sketch in code follows the encoding table below):

  1. Median-based thresholding: Set threshold = median(|weights|)
  2. Ternary encoding:
    • Weight > threshold → 1
    • Weight < -threshold → -1
    • Otherwise → 0
  3. 2-bit packing: 4 weights per byte (2 bits each)

Encoding:

  • 00 → 0
  • 01 → 1
  • 10 → -1
  • 11 → reserved
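
For illustration, here is a minimal NumPy sketch of the same strategy. It is not the library's implementation (use quantize_weights_ternary and pack_weights_2bit for that), and the bit ordering within each packed byte is an assumption:

import numpy as np

def ternary_quantize(weights):
    """Map float weights to {-1, 0, 1} using a median-based threshold."""
    threshold = np.median(np.abs(weights))
    return np.where(weights > threshold, 1,
           np.where(weights < -threshold, -1, 0)).astype(np.int8)

def pack_2bit(ternary):
    """Pack ternary values 4-per-byte using the encoding 0->00, 1->01, -1->10."""
    codes = np.where(ternary == 1, 0b01, np.where(ternary == -1, 0b10, 0b00))
    flat = codes.astype(np.uint8).ravel()
    flat = np.pad(flat, (0, (-len(flat)) % 4))   # pad to a multiple of 4 weights
    flat = flat.reshape(-1, 4)
    # assumed ordering: first weight in the low-order bits of each byte
    return (flat[:, 0] | (flat[:, 1] << 2) | (flat[:, 2] << 4) | (flat[:, 3] << 6)).astype(np.uint8)

w = np.random.randn(8, 8).astype('float32')
packed = pack_2bit(ternary_quantize(w))   # 64 weights -> 16 bytes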

QAT Training

Quantization-aware training applies quantization inside the training loop (see the sketch after this list):

  1. Forward pass: Weights quantized to {-1, 0, 1} with learnable scale
  2. Backward pass: Straight-through estimator (STE) for gradient computation
  3. Result: Network adapts to quantization → 2-5% higher accuracy after export
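
As a sketch of this pattern (not the actual TernaryDense internals), a straight-through-estimator quantizer can be written with keras.ops as follows; the per-tensor scale choice is an assumption:

import keras
from keras import ops

def ternary_ste(w):
    """Forward: ternarize w with a median threshold and a per-tensor scale.
    Backward: straight-through estimator (gradient flows as if identity)."""
    t = ops.median(ops.abs(w))
    w_q = ops.cast(w > t, "float32") - ops.cast(w < -t, "float32")   # values in {-1, 0, 1}
    scale = ops.mean(ops.abs(w))                                     # assumed scaling choice
    # w + stop_gradient(q - w) evaluates to q in the forward pass,
    # but its gradient with respect to w is 1 (the STE trick).
    return w + ops.stop_gradient(w_q * scale - w)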

Compilation Pipeline

Keras Model
    ↓
[Per-Layer Compilation]
    ↓
Weight Flattening (layer-specific order)
    ↓
Ternary Quantization + 2-Bit Packing
    ↓
Binary Blob Generation
    ↓
C Header Export
    ↓
model_data.h (ready for ESP32 inclusion)

Performance Characteristics

Memory Footprint

Example: 10→64→32→10 network

Format                Size
Float32               40 KB
Ternary (1.58-bit)    2.5 KB
Compression           94%
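
The ~94% figure follows directly from the bit widths: packed ternary weights use 2 bits each versus 32 bits for float32, a 16× reduction (biases stay float32, so whole models compress slightly less):

# 2-bit packed ternary vs. 32-bit float weight storage
print(f"{1 - 2 / 32:.0%}")   # -> 94%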

Inference Speed (ESP32 @ 240 MHz)

Layer Type      Input→Output                        Approx. Time
Dense           1000→1000                           10-50 ms
Conv1D          100 inputs, 32 filters, kernel 5    5-20 ms
Conv2D          28×28→14×14, 32 filters             20-100 ms
LSTM            32 hidden, 50 timesteps             15-80 ms
Full Network    10→64→32→10                         1-5 ms

Supported Layers

Layer           QAT Version      Notes
Dense           TernaryDense     ✅ Full support
Conv1D          TernaryConv1D    ✅ Mono-channel optimized
Conv2D          TernaryConv2D    ✅ Multi-channel support
LSTM            TernaryLSTM      ✅ Quantized kernel & recurrent
GRU             TernaryGRU       ✅ Quantized kernel & recurrent
ReLU            Standard         ✅ No quantization needed
LeakyReLU       Standard         ✅ Works as-is
Softmax         Standard         ✅ Uses float32 for stability
Sigmoid         Standard         ✅ Fast Padé approximation on ESP32
Tanh            Standard         ✅ Fast Padé approximation on ESP32
MaxPooling1D    Standard         ✅ No quantization
Flatten         Standard         ✅ Memory layout only
Dropout         Standard         ✅ No-op at inference

Tips & Best Practices

Model Design

  • Start with QAT layers for better accuracy after quantization
  • Use smaller models: Ternary networks benefit from depth over width
  • Avoid BatchNormalization before quantized layers; fold it into the adjacent layer's float weights instead (see the sketch after this list)
  • Use ReLU/LeakyReLU for better quantization robustness
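
A generic way to do that folding for a BatchNormalization layer that feeds a Dense layer is sketched below. This is plain linear algebra applied to the float weights before quantization, not a BitNeural32 API:

import numpy as np

def fold_bn_into_next_dense(W, b, gamma, beta, mean, var, eps=1e-3):
    """Fold BatchNorm (applied to the Dense layer's input) into the Dense
    layer's float kernel and bias, so the BN op can be removed before export.
    Shapes: W (in, out); b (out,); gamma/beta/mean/var (in,)."""
    scale = gamma / np.sqrt(var + eps)          # per-input-feature BN scale
    W_folded = W * scale[:, None]               # absorb the scale into the kernel rows
    b_folded = b + (beta - scale * mean) @ W    # absorb the shift into the bias
    return W_folded, b_folded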

Training

  • Learning rate: Use a roughly 10× lower learning rate than standard float training (see the snippet after this list)
  • Epochs: Train 20-50% longer to adapt to quantization
  • Batch size: 32-128 works well for most models
  • Monitor accuracy: QAT models may drop 1-3% initially, then recover
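
For example, reusing the Quick Start model and data, a QAT run could look like this; the exact values are illustrative, not library defaults:

import keras

# ~10x lower than Adam's usual 1e-3 default, and ~50% more epochs than the 10-epoch baseline above
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)
model.fit(X_train, Y_train, epochs=15, batch_size=64)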

Compilation

  • Always provide input_data: Needed for input normalization statistics
  • Check metrics: Use allow_metrics=True to estimate ESP32 performance
  • Board selection: ESP32-S3 has more RAM; ESP32-C3 is power-efficient

Deployment

  • Test on target hardware: Simulator timings differ from real ESP32
  • Use dual-core: Enable Core 1 for real-time audio/sensor processing
  • Monitor UART: Check inference logs for bottlenecks

Troubleshooting

"Unsupported layer type"

Make sure you are using the QAT layer versions or standard Keras layers. If you need a custom layer, register a compiler for it (MyLayerCompiler below stands in for your own implementation):

# Add to compiler mapping
from bitneural32.compiler import BitNeuralCompiler
BitNeuralCompiler.LAYER_COMPILER_MAP['MyLayer'] = MyLayerCompiler()

Model accuracy drops significantly after quantization

  • Use QAT layers instead of post-training quantization
  • Train longer (2-3× epochs)
  • Lower learning rate by 10×
  • Use warm-up training (standard float → gradual quantization)

Compiled model is too large

  • Reduce model size (fewer filters/units)
  • Use depthwise separable convolutions
  • Remove dense layers, use global pooling instead
  • Prune weights before compilation

ESP32 inference is slow

  • Check clock speed (set to 240 MHz max)
  • Profile with bn_run_inference() timing
  • Use Conv1D instead of Dense for temporal data
  • Consider smaller input resolution

Citation

If you use BitNeural32 in your research, please cite:

@software{bitneural32,
  title = {BitNeural32: 1.58-Bit Ternary Neural Network Compiler for ESP32},
  author = {Aizhee},
  year = {2025},
  url = {https://github.com/aizhee/python-bitneural32}
}

License

MIT License - See LICENSE file for details.


Made with ❤️ by Aizhee for embedded machine learning
