
Showing posts with label GPU.

12/30/2025

MareArts ANPR V14 Models - Complete Performance Guide & Benchmarks


⚡ MareArts ANPR V14 Models - Performance, Metrics & How to Choose

Choosing the right ANPR model is crucial for your application. Too heavy? Slow performance. Too light? Lower accuracy. In this comprehensive guide, we'll break down all MareArts ANPR V14 models with real benchmarks to help you make the perfect choice.

🎯 Two-Stage Pipeline Architecture

MareArts ANPR uses a two-stage pipeline:

  1. Detector - Finds license plates in images (Where is the plate?)
  2. OCR - Reads text from detected plates (What does it say?)

You can mix and match models from each stage to optimize for your specific needs!
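For example, pairing a light detector with a region-specific OCR model looks like this (a minimal sketch using the same ma_anpr_detector_v14 / ma_anpr_ocr_v14 calls shown in the use-case configurations below; user, key, sig stand for your license credentials):

from marearts_anpr import ma_anpr_detector_v14, ma_anpr_ocr_v14

# Stage 1: detector (where is the plate?)
detector = ma_anpr_detector_v14("micro_320p_fp32", user, key, sig, backend="cuda")

# Stage 2: OCR (what does it say?) - swap either model name independently
ocr = ma_anpr_ocr_v14("small_fp32", "kr", user, key, sig)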

📊 Detector Models - Find License Plates

Model Sizes Explained

Size Parameters Speed Accuracy Best For
pico Smallest Fast Good (96-98%) Mobile, Edge devices
micro Small Very Fast Excellent (97-99%) 🏆 Best overall
small Medium Fastest Excellent (98-99%) High-speed applications
medium Large Fast Excellent (98-99%) Balanced
large Largest Moderate Highest (99%+) Maximum accuracy

Resolution Options

  • 320p models (320×320) - 2× faster, 96-98% detection
  • 640p models (640×640) - Highest accuracy, 98-99% detection

Precision Options

  • FP32 - Fastest on GPU (2× faster than FP16), standard size
  • FP16 - 50% smaller file size, same accuracy, slower inference

Complete Detector Performance Table

Model Name Detection Rate Speed (GPU) Size Recommendation
micro_320p_fp32 97.13% 128 FPS (7.8ms) 83 MB 🏆 Best overall
micro_320p_fp16 97.13% 56 FPS (17.9ms) 42 MB 🏆 Best mobile
small_320p_fp32 98.00% 142 FPS (7.0ms) 114 MB ⚡ Fastest
medium_320p_fp32 98.06% 136 FPS (7.4ms) 153 MB High detection
large_320p_fp32 98.40% 131 FPS (7.6ms) 164 MB Strong performance
pico_320p_fp32 96.02% 129 FPS (7.8ms) 75 MB 📱 Smallest + fast
pico_640p_fp32 98.54% 66 FPS (15.2ms) 75 MB Balanced
small_640p_fp32 99.15% 70 FPS (14.3ms) 114 MB High detection
medium_640p_fp32 99.21% 66 FPS (15.1ms) 153 MB Very high
large_640p_fp32 99.31% 60 FPS (16.7ms) 164 MB 🎯 Highest accuracy

Key Findings:

  • 320p models: 2× faster than 640p (96-98% accuracy)
  • 640p models: Highest accuracy (98-99%) for difficult cases
  • FP16 models: 50% smaller, same accuracy, ~50% slower
  • Recommended: micro_320p_fp32 (best speed/accuracy balance)

📖 OCR Models - Read License Plate Text

Two Key Metrics

  • Exact Match - Entire plate number is 100% correct
  • Character Accuracy - Percentage of individual characters correct

Example: Actual plate: "ABC-1234"

  • OCR reads "ABC-1234" → ✅ Exact Match = Yes, Char Accuracy = 100%
  • OCR reads "ABC-1235" → ❌ Exact Match = No, Char Accuracy = 87.5% (7/8 correct)
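As a quick sketch of how these two metrics relate (a simple position-wise comparison for illustration; the benchmark tooling itself is not shown in this post):

def plate_metrics(predicted: str, actual: str):
    # Exact match: the whole string must be identical
    exact = predicted == actual
    # Character accuracy: position-wise comparison over the longer string
    length = max(len(predicted), len(actual))
    correct = sum(p == a for p, a in zip(predicted, actual))
    return exact, 100.0 * correct / length

print(plate_metrics("ABC-1234", "ABC-1234"))  # (True, 100.0)
print(plate_metrics("ABC-1235", "ABC-1234"))  # (False, 87.5) -> 7/8 characters correct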

Complete OCR Performance by Region

🌍 Universal (univ) - All Regions
Model Exact Match Char Accuracy FPS Size
pico_fp32 97.48% 98.87% 264 20 MB
micro_fp32 97.54% 98.86% 260 71 MB
small_fp32 97.51% 98.85% 291 112 MB
medium_fp32 97.57% 98.89% 245 164 MB
large_fp32 97.75% 98.91% 253 179 MB
🇰🇷 Korean (kr) - Best Overall Accuracy
Model Exact Match Char Accuracy FPS
pico_fp32 98.99% 99.77% 272
micro_fp32 99.21% 99.80% 250
small_fp32 99.19% 99.80% 295
medium_fp32 99.21% 99.80% 267
large_fp32 99.27% 99.82% 265
🇪🇺 Europe+ (eup) - EU + Additional Countries
Model Exact Match Char Accuracy FPS
pico_fp32 94.98% 97.39% 280
micro_fp32 95.07% 97.46% 266
small_fp32 94.98% 97.43% 304
medium_fp32 95.03% 97.46% 278
large_fp32 95.32% 97.54% 260
🇺🇸 North America (na) - USA, Canada, Mexico
Model Exact Match Char Accuracy FPS
pico_fp32 71.21% 88.43% 268
micro_fp32 71.21% 87.67% 269
small_fp32 69.70% 88.27% 311
medium_fp32 63.64% 87.24% 284
large_fp32 69.70% 86.25% 271
🇨🇳 China (cn)
Model Exact Match Char Accuracy FPS
pico_fp32 96.24% 98.82% 268
micro_fp32 96.30% 98.74% 265
small_fp32 96.36% 98.88% 301
medium_fp32 96.36% 98.89% 276
large_fp32 96.49% 98.87% 262

OCR Model Averages (All Regions)

Model Avg Exact Match Avg Char Accuracy Avg FPS Size
small_fp32 91.54% 96.64% 300 FPS 112 MB
pico_fp32 91.78% 96.65% 270 FPS 20 MB
micro_fp32 91.86% 96.50% 262 FPS 71 MB
medium_fp32 90.36% 96.45% 270 FPS 164 MB
large_fp32 91.70% 96.27% 262 FPS 179 MB

🌍 Regional Vocabulary Support

Region Code Coverage Character Sets
Universal univ All regions (default) All character sets
Korea kr South Korea Hangul + Latin + Digits
Europe+ eup EU + UK, Switzerland, Norway Latin + Cyrillic + Special
North America na USA, Canada, Mexico Latin + Digits
China cn China Chinese + Latin + Digits

Pro Tip: Always use specific regions for best accuracy. Only use univ when the region is unknown!

🎯 How to Choose the Right Models

Use Case 1: Parking Management

Requirements: Good accuracy, real-time performance, cost-effective

# Recommended Configuration
detector = ma_anpr_detector_v14(
    "micro_320p_fp32",  # 97% detection, 128 FPS
    user, key, sig,
    backend="cuda",
    conf_thres=0.25
)

ocr = ma_anpr_ocr_v14(
    "small_fp32",       # 95%+ exact match, 300 FPS
    "eup",              # Specific region
    user, key, sig
)

Why: Excellent balance of speed and accuracy. Handles 90%+ of plates easily.

Use Case 2: Security Checkpoint (Critical)

Requirements: Maximum accuracy, can't miss plates

# Recommended Configuration
detector = ma_anpr_detector_v14(
    "large_640p_fp32",  # 99.31% detection (highest!)
    user, key, sig,
    backend="cuda",
    conf_thres=0.20     # Lower threshold for more detections
)

ocr = ma_anpr_ocr_v14(
    "large_fp32",       # 95%+ exact match, best accuracy
    "kr",               # Specific region for your area
    user, key, sig
)

Why: Maximum detection and recognition accuracy. No compromises.

Use Case 3: Traffic Monitoring (High Volume)

Requirements: Maximum speed, process many cameras

# Recommended Configuration
detector = ma_anpr_detector_v14(
    "small_320p_fp32",  # 98% detection, 142 FPS (fastest!)
    user, key, sig,
    backend="cuda",
    conf_thres=0.25
)

ocr = ma_anpr_ocr_v14(
    "small_fp32",       # 300 FPS (fastest OCR!)
    "univ",             # Universal for mixed traffic
    user, key, sig
)

Why: Fastest processing for high-volume applications. Can handle multiple streams.

Use Case 4: Mobile/Edge Device

Requirements: Small size, low power, on-device processing

# Recommended Configuration
detector = ma_anpr_detector_v14(
    "micro_320p_fp16",  # 97% detection, 42 MB (50% smaller!)
    user, key, sig,
    backend="cpu",      # CPU for mobile
    conf_thres=0.25
)

ocr = ma_anpr_ocr_v14(
    "pico_fp32",        # 20 MB, 270 FPS
    "kr",               # Specific region
    user, key, sig
)

Why: Smallest models, excellent for mobile/edge. Total size: 62 MB.

Use Case 5: Law Enforcement (Difficult Conditions)

Requirements: Works in poor lighting, angles, damaged plates

# Recommended Configuration
detector = ma_anpr_detector_v14(
    "medium_640p_fp32", # 99.21% detection
    user, key, sig,
    backend="cuda",
    conf_thres=0.15     # Very low threshold for difficult cases
)

ocr = ma_anpr_ocr_v14(
    "large_fp32",       # Best OCR accuracy
    "na",               # Specific region
    user, key, sig
)

Why: Handles difficult conditions better. Lower threshold catches more plates.

📈 Performance Comparison Chart

Detector Models: Speed vs Accuracy

  • 320p: Fastest = small_320p_fp32 (142 FPS, 98.00%), Balanced = micro_320p_fp32 (128 FPS, 97.13%), Most Accurate = large_320p_fp32 (131 FPS, 98.40%)
  • 640p: Fastest = small_640p_fp32 (70 FPS, 99.15%), Balanced = medium_640p_fp32 (66 FPS, 99.21%), Most Accurate = large_640p_fp32 (60 FPS, 99.31%)
  • Mobile (FP16): pico_320p_fp16 (50+ FPS, 37 MB), micro_320p_fp16 (56 FPS, 42 MB), small_320p_fp16 (70+ FPS, 57 MB)

OCR Models: Speed vs Accuracy

  • Smallest: pico_fp32 (20 MB, 270 FPS, 91.78% exact match)
  • Fastest: small_fp32 (112 MB, 300 FPS, 91.54% exact match)
  • Most Accurate: large_fp32 (179 MB, 262 FPS, 91.70% exact match)

💡 Performance Tips

1. GPU Acceleration is Essential

# CPU: ~1-2 FPS (slow!)
detector = ma_anpr_detector_v14(..., backend="cpu")

# CUDA (NVIDIA GPU): ~100+ FPS (fast!)
detector = ma_anpr_detector_v14(..., backend="cuda")

# DirectML (Windows GPU): ~50+ FPS
detector = ma_anpr_detector_v14(..., backend="directml")

Result: GPU is 50-100× faster than CPU!

2. Use Batch Processing

# Slow: Process one by one
for img in images:
    text, conf = ocr.predict(img)

# Fast: Process in batch (3-5× faster!)
results = ocr.predict(images)  # Pass list

3. Choose Resolution Wisely

  • 320p: Good quality images, controlled environment → Use 320p (2× faster)
  • 640p: Poor lighting, far distance, damaged plates → Use 640p (higher accuracy)
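As a tiny illustration of the naming pattern used in the tables above (the helper below is hypothetical, not part of the SDK):

# Hypothetical helper: pick a resolution tier following the guidance above
def pick_detector(size: str, difficult_conditions: bool) -> str:
    resolution = "640p" if difficult_conditions else "320p"
    return f"{size}_{resolution}_fp32"

print(pick_detector("micro", difficult_conditions=False))  # micro_320p_fp32
print(pick_detector("large", difficult_conditions=True))   # large_640p_fp32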

4. Tune Confidence Thresholds

# High precision (fewer false positives)
detector = ma_anpr_detector_v14(..., conf_thres=0.50)

# Balanced (recommended)
detector = ma_anpr_detector_v14(..., conf_thres=0.25)

# High recall (catch more plates, more false positives)
detector = ma_anpr_detector_v14(..., conf_thres=0.15)

5. Use Specific Regions

# ❌ Less accurate (universal)
ocr = ma_anpr_ocr_v14("small_fp32", "univ", ...)  # ~92% exact match

# ✅ More accurate (specific region)
ocr = ma_anpr_ocr_v14("small_fp32", "kr", ...)    # ~99% exact match!

🚀 Quick Decision Guide

Your Priority Detector OCR
Best Overall micro_320p_fp32 small_fp32
Fastest small_320p_fp32 small_fp32
Most Accurate large_640p_fp32 large_fp32
Smallest pico_320p_fp16 pico_fp32
Mobile micro_320p_fp16 pico_fp32
Balanced medium_320p_fp32 medium_fp32
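If you select models in code, the table above can be written as a small lookup (a sketch only; the priority labels are informal, not SDK identifiers):

# Sketch: map a priority label to (detector_model, ocr_model) from the table above
MODEL_PRESETS = {
    "best_overall":  ("micro_320p_fp32",  "small_fp32"),
    "fastest":       ("small_320p_fp32",  "small_fp32"),
    "most_accurate": ("large_640p_fp32",  "large_fp32"),
    "smallest":      ("pico_320p_fp16",   "pico_fp32"),
    "mobile":        ("micro_320p_fp16",  "pico_fp32"),
    "balanced":      ("medium_320p_fp32", "medium_fp32"),
}

detector_model, ocr_model = MODEL_PRESETS["best_overall"]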

📊 Benchmark Environment

  • GPU: NVIDIA RTX 3060 (CUDA 11.8)
  • CPU: Intel Core i7
  • Dataset: Real-world license plate images
  • Test Size: 1000+ images per region
  • Updated: December 2025

🎓 Key Takeaways

  • Two-stage pipeline: Detector → OCR
  • Mix and match models for your needs
  • 320p models: 2× faster, excellent for most uses
  • 640p models: Highest accuracy for difficult cases
  • GPU acceleration: 50-100× faster than CPU
  • Specific regions: Much better accuracy than universal
  • Batch processing: 3-5× faster for multiple images
  • Best overall: micro_320p_fp32 + small_fp32

💻 Example Configuration

from marearts_anpr import ma_anpr_detector_v14, ma_anpr_ocr_v14
from marearts_anpr import marearts_anpr_from_image_file

# Initialize models (one time)
detector = ma_anpr_detector_v14(
    "micro_320p_fp32",      # 97% detection, 128 FPS
    user_name, serial_key, signature,
    backend="cuda",          # GPU acceleration
    conf_thres=0.25          # Balanced threshold
)

ocr = ma_anpr_ocr_v14(
    "small_fp32",            # 95%+ accuracy, 300 FPS
    "eup",                   # Specific region for best accuracy
    user_name, serial_key, signature
)

# Process image
result = marearts_anpr_from_image_file(detector, ocr, "plate.jpg")
print(result)

# Output:
# {
#   "results": [
#     {
#       "ocr": "AB-123-CD",
#       "ocr_conf": 98.5,
#       "ltrb": [120, 230, 380, 290],
#       "ltrb_conf": 95
#     }
#   ],
#   "ltrb_proc_sec": 0.008,  # Detection time
#   "ocr_proc_sec": 0.003     # OCR time
# }
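Given that output format, a few lines are enough to pull out each plate and its box (the key names are taken from the sample output above):

# Iterate over detected plates in the result dictionary
for plate in result["results"]:
    text = plate["ocr"]
    confidence = plate["ocr_conf"]
    left, top, right, bottom = plate["ltrb"]
    print(f"{text} ({confidence}%) at [{left}, {top}, {right}, {bottom}]")

print(f"Detection: {result['ltrb_proc_sec']}s, OCR: {result['ocr_proc_sec']}s")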

🔗 Resources

  • 📊 Full Benchmarks: See detailed results in GitHub docs
  • 📚 Model Guide: Complete model documentation
  • 🧪 Try Free: ma-anpr test-api image.jpg
  • 🛒 Get License: MareArts ANPR

🎯 Conclusion

MareArts ANPR V14 offers 11 detector models and 5 OCR models, giving you 55+ possible combinations! The right choice depends on your specific requirements:

  • Speed-critical? → small_320p_fp32 + small_fp32
  • Accuracy-critical? → large_640p_fp32 + large_fp32
  • Balanced? → micro_320p_fp32 + small_fp32 (recommended!)
  • Mobile? → micro_320p_fp16 + pico_fp32

Start with the recommended configuration and tune based on your results. Happy optimizing! ⚡🚗


Labels: ANPR, MachineLearning, ComputerVision, Performance, Benchmarks, Models, Metrics, DeepLearning, Optimization, GPU

MareArts ANPR V14 - Advanced Manual Processing & Performance Tuning

 

⚡ MareArts ANPR V14 - Advanced Manual Processing

Ready to take control? In this advanced guide, I'll show you how to manually process detections, measure performance, and optimize for your specific use case.

🎯 Why Manual Processing?

  • Full control over detection pipeline
  • Custom filtering and post-processing
  • Performance measurement and optimization
  • Integration with existing computer vision pipelines
  • Custom confidence thresholds per stage

🔧 Manual Detection & OCR Pipeline

from marearts_anpr import ma_anpr_detector_v14, ma_anpr_ocr_v14
import cv2
from PIL import Image
import time

# Initialize models
detector = ma_anpr_detector_v14(
    "medium_640p_fp32",
    user_name, serial_key, signature,
    backend="cpu",
    conf_thres=0.25,
    iou_thres=0.5
)

ocr = ma_anpr_ocr_v14("medium_fp32", "eup", user_name, serial_key, signature)

# Load image
img = cv2.imread("plate.jpg")

# Step 1: Detect license plates
start = time.time()
detections = detector.detector(img)
detection_time = time.time() - start

print(f"Detection time: {detection_time:.4f}s")
print(f"Found {len(detections)} plate(s)")

# Step 2: Process each detection
results = []
ocr_time = 0

for i, box_info in enumerate(detections):
    # Get bounding box
    bbox = box_info['bbox']  # [x1, y1, x2, y2]
    score = box_info['score']  # Detection confidence
    
    # Crop plate region
    x1, y1, x2, y2 = int(bbox[0]), int(bbox[1]), int(bbox[2]), int(bbox[3])
    crop = img[y1:y2, x1:x2]
    
    if crop.size == 0:
        continue
    
    # Convert to PIL for OCR (OpenCV loads BGR, so convert to RGB first)
    pil_img = Image.fromarray(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB))
    
    # Run OCR
    start = time.time()
    text, confidence = ocr.predict(pil_img)
    elapsed = time.time() - start
    ocr_time += elapsed
    
    print(f"Plate {i+1}: {text} ({confidence}%) - {elapsed:.4f}s")
    
    results.append({
        "ocr": text,
        "ocr_conf": confidence,
        "bbox": [x1, y1, x2, y2],
        "det_conf": int(score * 100)
    })

print(f"\nTotal time: {detection_time + ocr_time:.4f}s")

📊 Detection Object Structure

# detector.detector(img) returns list of dictionaries:
[
    {
        'bbox': [x1, y1, x2, y2],  # Bounding box coordinates
        'score': 0.95,              # Detection confidence (0-1)
        'class': 'license_plate'    # Object class
    },
    ...
]

# ocr.predict(pil_image) returns tuple:
("ABC1234", 98.5)  # (text, confidence_percentage)

🚀 Backend Performance Comparison

backends = ["cpu", "cuda"]  # Add "directml" on Windows

for backend_name in backends:
    try:
        print(f"\nπŸ”§ Testing {backend_name}...")
        
        # Initialize with specific backend
        test_detector = ma_anpr_detector_v14(
            "medium_640p_fp32",
            user_name, serial_key, signature,
            backend=backend_name,
            conf_thres=0.25
        )
        
        # Measure performance
        start = time.time()
        detections = test_detector.detector(img)
        elapsed = time.time() - start
        
        print(f"Detected {len(detections)} plates in {elapsed:.4f}s")
        print(f"Speed: {1/elapsed:.1f} FPS")
        
    except Exception as e:
        print(f"⚠️ {backend_name} not available: {e}")

⚙️ Performance Results (Typical)

Backend Detection OCR Total FPS
CPU (i7) ~0.15s ~0.03s ~0.18s ~5.5
CUDA (RTX 3060) ~0.008s ~0.002s ~0.01s ~100

Result: GPU acceleration = 18x faster! 🚀

🎛️ Custom Filtering

# Filter detections by confidence
min_detection_conf = 0.50
min_ocr_conf = 80.0

filtered_results = []

for box_info in detections:
    if box_info['score'] < min_detection_conf:
        continue  # Skip low-confidence detections

    # Crop the plate region and run OCR
    x1, y1, x2, y2 = [int(v) for v in box_info['bbox']]
    plate_crop = Image.fromarray(cv2.cvtColor(img[y1:y2, x1:x2], cv2.COLOR_BGR2RGB))
    text, conf = ocr.predict(plate_crop)

    if conf < min_ocr_conf:
        continue  # Skip low-confidence OCR

    filtered_results.append({
        "text": text,
        "confidence": conf,
        "bbox": [x1, y1, x2, y2]
    })

print(f"After filtering: {len(filtered_results)} high-confidence plates")

🎨 Custom Visualization

import cv2

# Draw boxes and text on image
for result in results:
    x1, y1, x2, y2 = result['bbox']
    text = result['ocr']
    conf = result['ocr_conf']
    
    # Draw rectangle
    cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
    
    # Draw text
    label = f"{text} ({conf}%)"
    cv2.putText(img, label, (x1, y1-10), 
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

cv2.imwrite("result.jpg", img)

📹 Video Processing Pipeline

import cv2

# Open video
cap = cv2.VideoCapture("traffic.mp4")

frame_count = 0
plate_history = {}

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    
    frame_count += 1
    
    # Process every N frames (skip frames for speed)
    if frame_count % 5 != 0:
        continue
    
    # Detect plates
    detections = detector.detector(frame)
    
    for det in detections:
        bbox = det['bbox']
        x1, y1, x2, y2 = int(bbox[0]), int(bbox[1]), int(bbox[2]), int(bbox[3])
        crop = frame[y1:y2, x1:x2]
        
        if crop.size == 0:
            continue
        
        # OCR
        pil_crop = Image.fromarray(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB))
        text, conf = ocr.predict(pil_crop)
        
        # Track plates (simple tracking by position)
        plate_id = f"{x1//50}_{y1//50}"
        
        if plate_id not in plate_history:
            plate_history[plate_id] = []
        plate_history[plate_id].append(text)
        
        # Draw
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, text, (x1, y1-10), 
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    
    cv2.imshow('ANPR', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

# Print detected plates
print("\nDetected plates:")
for plate_id, texts in plate_history.items():
    # Most common text for this plate
    most_common = max(set(texts), key=texts.count)
    print(f"  {most_common} (seen {len(texts)} times)")

💾 Batch Processing from Directory

import os
from pathlib import Path

image_dir = Path("./images")
results_all = {}

for img_path in image_dir.glob("*.jpg"):
    print(f"Processing {img_path.name}...")
    
    img = cv2.imread(str(img_path))
    detections = detector.detector(img)
    
    plates = []
    for det in detections:
        bbox = det['bbox']
        x1, y1, x2, y2 = int(bbox[0]), int(bbox[1]), int(bbox[2]), int(bbox[3])
        crop = img[y1:y2, x1:x2]
        
        if crop.size > 0:
            pil_crop = Image.fromarray(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB))
            text, conf = ocr.predict(pil_crop)
            plates.append({"text": text, "conf": conf})
    
    results_all[img_path.name] = plates

# Save results
import json
with open("results.json", "w") as f:
    json.dump(results_all, f, indent=2)

print(f"\nProcessed {len(results_all)} images")

🎓 Advanced Tips

  • GPU Memory: Use cuda backend for 10-100x speedup
  • Confidence Tuning: Lower conf_thres to 0.15-0.20 for difficult images
  • IOU Threshold: Increase iou_thres to reduce duplicate detections
  • Batch Processing: Process multiple crops at once with ocr.predict([img1, img2, ...]) (see the sketch after this list)
  • Frame Skipping: Process every Nth frame in videos for speed
  • Multi-threading: Run detector and OCR in separate threads
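Here is a minimal sketch of the batch-OCR idea from the tips above, assuming ocr.predict accepts a list of PIL images (as noted above) and returns one (text, confidence) pair per crop:

# Collect all plate crops from one frame, then OCR them in a single batched call
crops = []
for det in detections:
    x1, y1, x2, y2 = [int(v) for v in det['bbox']]
    crop = img[y1:y2, x1:x2]
    if crop.size > 0:
        crops.append(Image.fromarray(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB)))

if crops:
    batch_results = ocr.predict(crops)  # One call instead of len(crops) calls
    for text, conf in batch_results:
        print(f"{text} ({conf}%)")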

πŸ” Troubleshooting

No detections?

  • Lower conf_thres to 0.15
  • Try larger model (large_640p_fp32)
  • Check image quality and resolution

Wrong OCR results?

  • Verify correct region (kr, eup, na, cn)
  • Try larger OCR model (large_fp32)
  • Check plate crop quality

Slow performance?

  • Use GPU backend (cuda or directml)
  • Use smaller models (small_640p_fp32, small_fp32)
  • Skip video frames
  • Batch process multiple images

💡 Conclusion

Manual processing gives you complete control over the ANPR pipeline. Use it for:

  • ✅ Custom filtering and validation
  • ✅ Performance optimization
  • ✅ Video stream processing
  • ✅ Integration with existing CV pipelines
  • ✅ Advanced visualization and tracking

Happy optimizing! ⚡🚗



8/25/2025

GPU memory vs shared memory



CK Tile Tutorial Day 2 (AMD hip programming) - Simple GEMM.

 Concepts Added:

  • 2D grid/block configuration
  • Matrix multiplication basics
  • Each thread computes one output element

Key Pattern:

// Each thread computes C[row][col]
for (int k = 0; k < K; k++) {
    sum += A[row][k] * B[k][col];
}
.
=== Thread Mapping Visualization ===
Each thread computes one C[i][j]:

  Block(0,0)        Block(1,0)
  ┌─────────┐      ┌─────────┐
  │T00 T01..│      │T00 T01..│
  │T10 T11..│      │T10 T11..│
  │... ... ..│      │... ... ..│
  └─────────┘      └─────────┘
       ↓                ↓
  C[0:16,0:16]    C[0:16,16:32]

Each thread's work:
  for k in 0..K:
    sum += A[row][k] * B[k][col]
  C[row][col] = sum

=== Step 2: Simple GEMM ===
Matrix multiply: (64x64) * (64x64) = (64x64)
Launching with grid(4,4), block(16,16)
Result: CORRECT
Time: 0.4232 ms
Performance: 1.23887 GFLOPS

=== Step 2: Simple GEMM ===
Matrix multiply: (128x128) * (128x128) = (128x128)
Launching with grid(8,8), block(16,16)
Result: CORRECT
Time: 0.03824 ms
Performance: 109.684 GFLOPS

Key Concepts Added:
1. 2D grid/block configuration
2. Each thread computes one output element
3. Row-major vs column-major layouts
4. Performance measurement (GFLOPS)
..

code
.
// Step 2: Simple GEMM (Matrix Multiplication)
// Building on Step 1, now each thread computes one output element

#include <hip/hip_runtime.h>
#include <iostream>
#include <vector>

// ============================================
// PART 1: Kernel Arguments
// ============================================
struct SimpleGemmKernelArgs {
    const float* a_ptr;  // M x K matrix
    const float* b_ptr;  // K x N matrix
    float* c_ptr;        // M x N matrix
    int M;
    int N;
    int K;

    SimpleGemmKernelArgs(const float* a, const float* b, float* c,
                         int m, int n, int k)
        : a_ptr(a), b_ptr(b), c_ptr(c), M(m), N(n), K(k) {}
};

// ============================================
// PART 2: The Kernel (One thread per output)
// ============================================
struct SimpleGemmKernel {
    static dim3 GridSize(const SimpleGemmKernelArgs& args) {
        // 16x16 threads per block
        int grid_m = (args.M + 15) / 16;
        int grid_n = (args.N + 15) / 16;
        return dim3(grid_n, grid_m, 1); // Note: x=N, y=M
    }

    static dim3 BlockSize() {
        return dim3(16, 16, 1); // 16x16 = 256 threads
    }

    __device__ void operator()(const SimpleGemmKernelArgs& args) const {
        // Each thread computes one element of C
        int col = blockIdx.x * blockDim.x + threadIdx.x; // N dimension
        int row = blockIdx.y * blockDim.y + threadIdx.y; // M dimension

        // Bounds check
        if (row >= args.M || col >= args.N) return;

        // Compute dot product for C[row][col]
        float sum = 0.0f;
        for (int k = 0; k < args.K; k++) {
            // A is row-major: A[row][k] = A[row * K + k]
            // B is column-major: B[k][col] = B[k + col * K]
            float a_val = args.a_ptr[row * args.K + k];
            float b_val = args.b_ptr[k + col * args.K];
            sum += a_val * b_val;
        }

        // Store result (C is row-major)
        args.c_ptr[row * args.N + col] = sum;
    }
};

// ============================================
// PART 3: Host Code
// ============================================
__global__ void simple_gemm_kernel(SimpleGemmKernelArgs args) {
    SimpleGemmKernel kernel;
    kernel(args);
}

void run_simple_gemm(int M, int N, int K) {
    std::cout << "\n=== Step 2: Simple GEMM ===\n";
    std::cout << "Matrix multiply: (" << M << "x" << K << ") * ("
              << K << "x" << N << ") = (" << M << "x" << N << ")\n";

    // Allocate host memory
    std::vector<float> h_a(M * K);
    std::vector<float> h_b(K * N);
    std::vector<float> h_c(M * N, 0.0f);

    // Initialize with simple values
    for (int i = 0; i < M * K; i++) h_a[i] = 1.0f;
    for (int i = 0; i < K * N; i++) h_b[i] = 2.0f;

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, M * K * sizeof(float));
    hipMalloc(&d_b, K * N * sizeof(float));
    hipMalloc(&d_c, M * N * sizeof(float));

    // Copy to device
    hipMemcpy(d_a, h_a.data(), M * K * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b.data(), K * N * sizeof(float), hipMemcpyHostToDevice);

    // Create kernel arguments
    SimpleGemmKernelArgs args(d_a, d_b, d_c, M, N, K);

    // Get launch configuration
    dim3 grid = SimpleGemmKernel::GridSize(args);
    dim3 block = SimpleGemmKernel::BlockSize();
    std::cout << "Launching with grid(" << grid.x << "," << grid.y
              << "), block(" << block.x << "," << block.y << ")\n";

    // Launch kernel
    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);
    hipEventRecord(start);

    simple_gemm_kernel<<<grid, block>>>(args);

    hipEventRecord(stop);
    hipEventSynchronize(stop);

    float milliseconds = 0;
    hipEventElapsedTime(&milliseconds, start, stop);

    // Copy result back
    hipMemcpy(h_c.data(), d_c, M * N * sizeof(float), hipMemcpyDeviceToHost);

    // Verify (each element should be K * 1.0 * 2.0 = 2K)
    float expected = 2.0f * K;
    bool correct = true;
    for (int i = 0; i < std::min(10, M * N); i++) {
        if (h_c[i] != expected) {
            correct = false;
            break;
        }
    }
    std::cout << "Result: " << (correct ? "CORRECT" : "WRONG") << "\n";
    std::cout << "Time: " << milliseconds << " ms\n";

    // Calculate FLOPS
    double flops = 2.0 * M * N * K; // 2 ops per multiply-add
    double gflops = (flops / milliseconds) / 1e6;
    std::cout << "Performance: " << gflops << " GFLOPS\n";

    // Cleanup
    hipFree(d_a);
    hipFree(d_b);
    hipFree(d_c);
    hipEventDestroy(start);
    hipEventDestroy(stop);
}

// ============================================
// VISUALIZATION: How threads map to output
// ============================================
void visualize_thread_mapping() {
    std::cout << "\n=== Thread Mapping Visualization ===\n";
    std::cout << "Each thread computes one C[i][j]:\n\n";
    std::cout << "  Block(0,0)        Block(1,0)\n";
    std::cout << "  ┌─────────┐      ┌─────────┐\n";
    std::cout << "  │T00 T01..│      │T00 T01..│\n";
    std::cout << "  │T10 T11..│      │T10 T11..│\n";
    std::cout << "  │... ... ..│      │... ... ..│\n";
    std::cout << "  └─────────┘      └─────────┘\n";
    std::cout << "       ↓                ↓\n";
    std::cout << "  C[0:16,0:16]    C[0:16,16:32]\n\n";
    std::cout << "Each thread's work:\n";
    std::cout << "  for k in 0..K:\n";
    std::cout << "    sum += A[row][k] * B[k][col]\n";
    std::cout << "  C[row][col] = sum\n";
}

// ============================================
// PART 4: Main
// ============================================
int main() {
    std::cout << "MareArts CK Tile Tutorial - Step 2: Simple GEMM\n";
    std::cout << "======================================\n";

    visualize_thread_mapping();

    // Run with different sizes
    run_simple_gemm(64, 64, 64);
    run_simple_gemm(128, 128, 128);

    std::cout << "\nKey Concepts Added:\n";
    std::cout << "1. 2D grid/block configuration\n";
    std::cout << "2. Each thread computes one output element\n";
    std::cout << "3. Row-major vs column-major layouts\n";
    std::cout << "4. Performance measurement (GFLOPS)\n";
    std::cout << "\nProblem: Each thread reads K elements from A and B\n";
    std::cout << "         → Poor memory reuse!\n";
    std::cout << "Next: Add tiling and shared memory for efficiency\n";
    return 0;
}
..

πŸ™‡πŸ»‍♂️
MareArts

8/24/2025

CK Tile Tutorial Day 1 (AMD hip programming) - Vector add.

.

Concepts:

  • Basic kernel structure: Args → Kernel → operator()
  • Grid/Block configuration
  • One thread per element processing

Key Code:

struct VectorAddKernel {
    __device__ void operator()(args) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        c[idx] = a[idx] + b[idx];
    }
};


..

code

..

// Step 1: Simplest CK Tile Kernel - Vector Addition
// This demonstrates the absolute basics of CK Tile

#include <hip/hip_runtime.h>
#include <iostream>
#include <vector>

// ============================================
// PART 1: Kernel Arguments (Host → Device)
// ============================================
struct VectorAddKernelArgs {
    const float* a_ptr;
    const float* b_ptr;
    float* c_ptr;
    int n;

    // Constructor from host arguments
    VectorAddKernelArgs(const float* a, const float* b, float* c, int size)
        : a_ptr(a), b_ptr(b), c_ptr(c), n(size) {}
};

// ============================================
// PART 2: The Kernel
// ============================================
struct VectorAddKernel {
    // Static method to get grid size (how many blocks)
    static dim3 GridSize(const VectorAddKernelArgs& args) {
        // 256 threads per block, divide work
        int blocks = (args.n + 255) / 256;
        return dim3(blocks, 1, 1);
    }

    // Static method to get block size (threads per block)
    static dim3 BlockSize() {
        return dim3(256, 1, 1);
    }

    // The actual kernel function - called by each thread
    __device__ void operator()(const VectorAddKernelArgs& args) const {
        // Calculate global thread index
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // Check bounds
        if (idx < args.n) {
            // Each thread does one element
            args.c_ptr[idx] = args.a_ptr[idx] + args.b_ptr[idx];
        }
    }
};

// ============================================
// PART 3: Host Launch Function
// ============================================
__global__ void vector_add_kernel(VectorAddKernelArgs args) {
    VectorAddKernel kernel;
    kernel(args);
}

void run_vector_add(int n) {
    std::cout << "\n=== Step 1: Vector Addition ===\n";
    std::cout << "Adding two vectors of size " << n << "\n";

    // Allocate host memory
    std::vector<float> h_a(n, 1.0f);
    std::vector<float> h_b(n, 2.0f);
    std::vector<float> h_c(n, 0.0f);

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, n * sizeof(float));
    hipMalloc(&d_b, n * sizeof(float));
    hipMalloc(&d_c, n * sizeof(float));

    // Copy to device
    hipMemcpy(d_a, h_a.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // Create kernel arguments
    VectorAddKernelArgs args(d_a, d_b, d_c, n);

    // Get launch configuration
    dim3 grid = VectorAddKernel::GridSize(args);
    dim3 block = VectorAddKernel::BlockSize();
    std::cout << "Launching with grid(" << grid.x << "), block(" << block.x << ")\n";

    // Launch kernel
    vector_add_kernel<<<grid, block>>>(args);

    // Copy result back
    hipMemcpy(h_c.data(), d_c, n * sizeof(float), hipMemcpyDeviceToHost);

    // Verify
    bool correct = true;
    for (int i = 0; i < std::min(10, n); i++) {
        if (h_c[i] != 3.0f) {
            correct = false;
            break;
        }
    }
    std::cout << "Result: " << (correct ? "CORRECT" : "WRONG") << "\n";

    std::cout << "First 5 elements: ";
    for (int i = 0; i < std::min(5, n); i++) {
        std::cout << h_c[i] << " ";
    }
    std::cout << "\n";

    // Cleanup
    hipFree(d_a);
    hipFree(d_b);
    hipFree(d_c);
}

// ============================================
// PART 4: Main
// ============================================
int main() {
    std::cout << "MareArts CK Tile Tutorial - Step 1: Vector Addition\n";
    std::cout << "==========================================\n";

    // Run with different sizes
    run_vector_add(1024);
    run_vector_add(10000);

    std::cout << "\nKey Concepts Demonstrated:\n";
    std::cout << "1. Kernel structure: Args → Kernel → operator()\n";
    std::cout << "2. Grid/Block configuration\n";
    std::cout << "3. Each thread processes one element\n";
    std::cout << "4. Bounds checking for safety\n";
    return 0;
}

...


Result

CK Tile Tutorial - Step 1: Vector Addition

==========================================


=== Step 1: Vector Addition ===

Adding two vectors of size 1024

Launching with grid(4), block(256)

Result: CORRECT

First 5 elements: 3 3 3 3 3 


=== Step 1: Vector Addition ===

Adding two vectors of size 10000

Launching with grid(40), block(256)

Result: CORRECT

First 5 elements: 3 3 3 3 3 


Key Concepts Demonstrated:

1. Kernel structure: Args → Kernel → operator()

2. Grid/Block configuration

3. Each thread processes one element

4. Bounds checking for safety



3/07/2025

Check my torch support GPU

checkgpu.py

..

import torch

# Check PyTorch version
print(f"PyTorch version: {torch.__version__}")

# Check if CUDA/ROCm is available (unified API in newer PyTorch)
print(f"Is GPU available: {torch.cuda.is_available()}")

# Check how many GPUs are available
if torch.cuda.is_available():
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    # Print device properties for each GPU
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"\nDevice {i}: {props.name}")
        print(f" Total memory: {props.total_memory / 1024**3:.2f} GB")
        if hasattr(props, 'major'):
            print(f" Compute capability: {props.major}.{props.minor}")

# Try a simple operation on GPU
if torch.cuda.is_available():
    device = torch.device("cuda:0")  # Use the first GPU
    x = torch.ones(5, 5, device=device)
    y = x + 1
    print("\nGPU computation test:")
    print(y)
    print("GPU computation successful! study.marearts.com")
else:
    print("\nNo GPUs available for PyTorch.")

.

🙏

Thank you!

9/17/2024

What is IREE turbine

 IREE-Turbine is a package or toolset that combines PyTorch, Torch-MLIR, IREE, and additional tools to provide a comprehensive solution for compiling, optimizing, and executing PyTorch models using IREE's infrastructure. Based on the information in the image, IREE-Turbine offers the following key features:


1. AOT Export: This allows for Ahead-Of-Time compilation of PyTorch modules (nn.Modules) into deployment-ready artifacts. These compiled artifacts can then take full advantage of IREE's runtime features.


2. Eager Execution: It provides a torch.compile backend and a Turbine Tensor/Device for interactive PyTorch sessions. This enables users to work with PyTorch in a familiar environment while leveraging IREE's optimization capabilities.


3. Custom Ops: IREE-Turbine offers integration for defining custom PyTorch operations and implementing them using either IREE's backend IR or the Pythonic kernel language. This allows for extending PyTorch's functionality while maintaining compatibility with IREE's optimization pipeline.


In essence, IREE-Turbine acts as a bridge between PyTorch and IREE, allowing PyTorch users to benefit from IREE's advanced compilation and runtime features while maintaining a familiar PyTorch-based workflow. It aims to provide a seamless experience for compiling PyTorch models to run efficiently on various hardware targets supported by IREE.
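For the eager-execution path, usage looks roughly like the sketch below. It relies only on the standard torch.compile API; the Turbine backend name string and its availability come from the iree-turbine package and are an assumption here, not verified against its documentation:

```python
import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x @ x.T)

model = TinyModel()

# torch.compile with a Turbine backend routes compilation through IREE.
# The backend string ("turbine_cpu") is an assumption; consult the iree-turbine docs.
compiled = torch.compile(model, backend="turbine_cpu")
print(compiled(torch.randn(4, 4)))
```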


HIP kernel for matrix multiplication that can leverage Matrix Cores

Here's an example of a custom HIP kernel for matrix multiplication that can leverage Matrix Cores:



```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#include <iostream>

// Define matrix dimensions
#define M 16
#define N 16
#define K 16

// HIP kernel for matrix multiplication
__global__ void matrixMulKernel(half* A, half* B, float* C) {
    // Shared memory for tile of A and B
    __shared__ half As[M][K];
    __shared__ half Bs[K][N];

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = K * M * by;
    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + K - 1;
    // Step size used to iterate through the sub-matrices of A
    int aStep = M;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = N * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = K * N;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // Load the matrices from device memory to shared memory
        As[ty][tx] = A[a + K * ty + tx];
        Bs[ty][tx] = B[b + N * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices
        #pragma unroll
        for (int k = 0; k < K; ++k) {
            Csub += __half2float(As[ty][k]) * __half2float(Bs[k][tx]);
        }

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory
    // each thread writes one element
    int c = N * M * by + M * bx;
    C[c + N * ty + tx] = Csub;
}

// Host function to launch the kernel
void launchMatrixMulKernel(half* A, half* B, float* C, int numBlocks) {
    dim3 threadsPerBlock(M, N);
    dim3 blocksPerGrid(numBlocks, numBlocks);
    hipLaunchKernelGGL(matrixMulKernel, blocksPerGrid, threadsPerBlock, 0, 0, A, B, C);
}

// Error checking macro
#define HIP_CHECK(call) { hipError_t err = call; if (err != hipSuccess) { std::cerr << "HIP error: " << hipGetErrorString(err) << std::endl; exit(1); } }

int main() {
    // Allocate memory
    half *A, *B;
    float *C;
    HIP_CHECK(hipMalloc(&A, M * K * sizeof(half)));
    HIP_CHECK(hipMalloc(&B, K * N * sizeof(half)));
    HIP_CHECK(hipMalloc(&C, M * N * sizeof(float)));

    // Initialize matrices (you would typically do this on the GPU)
    // ...

    // Launch kernel
    launchMatrixMulKernel(A, B, C, 1); // Assuming 1 block for simplicity

    // Clean up
    HIP_CHECK(hipFree(A));
    HIP_CHECK(hipFree(B));
    HIP_CHECK(hipFree(C));

    return 0;
}

```


Key points about this example:

1. It uses `half` precision for input matrices A and B, which can potentially benefit from Matrix Core acceleration.

2. The kernel is designed for 16x16 matrices, which is a common size for Matrix Core operations.

3. Shared memory is used to improve performance by reducing global memory accesses.

4. The main computation loop uses `__half2float` conversions. On GPUs with native FP16 support, these conversions might be optimized out.

5. The kernel uses a tiled approach, which is generally efficient for matrix multiplication.

6. Error checking is included for HIP calls.

Important considerations:

1. This kernel doesn't guarantee the use of Matrix Cores. The actual use of Matrix Cores depends on the GPU architecture and the HIP compiler's optimizations.

2. For larger matrices, you'd need to implement a more sophisticated tiling strategy.

3. Performance tuning is crucial. You might need to experiment with different block sizes and memory access patterns for optimal performance.

4. The HIP runtime and compiler will attempt to optimize this code for the target GPU, potentially leveraging Matrix Cores if available.

5. For production use, you should implement proper error handling and potentially use more sophisticated synchronization methods.

To fully leverage Matrix Cores, you might need to use specific intrinsics or rely on compiler optimizations. The exact method can vary depending on the GPU architecture and HIP version. Always profile your code to ensure you're getting the expected performance benefits.

Creating a custom CUDA kernel that directly utilizes tensor cores



Creating a custom CUDA kernel that directly utilizes tensor cores is an advanced topic, as tensor cores are typically accessed through higher-level libraries like cuBLAS or cuDNN. However, NVIDIA does provide a way to use tensor cores in custom kernels through the Warp Matrix Multiply-Accumulate (WMMA) API in the CUDA C++ headers (mma.h). Here's an overview of how to create a kernel that works on tensor cores:

1. Use the CUDA WMMA API:
The WMMA API allows you to program tensor cores directly in your CUDA kernels.

2. Include necessary headers:
```cpp
#include <mma.h>
#include <cuda_fp16.h>
```

3. Use appropriate data types:
Tensor cores work with specific data types like half precision floating point (`__half`).

4. Define matrix fragments:
Use `nvcuda::wmma::fragment` to define matrix fragments that will be processed by tensor cores.

5. Load, compute, and store operations:
Use WMMA load, multiply-accumulate, and store operations.

Here's an example kernel that uses tensor cores via the WMMA API:



```cuda
#include <mma.h>
#include <cuda_fp16.h>

// Define matrix dimensions
const int M = 16;
const int N = 16;
const int K = 16;

__global__ void wmma_example(half *a, half *b, float *c) {
    // Declare the fragments
    nvcuda::wmma::fragment<nvcuda::wmma::matrix_a, M, N, K, half, nvcuda::wmma::col_major> a_frag;
    nvcuda::wmma::fragment<nvcuda::wmma::matrix_b, M, N, K, half, nvcuda::wmma::col_major> b_frag;
    nvcuda::wmma::fragment<nvcuda::wmma::accumulator, M, N, K, float> c_frag;

    // Initialize the output to zero
    nvcuda::wmma::fill_fragment(c_frag, 0.0f);

    // Load the inputs
    nvcuda::wmma::load_matrix_sync(a_frag, a, K);
    nvcuda::wmma::load_matrix_sync(b_frag, b, K);

    // Perform the matrix multiplication
    nvcuda::wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Store the output
    nvcuda::wmma::store_matrix_sync(c, c_frag, N, nvcuda::wmma::mem_row_major);
}

// Host function to launch the kernel
void launch_wmma_kernel(half *a, half *b, float *c) {
    dim3 gridDim(1);
    dim3 blockDim(32); // One warp
    wmma_example<<<gridDim, blockDim>>>(a, b, c);
}

```

Key points about this example:

1. We're using 16x16 matrices as this is a common size for tensor core operations.
2. The kernel uses `nvcuda::wmma::fragment` to define matrix fragments.
3. `load_matrix_sync`, `mma_sync`, and `store_matrix_sync` are used to load data, perform matrix multiplication, and store results using tensor cores.
4. The kernel operates on half-precision input (`half`) and produces single-precision output (`float`).

To use this kernel:

1. Compile with a CUDA compiler that supports tensor cores (CUDA 9.0 or later).
2. Use appropriate GPU architecture flags (e.g., `-arch=sm_70` for Volta, `-arch=sm_75` for Turing).
3. Allocate memory and copy data to the GPU before calling `launch_wmma_kernel`.

Important considerations:

1. Error checking is omitted for brevity but should be included in production code.
2. This is a basic example. Real-world usage often involves tiling and more complex memory access patterns for larger matrices.
3. Performance tuning is crucial. The exact dimensions and data types should be chosen based on your specific use case and target GPU architecture.
4. Not all operations can be efficiently mapped to tensor cores. They're most beneficial for large matrix multiplications common in deep learning workloads.

Remember, while this approach gives you direct control over tensor core usage, in many cases, using higher-level libraries like cuBLAS or cuDNN is more practical and can automatically leverage tensor cores when appropriate.