utf8encrypt

A Zig library for UTF-8 length-preserving encryption that encrypts UTF-8 text while preserving valid UTF-8 encoding and exact byte length.

⚠️ EXPERIMENTAL: This library is highly experimental. Do not use in production systems or for protecting sensitive data.

Features

Valid UTF-8 Output: Encrypted text is guaranteed to be valid UTF-8
Byte-Length Preservation: Output has identical byte count as input
Class Preservation: Each code point stays within its UTF-8 byte-length class (1-4 bytes)
Control Character Passthrough: ASCII control characters (codes 0-31) are preserved unchanged for structural integrity
Optional Boundary Space Avoidance: Configurable option to ensure first and last ciphertext characters are never spaces (for class 1 printable ASCII)
Content-Dependent Permutation: Character class sequence is shuffled based on key and content to hide structural patterns

Quick Start

const std = @import("std");
const utf8encrypt = @import("utf8encrypt");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Initialize cipher with 16-byte key and boundary space avoidance option
    const key: [16]u8 = @splat(0x2B);
    const avoid_boundary_spaces = false; // Set to true to prevent spaces at boundaries
    var cipher = try utf8encrypt.Utf8Cipher.init(allocator, &key, avoid_boundary_spaces);
    defer cipher.deinit();

    // Encrypt UTF-8 text
    const plaintext = "Hello, World!";
    const tweak = "context:record123:field:name";
    const ciphertext = try cipher.encrypt(plaintext, tweak);
    defer allocator.free(ciphertext);

    std.debug.print("Plaintext:  {s} ({d} bytes)\n", .{ plaintext, plaintext.len });
    std.debug.print("Ciphertext: {s} ({d} bytes)\n", .{ ciphertext, ciphertext.len });

    // Decrypt
    const decrypted = try cipher.decrypt(ciphertext, tweak);
    defer allocator.free(decrypted);

    std.debug.print("Match: {}\n", .{std.mem.eql(u8, plaintext, decrypted)});
}

How It Works

UTF-8 Byte-Length Classes

The library divides Unicode code points into 4 classes based on UTF-8 byte length:

Class	Code Point Range	UTF-8 Bytes	Domain Size	Encryption Method
1	U+0000 - U+007F	1 byte	96	Fisher-Yates permutation (tweak-dependent)
2	U+0080 - U+07FF	2 bytes	1,920	FAST cipher + cycle walking
3	U+0800 - U+FFFF	3 bytes	61,440	FAST cipher + cycle walking
4	U+10000 - U+10FFFF	4 bytes	1,048,576	FAST cipher + cycle walking

Class 1 control characters (U+0000 - U+001F, codes 0-31) are not encrypted and pass through unchanged. Only printable ASCII (U+0020 - U+007F, codes 32-127) are encrypted.

Class 3 excludes surrogates U+D800 - U+DFFF (invalid in UTF-8)

Encryption Process

Code Point Classification: Each code point is classified into one of the 4 byte-length classes
Control Character Passthrough: Class 1 control characters (codes 0-31) are preserved unchanged
Domain Mapping: Code points are mapped to domain indices (0..N-1) within their class
Format-Preserving Encryption:
- Class 1 (printable ASCII): Uses Fisher-Yates shuffle with 256-bit seed from TurboSHAKE128
- Classes 2-4: Uses FAST cipher with cycle walking until result falls within valid domain
Boundary Space Avoidance (Class 1 only, optional):
- Can be enabled via avoid_boundary_spaces parameter in initialization
- When enabled: For first and last positions, if Fisher-Yates LUT produces a space, apply LUT again
- This ensures boundary characters in ciphertext are never spaces (for class 1)
- Applied before content-dependent permutation
- Disabled by default for maximum flexibility
Chaining: Position-dependent tweaks incorporate previous ciphertext bytes (CBC-like mode)
- Note: Control characters participate in chaining even though they're not encrypted
Content-Dependent Permutation: The order of encrypted code points is shuffled based on:
- Sorted encrypted code points (for deterministic permutation derivation)
- Encryption key (for key-dependent permutation)
- Fisher-Yates shuffle with TurboSHAKE128 as PRNG
UTF-8 Encoding: Shuffled encrypted code points are encoded back to UTF-8

This process ensures that:

Each code point remains within its original byte-length class
Control characters (newlines, tabs, etc.) are preserved for structural integrity
Boundary characters (first/last) can optionally be prevented from becoming spaces (for class 1 printable ASCII)
Total byte length is preserved
The sequence of character classes is permuted to hide structural patterns

Chaining Mode

The library uses a CBC-like chaining mode:

IV Generation: IV = TurboSHAKE128(base_tweak)[0..16]
Chained Tweaks: tweak_i = base_tweak || ":pos:" || i || ":chain:" || C_{i-1} (binary concatenation)
Identical plaintexts at different positions produce different ciphertexts

Use Cases

This library is ideal for encrypting UTF-8 text in length-constrained environments:

Social Network Posts: Encrypt messages while respecting character/byte limits (Twitter, Mastodon, etc.)
Database Fields: Encrypt VARCHAR/TEXT fields without changing column size requirements
Filesystem: Encrypt filenames that must be valid UTF-8 and have byte length restrictions
Protocol Messages: Encrypt fixed-length UTF-8 fields in network protocols
Legacy Systems: Encrypt data for systems that validate UTF-8 and enforce byte limits

Security Considerations

Cryptographic Properties

Deterministic Encryption: Same plaintext + key + tweak always produces same ciphertext (by design for format-preserving encryption)
Primitives:
- Classes 2-4: Use FAST cipher (format-preserving encryption algorithm)
- Class 1: Uses ChaCha PRNG with 256-bit seeds derived from TurboSHAKE128

Limitations

Not IND-CPA Secure: Not probabilistic; not suitable for high-security applications requiring IND-CPA security
Control Characters Unencrypted: ASCII control characters (codes 0-31 like newlines, tabs, etc.) pass through unencrypted
Frequency Preservation: Character frequency within each class is preserved (susceptible to frequency analysis within each class)
Structural Obfuscation: While the sequence of character classes is permuted to hide patterns, the multiset of classes remains the same (e.g., input with 2 ASCII + 1 emoji will produce output with 2 ASCII + 1 emoji, but in different positions)
No Authentication: Provides confidentiality only; no integrity or authentication guarantees

Performance

Encryption Speed by Class

Performance varies by UTF-8 class due to different encryption methods:

Class 1 (ASCII): ~25% slower than pre-computed tables (Fisher-Yates per code point)
Class 2: ~34 average cycle-walking iterations
Class 3: ~1.03 average cycle-walking iterations
Class 4: ~16 average cycle-walking iterations

Sample Output

Here are examples showing actual input text and encrypted output:

ASCII Text

Plaintext:  Hello, World! (13 bytes)
Ciphertext: L;ba>{AeiWr (13 bytes)

The encrypted text maintains the exact byte length and remains valid UTF-8, but the content is transformed.

Multi-Byte UTF-8 (Latin + Chinese)

Plaintext:  Héllo 世界 (13 bytes)
Ciphertext: LȦ.O@䧵 (13 bytes)

Characters from different UTF-8 classes are encrypted while preserving their byte-length classes. The Chinese characters (3 bytes each) remain 3-byte UTF-8 characters in the output. Note that the order of character classes may be permuted in the encrypted output.

Emoji (4-Byte UTF-8)

Plaintext:  🌍 emoji test 🚀 (20 bytes)
Ciphertext: 񃻸RhufBh]E񄃂 (20 bytes)

Emojis are 4-byte UTF-8 sequences that encrypt to other 4-byte UTF-8 sequences, maintaining the byte structure. The permutation may reorder ASCII and emoji characters while preserving the total byte count.

Accented Characters

Plaintext:  Café résumé naïve (21 bytes)
Ciphertext: ~Ԟtݛ]0و>tM¸+ (21 bytes)

Latin characters with diacritics (é, ï) are 2-byte UTF-8 sequences that encrypt to other 2-byte sequences. Each character stays within its byte-length class, but the class sequence may be reordered.

Note: The encrypted output shown above is deterministic for a given key and tweak. Different keys, tweaks, or positions will produce different ciphertexts.

Examples

Basic Encryption/Decryption

// Initialize with boundary space avoidance disabled (default)
const cipher = try Utf8Cipher.init(allocator, &key, false);
defer cipher.deinit();

const ciphertext = try cipher.encrypt("Hello, World!", "myapp:user123:field:message");
defer allocator.free(ciphertext);

const plaintext = try cipher.decrypt(ciphertext, "myapp:user123:field:message");
defer allocator.free(plaintext);

With Boundary Space Avoidance

// Initialize with boundary space avoidance enabled
const cipher = try Utf8Cipher.init(allocator, &key, true);
defer cipher.deinit();

// First and last characters will never be spaces (for class 1 printable ASCII)
const ciphertext = try cipher.encrypt("Hello, World!", "myapp:user123:field:message");
defer allocator.free(ciphertext);

Unique Tweaks for Different Contexts

const cipher = try Utf8Cipher.init(allocator, &key, false);
defer cipher.deinit();

// Different tweaks for different contexts prevent cross-context attacks
const email_encrypted = try cipher.encrypt(user_email, "app:users:email");
defer allocator.free(email_encrypted);

const phone_encrypted = try cipher.encrypt(user_phone, "app:users:phone");
defer allocator.free(phone_encrypted);

Multi-Byte UTF-8

const cipher = try Utf8Cipher.init(allocator, &key, false);
defer cipher.deinit();

const multilingual = "Hello 世界 مرحبا";
const encrypted = try cipher.encrypt(multilingual, "demo");
defer allocator.free(encrypted);

// Byte length preserved
std.debug.assert(encrypted.len == multilingual.len);

// Valid UTF-8 preserved
std.debug.assert(std.unicode.utf8ValidateSlice(encrypted));

Class Sequence Permutation Demo

const cipher = try Utf8Cipher.init(allocator, &key, false);
defer cipher.deinit();

// Example showing how class sequences are permuted
const plaintext = "AB世";  // Classes: [ASCII, ASCII, 3-byte]
const ciphertext = try cipher.encrypt(plaintext, "test");
defer allocator.free(ciphertext);

// Ciphertext has SAME multiset of classes but DIFFERENT order
// For example, might produce: [3-byte, ASCII, ASCII]
// Total byte length: 5 bytes (unchanged)
// Each character stayed in its class (ASCII→ASCII, 3-byte→3-byte)
// But the ORDER was permuted based on content and key

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build.zig		build.zig
build.zig.zon		build.zig.zon

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

utf8encrypt

Features

Quick Start

How It Works

UTF-8 Byte-Length Classes

Encryption Process

Chaining Mode

Use Cases

Security Considerations

Cryptographic Properties

Limitations

Performance

Encryption Speed by Class

Sample Output

ASCII Text

Multi-Byte UTF-8 (Latin + Chinese)

Emoji (4-Byte UTF-8)

Accented Characters

Examples

Basic Encryption/Decryption

With Boundary Space Avoidance

Unique Tweaks for Different Contexts

Multi-Byte UTF-8

Class Sequence Permutation Demo

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

jedisct1/zig-utf8encrypt

Folders and files

Latest commit

History

Repository files navigation

utf8encrypt

Features

Quick Start

How It Works

UTF-8 Byte-Length Classes

Encryption Process

Chaining Mode

Use Cases

Security Considerations

Cryptographic Properties

Limitations

Performance

Encryption Speed by Class

Sample Output

ASCII Text

Multi-Byte UTF-8 (Latin + Chinese)

Emoji (4-Byte UTF-8)

Accented Characters

Examples

Basic Encryption/Decryption

With Boundary Space Avoidance

Unique Tweaks for Different Contexts

Multi-Byte UTF-8

Class Sequence Permutation Demo

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages