
datas3t

A high-performance data management system for storing, indexing, and retrieving large-scale datas3ts in S3-compatible object storage.

Overview

datas3t is designed for efficiently managing datas3ts containing millions of individual files (called "datapoints"). It stores files as indexed TAR archives in S3-compatible storage, enabling fast random access without the overhead of extracting entire archives.

Key Features

🗜️ Efficient Storage

  • Packs individual files into TAR archives
  • Eliminates S3 object overhead for small files
  • Supports datas3ts with millions of datapoints

⚡ Fast Random Access

  • Creates lightweight indices for TAR archives
  • Enables direct access to specific files without extraction
  • Disk-based caching for frequently accessed indices

🔒 Flexible TLS Configuration

  • TLS usage determined by endpoint protocol (https:// vs http://)
  • No separate TLS flags needed - follows standard URL conventions
  • Seamless integration with various S3-compatible services

📦 Range-based Operations

  • Upload and download data in configurable chunks (dataranges)
  • Supports partial datas3t retrieval
  • Parallel processing of multiple ranges

🔗 Direct Client-to-Storage Transfer

  • Uses S3 presigned URLs for efficient data transfer
  • Bypasses server for large file operations
  • Supports multipart uploads for large datas3ts

🔄 Datarange Aggregation

  • Combines multiple small dataranges into larger ones
  • Reduces S3 object count and improves download performance
  • Validates continuous datapoint coverage before aggregation
  • Atomic operations with automatic cleanup on failure

📦 Datas3t Import

  • Discovers and imports existing datas3ts from S3 buckets
  • Scans for objects matching datas3t patterns automatically
  • Disaster recovery and migration support
  • Maintains upload counter consistency
  • Prevents duplicate imports with idempotent operations

🛡️ Data Integrity

  • Validates TAR structure and file naming conventions
  • Ensures datapoint consistency across operations
  • Transactional database operations

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────────┐
│   Client/CLI    │───▶│   HTTP API       │───▶│   PostgreSQL        │
│                 │    │   Server         │    │   (Metadata)        │
└─────────────────┘    └──────────────────┘    └─────────────────────┘
                                │
                                ▼
                       ┌──────────────────┐
                       │  S3-Compatible   │
                       │  Object Storage  │
                       │  (TAR Archives)  │
                       └──────────────────┘

Components

  • HTTP API Server: REST API for datas3t management
  • Client Library: Go SDK for programmatic access
  • PostgreSQL Database: Stores metadata and indices
  • S3-Compatible Storage: Stores TAR archives and indices
  • TAR Indexing Engine: Creates fast-access indices
  • Disk Cache: Local caching for performance
  • Key Deletion Service: Background worker for automatic S3 object cleanup

Core Concepts

Datas3ts

Named collections of related datapoints. Each datas3t is associated with an S3 bucket configuration.

Datapoints

Individual files within a datas3t, numbered sequentially:

  • 00000000000000000001.txt
  • 00000000000000000002.jpg
  • 00000000000000000003.json

Dataranges

Contiguous chunks of datapoints stored as TAR archives:

  • datas3t/my-datas3t/dataranges/00000000000000000001-00000000000000001000-000000000001.tar
  • datas3t/my-datas3t/dataranges/00000000000000001001-00000000000000002000-000000000002.tar

TAR Indices

Lightweight index files enabling fast random access:

  • datas3t/my-datas3t/dataranges/00000000000000000001-00000000000000001000-000000000001.index

Import Operations

Process of discovering and importing existing datas3ts from S3 buckets:

  • Pattern Recognition: Automatically detects objects matching datas3t naming conventions
  • Duplicate Prevention: Skips existing dataranges to prevent conflicts
  • Upload Counter Management: Maintains counter consistency for future uploads
  • Transaction Safety: All imports are performed atomically per datas3t

Aggregation Operations

Process of combining multiple small dataranges into larger ones for improved efficiency:

  • Coverage Validation: Ensures continuous datapoint coverage with no gaps
  • Atomic Replacement: Original dataranges are replaced atomically after successful aggregation
  • Parallel Processing: Downloads and uploads are performed in parallel for optimal performance
  • Multipart Support: Large aggregates use multipart uploads for reliability

Quick Start

Prerequisites

  • Nix (all development commands below run inside nix develop)
  • A PostgreSQL database for metadata
  • Access to an S3-compatible object store

Development Setup

# Clone the repository
git clone https://github.com/draganm/datas3t.git
cd datas3t

# Enter development environment
nix develop

# Run tests
nix develop -c go test ./...

# Generate code
nix develop -c go generate ./...

Running the Server

# Set environment variables
export DB_URL="postgres://user:password@localhost:5432/datas3t"
export ADDR=":8765"

# Run the server
nix develop -c go run ./cmd/datas3t server

# Server will start on http://localhost:8765

API Usage

1. Configure S3 Bucket

curl -X POST http://localhost:8765/api/bucket \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-bucket-config",
    "endpoint": "https://s3.amazonaws.com",
    "bucket": "my-data-bucket",
    "access_key": "ACCESS_KEY",
    "secret_key": "SECRET_KEY"
  }'

2. Create Datas3t

curl -X POST http://localhost:8765/api/datas3t \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-datas3t",
    "bucket": "my-bucket-config"
  }'

3. Upload Datarange

# Start upload
curl -X POST http://localhost:8765/api/datarange/upload/start \
  -H "Content-Type: application/json" \
  -d '{
    "datas3t_name": "my-datas3t",
    "first_datapoint_index": 1,
    "number_of_datapoints": 1000,
    "data_size": 1048576
  }'

# Use returned presigned URLs to upload TAR archive and index
# Then complete the upload
curl -X POST http://localhost:8765/api/datarange/upload/complete \
  -H "Content-Type: application/json" \
  -d '{
    "datarange_upload_id": 123
  }'
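
The start call returns presigned URLs for uploading the TAR archive and its index. A hedged sketch of that upload step with curl, where $TAR_PUT_URL and $INDEX_PUT_URL stand in for the URLs from the start response:

# Hypothetical: substitute the presigned PUT URLs returned by the start call
curl -X PUT "$TAR_PUT_URL" --data-binary @data.tar
curl -X PUT "$INDEX_PUT_URL" --data-binary @data.index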

4. Download Datapoints

curl -X POST http://localhost:8765/api/download/presign \
  -H "Content-Type: application/json" \
  -d '{
    "datas3t_name": "my-datas3t",
    "first_datapoint": 100,
    "last_datapoint": 200
  }'
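
The response contains presigned URLs with associated byte ranges. A sketch of fetching one segment with curl, where $PRESIGNED_URL and the byte range stand in for values from the response:

# Hypothetical: substitute the URL and range returned by the presign call
curl -H "Range: bytes=0-1048575" "$PRESIGNED_URL" -o segment-1.part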

5. Import Existing Datas3ts

# Import datas3ts from S3 bucket
curl -X POST http://localhost:8765/api/v1/datas3ts/import \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "my-bucket-config"
  }'

6. Clear Datas3t

# Clear all dataranges from a datas3t
curl -X POST http://localhost:8765/api/v1/datas3ts/clear \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-datas3t"
  }'

7. Aggregate Dataranges

# Start aggregation
curl -X POST http://localhost:8765/api/v1/aggregate \
  -H "Content-Type: application/json" \
  -d '{
    "datas3t_name": "my-datas3t",
    "first_datapoint_index": 1,
    "last_datapoint_index": 5000
  }'

# Complete aggregation (after processing returned URLs)
curl -X POST http://localhost:8765/api/v1/aggregate/complete \
  -H "Content-Type: application/json" \
  -d '{
    "aggregate_upload_id": 456
  }'

Client Library Usage

package main

import (
    "context"
    "fmt"

    "github.com/draganm/datas3t/client"
)

func main() {
    // Create client
    c := client.New("http://localhost:8765")
    
    // List datas3ts
    datas3ts, err := c.ListDatas3ts(context.Background())
    if err != nil {
        panic(err)
    }
    fmt.Printf("Found %d datas3ts\n", len(datas3ts))
    
    // Download specific datapoints
    response, err := c.PreSignDownloadForDatapoints(context.Background(), &client.PreSignDownloadForDatapointsRequest{
        Datas3tName:    "my-datas3t",
        FirstDatapoint: 1,
        LastDatapoint:  100,
    })
    if err != nil {
        panic(err)
    }
    
    // Use presigned URLs to download data directly from S3
    for _, segment := range response.DownloadSegments {
        // Download using segment.PresignedURL and segment.Range
    }
    
    // Import existing datas3ts from S3 bucket
    importResponse, err := c.ImportDatas3t(context.Background(), &client.ImportDatas3tRequest{
        BucketName: "my-bucket-config",
    })
    if err != nil {
        panic(err)
    }
    fmt.Printf("Imported %d datas3ts: %v\n", importResponse.ImportedCount, importResponse.ImportedDatas3ts)
    
    // Clear all dataranges from a datas3t
    clearResponse, err := c.ClearDatas3t(context.Background(), &client.ClearDatas3tRequest{
        Name: "my-datas3t",
    })
    if err != nil {
        panic(err)
    }
    fmt.Printf("Cleared datas3t: deleted %d dataranges, scheduled %d objects for deletion\n", 
        clearResponse.DatarangesDeleted, clearResponse.ObjectsScheduled)
    
    // Aggregate multiple dataranges into a single larger one
    err = c.AggregateDataRanges(context.Background(), "my-datas3t", 1, 5000, &client.AggregateOptions{
        MaxParallelism: 8,
        MaxRetries:     3,
        ProgressCallback: func(phase string, current, total int64) {
            fmt.Printf("Phase %s: %d/%d\n", phase, current, total)
        },
    })
    if err != nil {
        panic(err)
    }
}
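
To make the segment loop above concrete, here is a minimal sketch of fetching one segment with net/http; it assumes segment.Range is already formatted as an HTTP Range header value ("bytes=start-end"), which is an assumption rather than a documented contract:

package main

import (
    "context"
    "fmt"
    "io"
    "net/http"
    "os"
)

// downloadSegment streams one presigned segment into w.
func downloadSegment(ctx context.Context, url, rangeHeader string, w io.Writer) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    if rangeHeader != "" {
        req.Header.Set("Range", rangeHeader)
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusPartialContent {
        return fmt.Errorf("unexpected status: %s", resp.Status)
    }
    _, err = io.Copy(w, resp.Body)
    return err
}

func main() {
    // Hypothetical URL and range standing in for a returned DownloadSegment
    err := downloadSegment(context.Background(),
        "https://example.com/presigned", "bytes=0-1048575", os.Stdout)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}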

CLI Usage

The datas3t CLI provides a comprehensive command-line interface for managing buckets, datas3ts, datarange operations, and aggregation.

Building the CLI

# Build the CLI binary
nix develop -c go build -o datas3t ./cmd/datas3t

# Or run directly
nix develop -c go run ./cmd/datas3t [command]

Global Options

All commands support:

  • --server-url - Server URL (default: http://localhost:8765, env: DATAS3T_SERVER_URL)

Server Management

Start the Server

# Start the datas3t server
./datas3t server \
  --db-url "postgres://user:password@localhost:5432/datas3t" \
  --cache-dir "/path/to/cache" \
  --encryption-key "your-base64-encoded-key"

# Using environment variables
export DB_URL="postgres://user:password@localhost:5432/datas3t"
export CACHE_DIR="/path/to/cache"
export ENCRYPTION_KEY="your-encryption-key"
./datas3t server

Generate Encryption Key

# Generate a new AES-256 encryption key
./datas3t keygen

Bucket Management

Add S3 Bucket Configuration

./datas3t bucket add \
  --name my-bucket-config \
  --endpoint https://s3.amazonaws.com \
  --bucket my-data-bucket \
  --access-key ACCESS_KEY \
  --secret-key SECRET_KEY

Options:

  • --name - Bucket configuration name (required)
  • --endpoint - S3 endpoint (include https:// for TLS) (required)
  • --bucket - S3 bucket name (required)
  • --access-key - S3 access key (required)
  • --secret-key - S3 secret key (required)

List Bucket Configurations

# List all bucket configurations
./datas3t bucket list

# Output as JSON
./datas3t bucket list --json

Datas3t Management

Add New Datas3t

./datas3t datas3t add \
  --name my-dataset \
  --bucket my-bucket-config

Options:

  • --name - Datas3t name (required)
  • --bucket - Bucket configuration name (required)

List Datas3ts

# List all datas3ts with statistics
./datas3t datas3t list

# Output as JSON
./datas3t datas3t list --json

Import Existing Datas3ts

# Import datas3ts from S3 bucket
./datas3t datas3t import \
  --bucket my-bucket-config

# Output as JSON
./datas3t datas3t import \
  --bucket my-bucket-config \
  --json

Options:

  • --bucket - Bucket configuration name to scan for existing datas3ts (required)
  • --json - Output results as JSON

Clear Datas3t

# Clear all dataranges from a datas3t (with confirmation prompt)
./datas3t datas3t clear \
  --name my-dataset

# Clear without confirmation prompt
./datas3t datas3t clear \
  --name my-dataset \
  --force

Options:

  • --name - Datas3t name to clear (required)
  • --force - Skip confirmation prompt

What it does:

  • Removes all dataranges from the specified datas3t
  • Schedules all associated S3 objects (TAR files and indices) for deletion
  • Keeps the datas3t record itself (allows future uploads)
  • The datas3t remains in the database with zero dataranges and datapoints
  • S3 objects are then removed by the background key deletion service (see Key Deletion Service below)

TAR Upload Operations

Upload TAR File

./datas3t upload-tar \
  --datas3t my-dataset \
  --file /path/to/data.tar \
  --max-parallelism 8 \
  --max-retries 5

Options:

  • --datas3t - Datas3t name (required)
  • --file - Path to TAR file to upload (required)
  • --max-parallelism - Maximum concurrent uploads (default: 4)
  • --max-retries - Maximum retry attempts per chunk (default: 3)

Datarange Operations

Download Datapoints as TAR

./datas3t datarange download-tar \
  --datas3t my-dataset \
  --first-datapoint 1 \
  --last-datapoint 1000 \
  --output /path/to/downloaded.tar \
  --max-parallelism 8 \
  --max-retries 5 \
  --chunk-size 10485760

Options:

  • --datas3t - Datas3t name (required)
  • --first-datapoint - First datapoint to download (required)
  • --last-datapoint - Last datapoint to download (required)
  • --output - Output TAR file path (required)
  • --max-parallelism - Maximum concurrent downloads (default: 4)
  • --max-retries - Maximum retry attempts per chunk (default: 3)
  • --chunk-size - Download chunk size in bytes (default: 5MB)

Aggregation Operations

Aggregate Multiple Dataranges

./datas3t aggregate \
  --datas3t my-dataset \
  --first-datapoint 1 \
  --last-datapoint 5000 \
  --max-parallelism 4 \
  --max-retries 3

Options:

  • --datas3t - Datas3t name (required)
  • --first-datapoint - First datapoint index to include in aggregate (required)
  • --last-datapoint - Last datapoint index to include in aggregate (required)
  • --max-parallelism - Maximum number of concurrent operations (default: 4)
  • --max-retries - Maximum number of retry attempts per operation (default: 3)

What it does:

  • Downloads all source dataranges in the specified range
  • Merges them into a single TAR archive with continuous datapoint numbering
  • Uploads the merged archive to S3
  • Atomically replaces the original dataranges with the new aggregate
  • Validates that the datapoint range is fully covered by existing dataranges with no gaps

Optimization Operations

Optimize Datarange Storage

./datas3t optimize \
  --datas3t my-dataset \
  --dry-run \
  --min-score 2.0 \
  --target-size 2GB

Options:

  • --datas3t - Datas3t name (required)
  • --dry-run - Show optimization recommendations without executing them
  • --daemon - Run continuously, monitoring for optimization opportunities
  • --interval - Interval between optimization checks in daemon mode (default: 5m)
  • --min-score - Minimum AVS score required to perform aggregation (default: 1.0)
  • --target-size - Target size for aggregated files (default: 1GB)
  • --max-aggregate-size - Maximum size for aggregated files (default: 5GB)
  • --max-operations - Maximum number of aggregation operations per run (default: 10)
  • --max-parallelism - Maximum number of concurrent operations for each aggregation (default: 4)
  • --max-retries - Maximum number of retry attempts per operation (default: 3)

What it does:

  • Analyzes existing dataranges to identify optimization opportunities
  • Uses an Aggregation Value Score (AVS) algorithm to prioritize operations
  • Automatically performs beneficial aggregations using the existing aggregate functionality
  • Supports both one-time optimization and continuous monitoring modes

Optimization Strategies:

  • Small file aggregation: Combines many small files into larger ones
  • Adjacent ID range aggregation: Merges consecutive datapoint ranges
  • Size bucket aggregation: Groups similarly sized files together

Scoring Algorithm: Each potential aggregation is scored based on:

  • Objects reduced: Fewer files to manage (reduces S3 object count)
  • Size efficiency: How close the result is to the target size
  • Consecutive bonus: Bonus for adjacent datapoint ranges
  • Operation cost: Download/upload overhead consideration
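
As an illustration of how these factors might combine, here is a hypothetical scoring function in Go; the actual AVS formula and weights used by datas3t are not documented here, so treat this strictly as a sketch:

package main

import (
    "fmt"
    "math"
)

// scoreAggregation is an illustrative AVS-style score: it rewards reducing
// the object count, closeness to the target size, and adjacent ranges, and
// penalizes transfer cost. All weights are assumptions.
func scoreAggregation(objects int, resultSize, targetSize, transferBytes int64, consecutive bool) float64 {
    objectsReduced := float64(objects - 1)
    sizeEfficiency := math.Max(0, 1-math.Abs(float64(resultSize-targetSize))/float64(targetSize))
    bonus := 0.0
    if consecutive {
        bonus = 1.0
    }
    cost := float64(transferBytes) / float64(targetSize)
    return objectsReduced + sizeEfficiency + bonus - cost
}

func main() {
    // 20 small adjacent dataranges totalling ~1 GiB, moved twice (down + up)
    fmt.Printf("score: %.2f\n", scoreAggregation(20, 1<<30, 1<<30, 2<<30, true))
}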

Example Usage:

# One-time optimization with dry-run to see recommendations
./datas3t optimize \
  --datas3t my-dataset \
  --dry-run

# Execute optimization with custom thresholds
./datas3t optimize \
  --datas3t my-dataset \
  --min-score 2.0 \
  --target-size 2GB \
  --max-operations 5

# Continuous monitoring mode
./datas3t optimize \
  --datas3t my-dataset \
  --daemon \
  --interval 5m

# Show all available optimization opportunities
./datas3t optimize \
  --datas3t my-dataset \
  --dry-run \
  --min-score 0.5 \
  --max-operations 20

Benefits:

  • Intelligent optimization: Automatically identifies the best aggregation opportunities
  • Cost reduction: Reduces S3 object count and storage costs
  • Performance improvement: Fewer, larger files improve download performance
  • Hands-off operation: Can run continuously to maintain optimal storage layout
  • Safe operations: Uses existing battle-tested aggregation system
  • Flexible configuration: Customizable thresholds and strategies

Manual Aggregation Example

# Example: You have uploaded multiple small TAR files and want to consolidate them
# First, check your current dataranges
./datas3t datas3t list

# Aggregate the first 10,000 datapoints into a single larger datarange
./datas3t aggregate \
  --datas3t my-dataset \
  --first-datapoint 1 \
  --last-datapoint 10000 \
  --max-parallelism 6

# Check the result - you should see fewer, larger dataranges
./datas3t datas3t list

Benefits:

  • Reduces the number of S3 objects (lower storage costs)
  • Improves download performance for large ranges
  • Maintains all data integrity and accessibility
  • Can be run multiple times to further consolidate data

Complete Workflow Example

# 1. Generate encryption key
./datas3t keygen
export ENCRYPTION_KEY="generated-key-here"

# 2. Start server
./datas3t server &

# 3. Add bucket configuration
./datas3t bucket add \
  --name production-bucket \
  --endpoint https://s3.amazonaws.com \
  --bucket my-production-data \
  --access-key "$AWS_ACCESS_KEY" \
  --secret-key "$AWS_SECRET_KEY"

# 4. Create datas3t
./datas3t datas3t add \
  --name image-dataset \
  --bucket production-bucket

# 5. Upload data
./datas3t upload-tar \
  --datas3t image-dataset \
  --file ./images-batch-1.tar

# 6. List datasets
./datas3t datas3t list

# 7. Import existing datas3ts (disaster recovery/migration)
./datas3t datas3t import \
  --bucket production-bucket

# 8. Download specific range
./datas3t datarange download-tar \
  --datas3t image-dataset \
  --first-datapoint 100 \
  --last-datapoint 200 \
  --output ./images-100-200.tar

# 9. Optimize datarange storage automatically
./datas3t optimize \
  --datas3t image-dataset \
  --dry-run

# 10. Execute optimization
./datas3t optimize \
  --datas3t image-dataset \
  --min-score 2.0

# 11. Or aggregate specific ranges manually
./datas3t aggregate \
  --datas3t image-dataset \
  --first-datapoint 1 \
  --last-datapoint 10000

# 12. Clear all data from a datas3t (keeping the datas3t record)
./datas3t datas3t clear \
  --name image-dataset \
  --force

# 13. Check results after optimization/aggregation/clear
./datas3t datas3t list

Environment Variables

All CLI commands support these environment variables:

  • DATAS3T_SERVER_URL - Default server URL for all commands
  • DB_URL - Database connection string (server command)
  • CACHE_DIR - Cache directory path (server command)
  • ENCRYPTION_KEY - Base64-encoded encryption key (server command)

File Naming Convention

Datapoints must follow the naming pattern %020d.<extension>, i.e. a 20-digit zero-padded index plus an extension.

Valid:

  • 00000000000000000001.txt
  • 00000000000000000042.jpg
  • 00000000000000001337.json

Invalid:

  • file1.txt
  • 1.txt
  • 001.txt
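
A small Go sketch for generating and validating such names; the regular expression is an assumption derived from the %020d pattern:

package main

import (
    "fmt"
    "regexp"
)

// datapointName formats a datapoint index per the %020d.<extension> convention.
func datapointName(index uint64, ext string) string {
    return fmt.Sprintf("%020d.%s", index, ext)
}

// validName is an assumed validation pattern: exactly 20 digits, a dot,
// and a non-empty extension.
var validName = regexp.MustCompile(`^\d{20}\.[A-Za-z0-9]+$`)

func main() {
    fmt.Println(datapointName(42, "jpg"))       // 00000000000000000042.jpg
    fmt.Println(validName.MatchString("1.txt")) // false
}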

Performance Characteristics

Storage Efficiency

  • Small Files: 99%+ storage efficiency vs individual S3 objects
  • Large Datas3ts: Linear scaling with datas3t size

Access Performance

  • Index Lookup: O(1) file location within TAR
  • Range Queries: Optimized byte-range requests
  • Caching: Local disk cache for frequently accessed indices

Scalability

  • Concurrent Operations: Supports parallel uploads/downloads
  • Large Datas3ts: Tested with millions of datapoints
  • Distributed: Stateless server design for horizontal scaling

Contributing

Currently we are not accepting contributions to this project.

Architecture Details

Database Schema

  • s3_buckets: S3 configuration storage
  • datas3ts: Datas3t metadata
  • dataranges: TAR archive metadata and byte ranges
  • datarange_uploads: Temporary upload state management
  • aggregate_uploads: Aggregation operation tracking and state management
  • keys_to_delete: Immediate deletion queue for obsolete S3 objects

TAR Index Format

Binary format with 16-byte entries per file:

  • Bytes 0-7: File position in TAR (big-endian uint64)
  • Bytes 8-9: Header blocks count (big-endian uint16)
  • Bytes 10-15: File size (big-endian, 48-bit)
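
Because each entry is a fixed 16 bytes, the entry for datapoint n sits at byte offset (n - first_datapoint) * 16 in the index, which is what makes the O(1) index lookup possible. A minimal Go sketch of decoding one entry, following the layout described above (the surrounding lookup code is an assumption):

package main

import (
    "encoding/binary"
    "fmt"
)

// indexEntry mirrors the documented 16-byte index record.
type indexEntry struct {
    Position     uint64 // byte offset of the file's header within the TAR
    HeaderBlocks uint16 // number of 512-byte TAR header blocks
    FileSize     uint64 // file size, stored as 48 bits on disk
}

// parseEntry decodes a single 16-byte index record (all fields big-endian).
func parseEntry(b [16]byte) indexEntry {
    return indexEntry{
        Position:     binary.BigEndian.Uint64(b[0:8]),
        HeaderBlocks: binary.BigEndian.Uint16(b[8:10]),
        // 48-bit size: left-pad with two zero bytes to read as uint64.
        FileSize: binary.BigEndian.Uint64(append([]byte{0, 0}, b[10:16]...)),
    }
}

func main() {
    var raw [16]byte
    binary.BigEndian.PutUint64(raw[0:8], 512) // header at byte 512
    binary.BigEndian.PutUint16(raw[8:10], 1)  // one header block
    raw[15] = 42                              // size: 42 bytes
    fmt.Printf("%+v\n", parseEntry(raw))
}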

Caching Strategy

  • Memory: In-memory index objects during operations
  • Disk: Persistent cache for TAR indices
  • LRU Eviction: Automatic cleanup based on access patterns
  • Cache Keys: SHA-256 hash of datarange metadata
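
As a sketch of the cache-key idea, assuming the hash covers the datas3t name and datapoint bounds (the exact fields datas3t hashes are not documented here):

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// cacheKey derives a stable, filesystem-safe key from datarange metadata.
// The choice of fields is an assumption for illustration.
func cacheKey(datas3tName string, first, last uint64) string {
    sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%020d-%020d", datas3tName, first, last)))
    return hex.EncodeToString(sum[:])
}

func main() {
    fmt.Println(cacheKey("my-datas3t", 1, 1000))
}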

Key Deletion Service

  • Background Worker: Automatic cleanup of obsolete S3 objects
  • Batch Processing: Processes 5 deletion requests at a time
  • Immediate Processing: No delay between queuing and deletion
  • Error Handling: Retries failed deletions, logs errors without blocking
  • Database Consistency: Atomic removal from deletion queue after successful S3 deletion
  • Graceful Shutdown: Respects context cancellation for clean server shutdown
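
A hypothetical shape for such a worker, with fetchBatch and deleteKey standing in for the real database and S3 calls (the actual implementation is not shown here):

package main

import (
    "context"
    "log"
    "time"
)

// fetchBatch and deleteKey are hypothetical stubs for the real queue pop
// and S3 DeleteObject operations.
func fetchBatch(ctx context.Context, n int) ([]string, error) { return nil, nil }
func deleteKey(ctx context.Context, key string) error         { return nil }

// deletionWorker drains the deletion queue in batches of 5 until the
// context is cancelled; failures are logged and left queued for retry.
func deletionWorker(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return // graceful shutdown
        default:
        }
        keys, err := fetchBatch(ctx, 5)
        if err != nil || len(keys) == 0 {
            time.Sleep(time.Second) // queue empty or transient error; back off
            continue
        }
        for _, k := range keys {
            if err := deleteKey(ctx, k); err != nil {
                log.Printf("delete %s failed, will retry: %v", k, err)
            }
        }
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()
    deletionWorker(ctx)
}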

License

This project is licensed under the AGPLv3 license - see the LICENSE file for details.

Support

For questions or issues:

  • Open an issue on GitHub
  • Check existing documentation
  • Review test files for usage examples

Installation

git clone https://github.com/draganm/datas3t.git
cd datas3t
nix develop -c make build

Configuration

Database Setup

Create a PostgreSQL database and set the connection string:

export DB_URL="postgres://user:password@localhost:5432/datas3t"
export CACHE_DIR="/path/to/cache"

S3 Credential Encryption

Important: S3 credentials are encrypted at rest using AES-256-GCM with unique random nonces.

The encryption system provides the following security features:

  • AES-256-GCM encryption: Industry-standard authenticated encryption
  • Unique nonces: Each encryption uses a random nonce, so identical credentials produce different encrypted values
  • Key derivation: Input keys are SHA-256 hashed to ensure proper 32-byte key size
  • Authenticated encryption: Protects against tampering and ensures data integrity
  • Transparent operation: All S3 operations automatically encrypt/decrypt credentials
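
A minimal Go sketch of the described scheme; prepending the nonce to the ciphertext is an assumption about the storage format:

package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "crypto/sha256"
    "fmt"
)

// encryptCredential follows the description above: SHA-256 key derivation,
// AES-256-GCM, and a fresh random nonce per call. Storing the nonce as a
// prefix of the ciphertext is an assumption for this sketch.
func encryptCredential(key, plaintext []byte) ([]byte, error) {
    derived := sha256.Sum256(key) // ensure a proper 32-byte key
    block, err := aes.NewCipher(derived[:])
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := rand.Read(nonce); err != nil {
        return nil, err
    }
    return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func main() {
    a, _ := encryptCredential([]byte("demo-key"), []byte("SECRET_KEY"))
    b, _ := encryptCredential([]byte("demo-key"), []byte("SECRET_KEY"))
    fmt.Println(len(a), string(a) != string(b)) // unique nonces: different ciphertexts
}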

Key Generation

Generate a cryptographically secure 256-bit encryption key:

nix develop -c go run ./cmd/datas3t keygen

This generates a 32-byte (256-bit) random key encoded as base64. Store this key securely and set it as an environment variable:

export ENCRYPTION_KEY="your-generated-key-here"

Alternative Key Generation

You can also use datas3t keygen if you have built the binary:

./datas3t keygen
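
Since input keys are SHA-256 hashed before use, any base64-encoded 256-bit random value should also serve as a key; for example (an alternative based on that key-derivation note, not a documented path):

# Assumption: 32 random bytes, base64-encoded, from openssl
openssl rand -base64 32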

Critical Security Notes:

  • Keep this key secure and backed up! If you lose it, you won't be able to decrypt your stored S3 credentials
  • The same key must be used consistently across server restarts
  • Changing the key will make existing encrypted credentials unreadable
  • Store the key separately from your database backups for additional security

Starting the Server

./datas3t server --db-url "$DB_URL" --cache-dir "$CACHE_DIR" --encryption-key "$ENCRYPTION_KEY"

Or using environment variables:

export DB_URL="postgres://user:password@localhost:5432/datas3t"
export CACHE_DIR="/path/to/cache"  
export ENCRYPTION_KEY="your-encryption-key"
./datas3t server
