Overview

Relevant source files

Apache Fluss (Incubating) is a streaming storage system designed for real-time analytics and AI workloads. It serves as the real-time data layer in Lakehouse architectures, bridging the gap between high-throughput streaming ingestion and analytical query engines.

What is Apache Fluss?

Traditional data lakes excel at batch analytics but struggle with real-time requirements. Streaming systems like Apache Kafka provide low-latency data ingestion but lack efficient point queries and table semantics. Fluss fills this gap by providing:

Streaming Storage with Table Semantics

Sub-second data freshness for real-time analytics
Efficient point lookups for primary key queries (high QPS)
Columnar streaming using Apache Arrow for I/O optimization
Mutable tables with UPDATE/DELETE support
Automatic lakehouse tiering to Paimon, Iceberg, or Lance

Key Use Cases:

Real-time dashboards and metrics
Feature stores for ML/AI pipelines
Real-time customer 360 views
Streaming ETL and data warehouses
Low-latency CDC (Change Data Capture)

Fluss (German: "river", pronounced /flus/) enables streaming data to flow continuously into data lakes, like a river converging into larger bodies of water.

Sources: README.md39-56 website/docs/intro.md1-34

High-Level Architecture

Apache Fluss uses a three-tier architecture that separates concerns across client, cluster, and storage layers. The system provides both real-time streaming and lakehouse integration through a unified table abstraction.

Overall System Architecture

Architecture Highlights:

Layer	Components	Key Responsibilities
Client	`Connection`, `Admin`, Flink connectors	DDL/DML operations, catalog integration
Control Plane	`CoordinatorServer`, ZooKeeper	Metadata, cluster coordination, security
Data Plane	`TabletServer` instances	Data storage, replication, query serving
Storage	Local disk, object storage, lakehouse	Hot/warm/cold data tiers

Client Discovery:

Clients connect via bootstrap.servers (any CoordinatorServer or TabletServer address)
Automatic cluster topology discovery after initial connection
Load-balanced requests across TabletServers

Sources: website/docs/install-deploy/overview.md12-135 website/docs/engine-flink/getting-started.md86-107 README.md39-56 Diagram 1 from high-level diagrams

Key Concepts

Table Types

Fluss provides two table types optimized for different workloads, each backed by different storage implementations:

Table Type Architecture

Feature	Log Table	Primary Key Table
Storage	`LogTablet` only	`LogTablet` + `KvTablet` (dual)
Write Operations	INSERT (append)	INSERT/UPDATE/DELETE (upsert)
Primary Use Case	Event streams, time-series	Dimension tables, aggregations
Point Queries	❌ No	✅ Yes (via `KvTablet`)
Streaming Reads	✅ Yes	✅ Yes (via changelog)
Storage Format	Arrow columnar	Arrow (log) + RocksDB (KV)

Example DDL:

For detailed design patterns and use cases, see Table Types and Data Model.

Sources: website/docs/engine-flink/ddl.md61-91 website/docs/quickstart/flink.md237-268 Diagram 2 from high-level diagrams

Data Distribution

Fluss distributes data using buckets (for parallelism) and partitions (for lifecycle management). Each bucket is the unit of replication and assignment to TabletServer nodes.

Data Distribution Model

Key Concepts:

Concept	Description	Configuration
Bucket	Unit of distribution and replication	`bucket.num` (required)
Bucket Key	Field(s) used for bucketing	`bucket.key` (default: primary key)
Partition	Lifecycle management unit	`PARTITIONED BY (field)`
Replication Factor	Copies per bucket	`bucket.replication.factor` (default: 1)
ISR	In-sync replicas eligible for leadership	Managed by `ReplicaManager`

Partition Management:

Manual partitions: ALTER TABLE ADD PARTITION (dt='2024-01-01')
Auto-partitioning: table.auto-partition.enabled = 'true' with table.auto-partition.time-unit
Multi-field partitions: Composite keys like (region, dt)

For detailed bucketing and partitioning strategies, see Data Distribution and Replication.

Sources: website/docs/table-design/data-distribution/partitioning.md1-95 website/docs/engine-flink/options.md66-74 Diagram 6 from high-level diagrams

Storage Architecture

Fluss implements a three-tier storage model to optimize for different data access patterns:

Three-Tier Storage Model

Tier Characteristics:

Tier	Storage	Latency	Use Case	Configuration
Hot	Local disk	Microseconds	Recent data, point queries	`data.dir`
Warm	Object storage	Milliseconds	Older data, recovery	`remote.data.dir`
Cold	Lakehouse	Seconds	Analytics, reporting	`datalake.*` properties

Tiering Configuration Example:

Table-Level Tiering:

Sources: website/docs/maintenance/configuration.md36-39 website/docs/quickstart/lakehouse.md36-114 Diagram 3 from high-level architecture

Integration with Compute Engines

Apache Flink Integration

Fluss provides deep integration with Apache Flink through the Flink Connector (fluss-flink-*.jar), which implements Flink's catalog, source, and sink interfaces.

Flink Connector Architecture

Connector Capabilities:

Operation	Support	Implementation
DDL	CREATE/ALTER/DROP TABLE, partitions	`FlinkCatalog` → `Admin`
Streaming Read	✅ Full/incremental/latest	`FlinkTableSource`
Batch Read	✅ Snapshot, point lookup, LIMIT	`FlinkTableSource`
Write	✅ INSERT/UPDATE/DELETE	`FlinkTableSink`
Lookup Join	✅ Async/sync, prefix lookup	`FlinkLookupFunction`

Catalog Creation:

Supported Flink Versions: 1.18, 1.19, 1.20, 2.1, 2.2

For detailed Flink operations, see Flink Integration.

Sources: website/docs/engine-flink/getting-started.md1-232 website/docs/engine-flink/ddl.md9-33 Diagram 3 from high-level diagrams

Lakehouse Integration

Fluss provides automatic tiering to lakehouse formats through a Flink-based tiering service, unifying real-time and historical data under a single table abstraction.

Lakehouse Tiering Architecture

Tiering Service Features:

Feature	Description
Continuous Tiering	Flink job subscribes to Fluss CDC and writes to lakehouse
Exactly-Once	Offset tracking via CoordinatorServer ensures no data loss
Union Read	Flink queries automatically merge real-time + historical data
Lake-First	Tables created in lake catalog before Fluss registration
System Columns	`__bucket`, `__offset`, `__timestamp` preserve lineage

Supported Lake Formats:

Apache Paimon: Streaming-first, LSM architecture
Apache Iceberg: MOR strategy, ecosystem support
Lance: Arrow-native, ML/AI workloads

Configuration:

For detailed lakehouse setup, see Streaming Lakehouse Integration.

Sources: website/docs/quickstart/lakehouse.md1-290 website/docs/streaming-lakehouse/overview.md1-47 Diagram 4 from high-level diagrams

Configuration and Deployment

Configuration System

Fluss uses a hierarchical configuration system with multiple levels of precedence:

Configuration Hierarchy

Key Configuration Files:

conf/server.yaml - Main configuration (parsed on startup)
ConfigOptions.java - Option definitions with defaults and validation
ZooKeeper paths - Runtime cluster and table configuration

Example Configuration:

Sources: website/docs/maintenance/configuration.md1-130 fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java1-460

Deployment Options

Fluss supports multiple deployment modes for different environments:

Deployment	Use Case	Setup Complexity	HA Support
Local Cluster	Development, testing	Low (single command)	No
Distributed Cluster	Production	Medium (manual setup)	Yes (with ZK)
Docker/Compose	Quick start, CI/CD	Low (docker compose)	No
Kubernetes/Helm	Cloud production	Medium (Helm charts)	Yes

Quick Start:

Docker Compose Example:

Sources: website/docs/install-deploy/overview.md137-145 website/docs/install-deploy/deploying-local-cluster.md1-76 website/docs/install-deploy/deploying-with-docker.md1-367

Observability and Monitoring

Fluss provides comprehensive observability through metrics, logs, and integration with standard monitoring tools.

Observability Stack

Metrics Configuration:

Key Metric Categories:

Coordinator Metrics: Cluster state, metadata operations
TabletServer Metrics: Write throughput, read latency, replication lag
Table Metrics: Per-table/bucket statistics
RocksDB Metrics: KV store performance

Logging Configuration:

Default: Log4j 2 (conf/log4j.properties)
Alternative: Logback (conf/logback.xml)
Log locations: log/ directory in Fluss installation
Log rotation: Automatic daily rotation

Sources: website/docs/maintenance/observability/quickstart.md1-216 website/docs/maintenance/observability/metric-reporters.md1-125 website/docs/maintenance/observability/logging.md1-70

Getting Started

To start using Apache Fluss:

1. Quick Start with Docker:

2. Create Your First Table:

3. Enable Lakehouse Tiering:

Where to Go Next:

This overview provides foundational concepts. For deeper exploration:

Topic	Wiki Page	Description
Architecture	Architecture	CoordinatorServer, TabletServer, ReplicaManager details
Table Design	Table Types and Data Model	Schema design, primary keys, partitioning strategies
Storage	Storage Architecture	LogTablet, KvTablet, Arrow format, RocksDB internals
Replication	Data Distribution and Replication	ISR, leader election, bucket assignment
Configuration	Configuration System	ConfigOptions, server.yaml, dynamic reconfiguration
Flink Integration	Flink Integration	DDL/DML operations, lookup joins, streaming/batch reads
Lakehouse	Streaming Lakehouse	Paimon/Iceberg/Lance integration, union reads
Deployment	Deployment Options	Local cluster, distributed cluster, Kubernetes
Monitoring	Observability	Metrics, logging, Prometheus, Grafana

Quick Links:

Quickstart - Hands-on tutorial
Quickstart - Lakehouse integration guide
Downloads - Latest releases and connectors

Sources: website/docs/quickstart/flink.md1-411 website/docs/intro.md35-42 README.md57-72

Key Features

Apache Fluss provides several unique capabilities that distinguish it as a streaming storage system:

Core Features

Feature Highlights:

Feature	Benefit	Implementation
Sub-Second Freshness	Real-time analytics	Arrow columnar format, tail reads
Column Pruning	10x read speedup	`ColumnarRow` with selective column reads
High-QPS Lookups	Dimension enrichment	`KvTablet` with RocksDB
Union Reads	Real-time + historical	`UnionSplitReader` in Flink connector
Partial Updates	Efficient upserts	Pre-write buffer in `KvTablet`
Exactly-Once	No data loss	ISR replication, checkpoint-based tiering

For architectural details, see Architecture.

Sources: README.md48-56 website/docs/intro.md1-34 website/src/components/HomepageFeatures/index.tsx30-67

Where to Go Next

This overview provides a foundation for understanding Apache Fluss. For deeper dives:

Getting Started: See Flink Quickstart for hands-on tutorial
Architecture Details: See Architecture for component internals
Table Design: See Table Types and Data Distribution for data modeling
Deployment: See Configuration and Deployment for production setup
Flink Integration: See Flink Integration for SQL operations and connector options
Lakehouse: See Streaming Lakehouse Integration for Paimon/Iceberg/Lance integration
Operations: See Operations and Maintenance for monitoring and management

Sources: website/docs/intro.md35-42

Overview

Relevant source files

What is Apache Fluss?

Streaming Storage with Table Semantics

Sub-second data freshness for real-time analytics
Efficient point lookups for primary key queries (high QPS)
Columnar streaming using Apache Arrow for I/O optimization
Mutable tables with UPDATE/DELETE support
Automatic lakehouse tiering to Paimon, Iceberg, or Lance

Key Use Cases:

Real-time dashboards and metrics
Feature stores for ML/AI pipelines
Real-time customer 360 views
Streaming ETL and data warehouses
Low-latency CDC (Change Data Capture)

Fluss (German: "river", pronounced /flus/) enables streaming data to flow continuously into data lakes, like a river converging into larger bodies of water.

Sources: README.md39-56 website/docs/intro.md1-34

High-Level Architecture

Overall System Architecture

Architecture Highlights:

Layer	Components	Key Responsibilities
Client	`Connection`, `Admin`, Flink connectors	DDL/DML operations, catalog integration
Control Plane	`CoordinatorServer`, ZooKeeper	Metadata, cluster coordination, security
Data Plane	`TabletServer` instances	Data storage, replication, query serving
Storage	Local disk, object storage, lakehouse	Hot/warm/cold data tiers

Client Discovery:

Clients connect via bootstrap.servers (any CoordinatorServer or TabletServer address)
Automatic cluster topology discovery after initial connection
Load-balanced requests across TabletServers

Sources: website/docs/install-deploy/overview.md12-135 website/docs/engine-flink/getting-started.md86-107 README.md39-56 Diagram 1 from high-level diagrams

Key Concepts

Table Types

Fluss provides two table types optimized for different workloads, each backed by different storage implementations:

Table Type Architecture

Feature	Log Table	Primary Key Table
Storage	`LogTablet` only	`LogTablet` + `KvTablet` (dual)
Write Operations	INSERT (append)	INSERT/UPDATE/DELETE (upsert)
Primary Use Case	Event streams, time-series	Dimension tables, aggregations
Point Queries	❌ No	✅ Yes (via `KvTablet`)
Streaming Reads	✅ Yes	✅ Yes (via changelog)
Storage Format	Arrow columnar	Arrow (log) + RocksDB (KV)

Example DDL:

For detailed design patterns and use cases, see Table Types and Data Model.

Sources: website/docs/engine-flink/ddl.md61-91 website/docs/quickstart/flink.md237-268 Diagram 2 from high-level diagrams

Data Distribution

Fluss distributes data using buckets (for parallelism) and partitions (for lifecycle management). Each bucket is the unit of replication and assignment to TabletServer nodes.

Data Distribution Model

Key Concepts:

Concept	Description	Configuration
Bucket	Unit of distribution and replication	`bucket.num` (required)
Bucket Key	Field(s) used for bucketing	`bucket.key` (default: primary key)
Partition	Lifecycle management unit	`PARTITIONED BY (field)`
Replication Factor	Copies per bucket	`bucket.replication.factor` (default: 1)
ISR	In-sync replicas eligible for leadership	Managed by `ReplicaManager`

Partition Management:

Manual partitions: ALTER TABLE ADD PARTITION (dt='2024-01-01')
Auto-partitioning: table.auto-partition.enabled = 'true' with table.auto-partition.time-unit
Multi-field partitions: Composite keys like (region, dt)

For detailed bucketing and partitioning strategies, see Data Distribution and Replication.

Sources: website/docs/table-design/data-distribution/partitioning.md1-95 website/docs/engine-flink/options.md66-74 Diagram 6 from high-level diagrams

Storage Architecture

Fluss implements a three-tier storage model to optimize for different data access patterns:

Three-Tier Storage Model

Tier Characteristics:

Tier	Storage	Latency	Use Case	Configuration
Hot	Local disk	Microseconds	Recent data, point queries	`data.dir`
Warm	Object storage	Milliseconds	Older data, recovery	`remote.data.dir`
Cold	Lakehouse	Seconds	Analytics, reporting	`datalake.*` properties

Tiering Configuration Example:

Table-Level Tiering:

Sources: website/docs/maintenance/configuration.md36-39 website/docs/quickstart/lakehouse.md36-114 Diagram 3 from high-level architecture

Integration with Compute Engines

Apache Flink Integration

Fluss provides deep integration with Apache Flink through the Flink Connector (fluss-flink-*.jar), which implements Flink's catalog, source, and sink interfaces.

Flink Connector Architecture

Connector Capabilities:

Operation	Support	Implementation
DDL	CREATE/ALTER/DROP TABLE, partitions	`FlinkCatalog` → `Admin`
Streaming Read	✅ Full/incremental/latest	`FlinkTableSource`
Batch Read	✅ Snapshot, point lookup, LIMIT	`FlinkTableSource`
Write	✅ INSERT/UPDATE/DELETE	`FlinkTableSink`
Lookup Join	✅ Async/sync, prefix lookup	`FlinkLookupFunction`

Catalog Creation:

Supported Flink Versions: 1.18, 1.19, 1.20, 2.1, 2.2

For detailed Flink operations, see Flink Integration.

Sources: website/docs/engine-flink/getting-started.md1-232 website/docs/engine-flink/ddl.md9-33 Diagram 3 from high-level diagrams

Lakehouse Integration

Fluss provides automatic tiering to lakehouse formats through a Flink-based tiering service, unifying real-time and historical data under a single table abstraction.

Lakehouse Tiering Architecture

Tiering Service Features:

Feature	Description
Continuous Tiering	Flink job subscribes to Fluss CDC and writes to lakehouse
Exactly-Once	Offset tracking via CoordinatorServer ensures no data loss
Union Read	Flink queries automatically merge real-time + historical data
Lake-First	Tables created in lake catalog before Fluss registration
System Columns	`__bucket`, `__offset`, `__timestamp` preserve lineage

Supported Lake Formats:

Apache Paimon: Streaming-first, LSM architecture
Apache Iceberg: MOR strategy, ecosystem support
Lance: Arrow-native, ML/AI workloads

Configuration:

For detailed lakehouse setup, see Streaming Lakehouse Integration.

Sources: website/docs/quickstart/lakehouse.md1-290 website/docs/streaming-lakehouse/overview.md1-47 Diagram 4 from high-level diagrams

Configuration and Deployment

Configuration System

Fluss uses a hierarchical configuration system with multiple levels of precedence:

Configuration Hierarchy

Key Configuration Files:

conf/server.yaml - Main configuration (parsed on startup)
ConfigOptions.java - Option definitions with defaults and validation
ZooKeeper paths - Runtime cluster and table configuration

Example Configuration:

Sources: website/docs/maintenance/configuration.md1-130 fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java1-460

Deployment Options

Fluss supports multiple deployment modes for different environments:

Deployment	Use Case	Setup Complexity	HA Support
Local Cluster	Development, testing	Low (single command)	No
Distributed Cluster	Production	Medium (manual setup)	Yes (with ZK)
Docker/Compose	Quick start, CI/CD	Low (docker compose)	No
Kubernetes/Helm	Cloud production	Medium (Helm charts)	Yes

Quick Start:

Docker Compose Example:

Sources: website/docs/install-deploy/overview.md137-145 website/docs/install-deploy/deploying-local-cluster.md1-76 website/docs/install-deploy/deploying-with-docker.md1-367

Observability and Monitoring

Fluss provides comprehensive observability through metrics, logs, and integration with standard monitoring tools.

Observability Stack

Metrics Configuration:

Key Metric Categories:

Coordinator Metrics: Cluster state, metadata operations
TabletServer Metrics: Write throughput, read latency, replication lag
Table Metrics: Per-table/bucket statistics
RocksDB Metrics: KV store performance

Logging Configuration:

Default: Log4j 2 (conf/log4j.properties)
Alternative: Logback (conf/logback.xml)
Log locations: log/ directory in Fluss installation
Log rotation: Automatic daily rotation

Sources: website/docs/maintenance/observability/quickstart.md1-216 website/docs/maintenance/observability/metric-reporters.md1-125 website/docs/maintenance/observability/logging.md1-70

Getting Started

To start using Apache Fluss:

1. Quick Start with Docker:

2. Create Your First Table:

3. Enable Lakehouse Tiering:

Where to Go Next:

This overview provides foundational concepts. For deeper exploration:

Topic	Wiki Page	Description
Architecture	Architecture	CoordinatorServer, TabletServer, ReplicaManager details
Table Design	Table Types and Data Model	Schema design, primary keys, partitioning strategies
Storage	Storage Architecture	LogTablet, KvTablet, Arrow format, RocksDB internals
Replication	Data Distribution and Replication	ISR, leader election, bucket assignment
Configuration	Configuration System	ConfigOptions, server.yaml, dynamic reconfiguration
Flink Integration	Flink Integration	DDL/DML operations, lookup joins, streaming/batch reads
Lakehouse	Streaming Lakehouse	Paimon/Iceberg/Lance integration, union reads
Deployment	Deployment Options	Local cluster, distributed cluster, Kubernetes
Monitoring	Observability	Metrics, logging, Prometheus, Grafana

Quick Links:

Quickstart - Hands-on tutorial
Quickstart - Lakehouse integration guide
Downloads - Latest releases and connectors

Sources: website/docs/quickstart/flink.md1-411 website/docs/intro.md35-42 README.md57-72

Key Features

Apache Fluss provides several unique capabilities that distinguish it as a streaming storage system:

Core Features

Feature Highlights:

Feature	Benefit	Implementation
Sub-Second Freshness	Real-time analytics	Arrow columnar format, tail reads
Column Pruning	10x read speedup	`ColumnarRow` with selective column reads
High-QPS Lookups	Dimension enrichment	`KvTablet` with RocksDB
Union Reads	Real-time + historical	`UnionSplitReader` in Flink connector
Partial Updates	Efficient upserts	Pre-write buffer in `KvTablet`
Exactly-Once	No data loss	ISR replication, checkpoint-based tiering

For architectural details, see Architecture.

Sources: README.md48-56 website/docs/intro.md1-34 website/src/components/HomepageFeatures/index.tsx30-67

Where to Go Next

This overview provides a foundation for understanding Apache Fluss. For deeper dives:

Getting Started: See Flink Quickstart for hands-on tutorial
Architecture Details: See Architecture for component internals
Table Design: See Table Types and Data Distribution for data modeling
Deployment: See Configuration and Deployment for production setup
Flink Integration: See Flink Integration for SQL operations and connector options
Lakehouse: See Streaming Lakehouse Integration for Paimon/Iceberg/Lance integration
Operations: See Operations and Maintenance for monitoring and management

Sources: website/docs/intro.md35-42

Overview

What is Apache Fluss?

High-Level Architecture

Key Concepts

Table Types

Data Distribution

Storage Architecture

Integration with Compute Engines

Apache Flink Integration

Lakehouse Integration

Configuration and Deployment

Configuration System

Deployment Options

Observability and Monitoring

Getting Started

Key Features

Where to Go Next

On this page

Overview

What is Apache Fluss?

High-Level Architecture

Key Concepts

Table Types

Data Distribution

Storage Architecture

Integration with Compute Engines

Apache Flink Integration

Lakehouse Integration

Configuration and Deployment

Configuration System

Deployment Options

Observability and Monitoring

Getting Started

Key Features

Where to Go Next

On this page