Apache Fluss (Incubating) is a streaming storage system designed for real-time analytics and AI workloads. It serves as the real-time data layer in Lakehouse architectures, bridging the gap between high-throughput streaming ingestion and analytical query engines.
Traditional data lakes excel at batch analytics but struggle with real-time requirements. Streaming systems like Apache Kafka provide low-latency data ingestion but lack efficient point queries and table semantics. Fluss fills this gap by providing:
Streaming Storage with Table Semantics
Key Use Cases:
Fluss (German: "river", pronounced /flʊs/) enables streaming data to flow continuously into data lakes, like a river converging into larger bodies of water.
Sources: README.md39-56 website/docs/intro.md1-34
Apache Fluss uses a three-tier architecture that separates concerns across client, cluster, and storage layers. The system provides both real-time streaming and lakehouse integration through a unified table abstraction.
Overall System Architecture
Architecture Highlights:
| Layer | Components | Key Responsibilities |
|---|---|---|
| Client | Connection, Admin, Flink connectors | DDL/DML operations, catalog integration |
| Control Plane | CoordinatorServer, ZooKeeper | Metadata, cluster coordination, security |
| Data Plane | TabletServer instances | Data storage, replication, query serving |
| Storage | Local disk, object storage, lakehouse | Hot/warm/cold data tiers |
Client Discovery:
Clients connect via bootstrap.servers (any CoordinatorServer or TabletServer address).
Sources: website/docs/install-deploy/overview.md12-135 website/docs/engine-flink/getting-started.md86-107 README.md39-56 Diagram 1 from high-level diagrams
Fluss provides two table types optimized for different workloads, each backed by different storage implementations:
Table Type Architecture
| Feature | Log Table | Primary Key Table |
|---|---|---|
| Storage | LogTablet only | LogTablet + KvTablet (dual) |
| Write Operations | INSERT (append) | INSERT/UPDATE/DELETE (upsert) |
| Primary Use Case | Event streams, time-series | Dimension tables, aggregations |
| Point Queries | ❌ No | ✅ Yes (via KvTablet) |
| Streaming Reads | ✅ Yes | ✅ Yes (via changelog) |
| Storage Format | Arrow columnar | Arrow (log) + RocksDB (KV) |
Example DDL:
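The original example was not preserved here; a minimal sketch in Flink SQL, with illustrative table and column names:

```sql
-- Log Table: append-only, no primary key (event streams)
CREATE TABLE log_orders (
    order_id BIGINT,
    item_id  BIGINT,
    amount   INT
) WITH ('bucket.num' = '4');

-- Primary Key Table: supports upsert writes and point lookups
CREATE TABLE pk_customers (
    customer_id BIGINT,
    name        STRING,
    city        STRING,
    PRIMARY KEY (customer_id) NOT ENFORCED
) WITH ('bucket.num' = '4');
```

Declaring a PRIMARY KEY is what selects the dual LogTablet + KvTablet storage path; omitting it yields a plain Log Table.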
For detailed design patterns and use cases, see Table Types and Data Model.
Sources: website/docs/engine-flink/ddl.md61-91 website/docs/quickstart/flink.md237-268 Diagram 2 from high-level diagrams
Fluss distributes data using buckets (for parallelism) and partitions (for lifecycle management). Each bucket is the unit of replication and assignment to TabletServer nodes.
Data Distribution Model
Key Concepts:
| Concept | Description | Configuration |
|---|---|---|
| Bucket | Unit of distribution and replication | bucket.num (required) |
| Bucket Key | Field(s) used for bucketing | bucket.key (default: primary key) |
| Partition | Lifecycle management unit | PARTITIONED BY (field) |
| Replication Factor | Copies per bucket | bucket.replication.factor (default: 1) |
| ISR | In-sync replicas eligible for leadership | Managed by ReplicaManager |
Partition Management:
Manual partition creation: ALTER TABLE ADD PARTITION (dt='2024-01-01')
Automatic partition creation: table.auto-partition.enabled = 'true' together with table.auto-partition.time-unit
Multi-level partitioning, e.g. (region, dt)
For detailed bucketing and partitioning strategies, see Data Distribution and Replication.
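These partition-management options can be sketched in Flink SQL; the table name, columns, and the day time-unit below are illustrative:

```sql
-- Partitioned table with automatic daily partition creation
CREATE TABLE access_logs (
    dt      STRING,
    user_id BIGINT,
    url     STRING
) PARTITIONED BY (dt) WITH (
    'bucket.num' = '4',
    'table.auto-partition.enabled' = 'true',
    'table.auto-partition.time-unit' = 'day'
);

-- Manual partition management
ALTER TABLE access_logs ADD PARTITION (dt = '2024-01-01');
```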
Sources: website/docs/table-design/data-distribution/partitioning.md1-95 website/docs/engine-flink/options.md66-74 Diagram 6 from high-level diagrams
Fluss implements a three-tier storage model to optimize for different data access patterns:
Three-Tier Storage Model
Tier Characteristics:
| Tier | Storage | Latency | Use Case | Configuration |
|---|---|---|---|---|
| Hot | Local disk | Microseconds | Recent data, point queries | data.dir |
| Warm | Object storage | Milliseconds | Older data, recovery | remote.data.dir |
| Cold | Lakehouse | Seconds | Analytics, reporting | datalake.* properties |
Tiering Configuration Example:
Table-Level Tiering:
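The original examples were not preserved; as a sketch, the cluster-level tiers map to the configuration keys from the table above (paths and values are illustrative):

```yaml
# server.yaml — one path per storage tier (illustrative values)
data.dir: /tmp/fluss-data              # hot: local disk
remote.data.dir: s3://my-bucket/fluss  # warm: object storage
datalake.format: paimon                # cold: lakehouse tier
```

At the table level, lakehouse tiering is typically toggled with a table property such as 'table.datalake.enabled' = 'true' in the DDL (property name assumed from the lakehouse quickstart).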
Sources: website/docs/maintenance/configuration.md36-39 website/docs/quickstart/lakehouse.md36-114 Diagram 3 from high-level architecture
Fluss provides deep integration with Apache Flink through the Flink Connector (fluss-flink-*.jar), which implements Flink's catalog, source, and sink interfaces.
Flink Connector Architecture
Connector Capabilities:
| Operation | Support | Implementation |
|---|---|---|
| DDL | CREATE/ALTER/DROP TABLE, partitions | FlinkCatalog → Admin |
| Streaming Read | ✅ Full/incremental/latest | FlinkTableSource |
| Batch Read | ✅ Snapshot, point lookup, LIMIT | FlinkTableSource |
| Write | ✅ INSERT/UPDATE/DELETE | FlinkTableSink |
| Lookup Join | ✅ Async/sync, prefix lookup | FlinkLookupFunction |
Catalog Creation:
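A Fluss catalog is registered in Flink SQL roughly as follows; the bootstrap address is illustrative, and any CoordinatorServer or TabletServer address works:

```sql
CREATE CATALOG fluss_catalog WITH (
    'type' = 'fluss',
    'bootstrap.servers' = 'localhost:9123'
);
USE CATALOG fluss_catalog;
```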
Supported Flink Versions: 1.18, 1.19, 1.20, 2.1, 2.2
For detailed Flink operations, see Flink Integration.
Sources: website/docs/engine-flink/getting-started.md1-232 website/docs/engine-flink/ddl.md9-33 Diagram 3 from high-level diagrams
Fluss provides automatic tiering to lakehouse formats through a Flink-based tiering service, unifying real-time and historical data under a single table abstraction.
Lakehouse Tiering Architecture
Tiering Service Features:
| Feature | Description |
|---|---|
| Continuous Tiering | Flink job subscribes to Fluss CDC and writes to lakehouse |
| Exactly-Once | Offset tracking via CoordinatorServer ensures no data loss |
| Union Read | Flink queries automatically merge real-time + historical data |
| Lake-First | Tables created in lake catalog before Fluss registration |
| System Columns | __bucket, __offset, __timestamp preserve lineage |
Supported Lake Formats: Apache Paimon, Apache Iceberg, and Lance.
Configuration:
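A sketch of the server-side lakehouse settings, assuming a Paimon filesystem catalog; key names follow the datalake.* pattern noted above, and the paths are illustrative:

```yaml
# server.yaml — lakehouse tiering target (illustrative values)
datalake.format: paimon
datalake.paimon.metastore: filesystem
datalake.paimon.warehouse: /tmp/paimon
```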
For detailed lakehouse setup, see Streaming Lakehouse Integration.
Sources: website/docs/quickstart/lakehouse.md1-290 website/docs/streaming-lakehouse/overview.md1-47 Diagram 4 from high-level diagrams
Fluss uses a hierarchical configuration system with multiple levels of precedence:
Configuration Hierarchy
Key Configuration Files:
conf/server.yaml - Main configuration (parsed on startup)
ConfigOptions.java - Option definitions with defaults and validation
Example Configuration:
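A minimal sketch of a server.yaml; the key names follow options referenced elsewhere in this overview, and all values are illustrative:

```yaml
# conf/server.yaml (illustrative values)
zookeeper.address: localhost:2181
data.dir: /tmp/fluss-data
remote.data.dir: /tmp/fluss-remote
```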
Sources: website/docs/maintenance/configuration.md1-130 fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java1-460
Fluss supports multiple deployment modes for different environments:
| Deployment | Use Case | Setup Complexity | HA Support |
|---|---|---|---|
| Local Cluster | Development, testing | Low (single command) | No |
| Distributed Cluster | Production | Medium (manual setup) | Yes (with ZK) |
| Docker/Compose | Quick start, CI/CD | Low (docker compose) | No |
| Kubernetes/Helm | Cloud production | Medium (Helm charts) | Yes |
Quick Start:
Docker Compose Example:
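The original compose file was not preserved; a minimal single-node sketch, with image names and the environment key assumed:

```yaml
# docker-compose.yml (illustrative; image names and env keys assumed)
services:
  zookeeper:
    image: zookeeper:3.9
  coordinator-server:
    image: fluss/fluss:latest
    command: coordinatorServer
    depends_on: [zookeeper]
    environment:
      - FLUSS_PROPERTIES=zookeeper.address: zookeeper:2181
  tablet-server:
    image: fluss/fluss:latest
    command: tabletServer
    depends_on: [coordinator-server]
    environment:
      - FLUSS_PROPERTIES=zookeeper.address: zookeeper:2181
```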
Sources: website/docs/install-deploy/overview.md137-145 website/docs/install-deploy/deploying-local-cluster.md1-76 website/docs/install-deploy/deploying-with-docker.md1-367
Fluss provides comprehensive observability through metrics, logs, and integration with standard monitoring tools.
Observability Stack
Metrics Configuration:
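A sketch of enabling a Prometheus reporter in server.yaml; the reporter name and port key are assumed from the metric-reporters docs cited below:

```yaml
# server.yaml — metrics reporting (illustrative)
metrics.reporters: prometheus
metrics.reporter.prometheus.port: 9249
```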
Key Metric Categories:
Logging Configuration:
Log4j configuration (conf/log4j.properties)
Logback configuration (conf/logback.xml)
Log files are written to the log/ directory in the Fluss installation
Sources: website/docs/maintenance/observability/quickstart.md1-216 website/docs/maintenance/observability/metric-reporters.md1-125 website/docs/maintenance/observability/logging.md1-70
To start using Apache Fluss:
1. Quick Start with Docker:
2. Create Your First Table:
3. Enable Lakehouse Tiering:
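The code for these steps was not preserved here; steps 2 and 3 might look like the following in Flink SQL, with table names and the tiering property assumed:

```sql
-- 2. Create your first table (a primary key table)
CREATE TABLE first_table (
    id   BIGINT,
    name STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH ('bucket.num' = '1');

-- 3. Enable lakehouse tiering at table creation time
CREATE TABLE tiered_table (
    id   BIGINT,
    name STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'bucket.num' = '1',
    'table.datalake.enabled' = 'true'
);
```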
Where to Go Next:
This overview provides foundational concepts. For deeper exploration:
| Topic | Wiki Page | Description |
|---|---|---|
| Architecture | Architecture | CoordinatorServer, TabletServer, ReplicaManager details |
| Table Design | Table Types and Data Model | Schema design, primary keys, partitioning strategies |
| Storage | Storage Architecture | LogTablet, KvTablet, Arrow format, RocksDB internals |
| Replication | Data Distribution and Replication | ISR, leader election, bucket assignment |
| Configuration | Configuration System | ConfigOptions, server.yaml, dynamic reconfiguration |
| Flink Integration | Flink Integration | DDL/DML operations, lookup joins, streaming/batch reads |
| Lakehouse | Streaming Lakehouse | Paimon/Iceberg/Lance integration, union reads |
| Deployment | Deployment Options | Local cluster, distributed cluster, Kubernetes |
| Monitoring | Observability | Metrics, logging, Prometheus, Grafana |
Quick Links:
Sources: website/docs/quickstart/flink.md1-411 website/docs/intro.md35-42 README.md57-72
Apache Fluss provides several unique capabilities that distinguish it as a streaming storage system:
Core Features
Feature Highlights:
| Feature | Benefit | Implementation |
|---|---|---|
| Sub-Second Freshness | Real-time analytics | Arrow columnar format, tail reads |
| Column Pruning | 10x read speedup | ColumnarRow with selective column reads |
| High-QPS Lookups | Dimension enrichment | KvTablet with RocksDB |
| Union Reads | Real-time + historical | UnionSplitReader in Flink connector |
| Partial Updates | Efficient upserts | Pre-write buffer in KvTablet |
| Exactly-Once | No data loss | ISR replication, checkpoint-based tiering |
For architectural details, see Architecture.
Sources: README.md48-56 website/docs/intro.md1-34 website/src/components/HomepageFeatures/index.tsx30-67
This overview provides a foundation for understanding Apache Fluss. For deeper dives:
Sources: website/docs/intro.md35-42