feat: replace grpc with fibp binary protocol #11

Open

vieiralucas wants to merge 9 commits into main from feat/21.1-binary-protocol

Conversation


@vieiralucas vieiralucas commented Apr 4, 2026

Summary

  • Replace entire gRPC transport layer with Fila's custom binary protocol (FIBP) per docs/protocol.md
  • New fibp/ package: codec for all 42 opcodes, frame reader/writer with continuation support, encoding primitives
  • New connection manager with request/response multiplexing, server-push delivery routing, keepalive
  • Added admin operations (CreateQueue, DeleteQueue, GetStats, ListQueues, SetConfig, GetConfig, ListConfig, Redrive)
  • Added auth management (CreateApiKey, RevokeApiKey, ListApiKeys, SetAcl, GetAcl)
  • Added batch operations (AckMany, NackMany)
  • All 18 FIBP error codes mapped to sentinel errors
  • Removed gRPC/protobuf dependencies — zero external dependencies (stdlib only)
  • Net -3,008 lines, 28 tests passing

Test plan

  • go vet ./... passes
  • go test ./... — 28 tests pass (14 codec, 14 client)
  • Unit tests cover: all encoding primitives, frame round-trips, continuation frames, all opcode encode/decode, connection lifecycle, enqueue (direct + accumulator), consume streaming, ack, nack, error handling, admin ops, auth ops, leader redirect
  • Integration tests against running fila-server (requires binary)
  • CI workflow passes

🤖 Generated with Claude Code


Summary by cubic

Replaced gRPC with the Fila Binary Protocol (FIBP) to cut transport overhead and drop external deps. Adds a TCP connection manager, a full opcode codec, admin/auth APIs, and hardens consume/encoding paths.

  • New Features

    • New fibp/ package: codec for 42 opcodes, frame reader/writer with continuation, and encoding primitives.
    • TCP connection manager: request/response multiplexing, server-push routing for Consume, keepalive, and leader-hint redirect.
    • Hot path over FIBP: Enqueue, Consume, Ack, Nack; plus AckMany and NackMany. Accumulator modes (Auto/Linger/Disabled) preserved.
    • Admin and auth APIs: Create/DeleteQueue, GetStats, ListQueues, Set/Get/ListConfig, Redrive; Create/Revoke/List API keys, Set/Get ACL.
    • Maps all FIBP error codes to sentinel errors for errors.Is checks. Zero external dependencies (stdlib only).
  • Bug Fixes

    • Consume cleanup: cancel no longer closes the delivery channel; read-loop cleanup runs in defer; waiter notifications are non-blocking; leader redirect is bounded to one hop.
    • Accumulator uses context.Background for batch RPCs to avoid one caller’s cancellation canceling the whole batch.
    • Decode functions validate item counts to prevent large allocations; FrameWriter errors when maxFrameSize is too small.
    • Tests strengthened for TLS validation and leader-redirect behavior.

Written for commit 334d5b4. Summary will update on new commits.

feat: transparent leader hint reconnect on consume

Add BatchEnqueue() for explicit batch operations, three batch modes
(Auto/Linger/Disabled) controlling how Enqueue() batches internally,
and transparent unpacking of batched ConsumeResponse.messages field.

Enqueue() now routes through a background batcher by default (Auto mode)
that opportunistically groups concurrent calls. Single-item batches use
the singular Enqueue RPC to preserve specific error types. Close() drains
pending messages before disconnecting.

Proto updated to include BatchEnqueue RPC and repeated messages field on
ConsumeResponse for backward-compatible delivery batching.
- update service.proto to unified api (no more BatchEnqueue rpc)
- enqueue/ack/nack use repeated message fields
- consume uses only repeated messages field (singular removed)
- rename BatchMode to AccumulatorMode, batch.go to accumulator.go
- add EnqueueMany() replacing BatchEnqueue()
- ItemError wraps sentinel errors (ErrQueueNotFound, ErrMessageNotFound)
  so errors.Is works on per-item results
- all 20 tests pass (5 unit + 15 integration)
feat: 30.2 — unified api surface

- client.go: handle both value and pointer AccumulatorModeDisabled in
  type assertion to prevent accumulator from starting unexpectedly
- accumulator.go: cap runAuto drain loop at maxAutoBatchSize (1000) to
  prevent exceeding gRPC 4MB max message size under high throughput
- enqueue.go: validate EnqueueMany response result count matches input
  message count, fill missing results with explicit errors
- proto: add reserved directives for removed field numbers in
  EnqueueRequest, ConsumeResponse, AckRequest, NackRequest to prevent
  accidental field number reuse

Replace the entire gRPC transport layer with Fila's custom binary
protocol (FIBP). The SDK now communicates directly over TCP with
length-prefixed binary frames per docs/protocol.md.

Changes:
- New fibp/ package: codec for all 42 opcodes, frame reader/writer
  with continuation support, encoding primitives
- New conn.go: TCP connection manager with request/response
  multiplexing, server-push delivery routing, keepalive
- Migrated all hot-path ops (Enqueue, Consume, Ack, Nack) to FIBP
- Added batch operations: AckMany, NackMany
- Added admin operations: CreateQueue, DeleteQueue, GetStats,
  ListQueues, SetConfig, GetConfig, ListConfig, Redrive
- Added auth management: CreateApiKey, RevokeApiKey, ListApiKeys,
  SetAcl, GetAcl
- All 18 FIBP error codes mapped to sentinel errors
- TLS and API key auth work over FIBP handshake
- Leader hint redirect preserved for cluster mode
- Accumulator pattern preserved (auto/linger/disabled modes)
- Removed: grpc, protobuf deps, filav1/, proto/ directories
- 28 tests (14 codec + 14 client), all passing
- Zero external dependencies (stdlib only)

@cubic-dev-ai cubic-dev-ai bot left a comment


9 issues found across 30 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="conn.go">

<violation number="1" location="conn.go:146">
P1: When `ctx.Done()` triggers inside delivery forwarding, `readLoop` returns without running waiter/delivery cleanup.</violation>

<violation number="2" location="conn.go:206">
P1: `closeWaiters` can deadlock by blocking on a full waiter channel during shutdown.</violation>

<violation number="3" location="conn.go:210">
P0: `cancelConsume` can close a delivery channel while `readLoop` is sending to it, which can panic with `send on closed channel`.</violation>
</file>

<file name="client_test.go">

<violation number="1" location="client_test.go:385">
P2: `TestNotLeaderRedirect` only asserts non-nil error, so it can pass on unrelated failures instead of verifying `ProtocolError` redirect metadata.</violation>

<violation number="2" location="client_test.go:429">
P3: `TestTLSClientCertWithoutTLS` is a false-positive-prone test because it accepts any dial error from `localhost:0` instead of asserting the specific option-validation failure.</violation>
</file>

<file name="consume.go">

<violation number="1" location="consume.go:79">
P1: Leader redirect handling is unbounded recursive, despite the API promising a single retry.</violation>
</file>

<file name="fibp/frame.go">

<violation number="1" location="fibp/frame.go:34">
P2: Do not force `maxBody` to 1 when `maxFrameSize` is too small; return an error instead. The current logic can write frames larger than `maxFrameSize`.</violation>
</file>

<file name="accumulator.go">

<violation number="1" location="accumulator.go:197">
P1: The batch request uses the first item's context, so one caller's context cancellation (e.g. timeout) will fail the entire batch. Consider deriving a merged context or using `context.Background()` with a deadline from the longest-lived item, so that one caller's cancellation does not cascade to all other callers in the same batch.</violation>
</file>

<file name="fibp/decode.go">

<violation number="1" location="fibp/decode.go:49">
P2: Validate `count` before allocating the results slice. As written, a malformed frame can set `count` to a huge value and force large allocations/DoS before decode errors are raised.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

- P0: cancelConsume no longer closes delivery channel (avoids send-on-closed-channel panic from readLoop race)
- P1: readLoop cleanup runs in defer, not on ctx.Done select (ensures
  closeWaiters always runs)
- P1: closeWaiters uses non-blocking sends to avoid deadlock on full
  waiter channels
- P1: leader redirect uses direct subscribe instead of recursive
  Consume call (bounds to single redirect)
- P1: accumulator uses context.Background() for batch RPC so one
  caller's cancellation doesn't cascade to entire batch
- P2: decode functions validate count before allocating slices (DoS
  protection against malformed frames)
- P2: FrameWriter returns error when maxFrameSize is too small instead
  of silently writing oversized frames
- P2: TestNotLeaderRedirect verifies redirect was attempted, not just
  any error
- P3: TestTLSClientCertWithoutTLS asserts specific validation message
