v0.14.1 #157

yimsk · 2026-01-26T00:05:17Z

Summary

fix(ecs): batch DescribeClusters to handle >100 clusters (ECS service failure in some case #156)

Changes

DescribeClusters API has a 100-element limit. Users with >100 clusters got InvalidParameterException. Added batching (chunks of 100) matching existing pattern in tasks/dao.go and services/dao.go.

Merge

⚠️ MERGE COMMIT only (no squash — develop↔main)

claude · 2026-01-26T00:10:35Z

Pull Request Review: v0.14.1

Summary

This PR fixes a critical bug where users with >100 ECS clusters would encounter InvalidParameterException due to AWS API limits. The fix implements batching for the DescribeClusters API call, matching patterns already established in the codebase.

Code Quality ✅

Strengths:

Consistent Pattern: The batching implementation in custom/ecs/clusters/dao.go:55-75 correctly follows the existing pattern used in tasks/dao.go:95-117 and services/dao.go:90-114
Clean Implementation: The batch size of 100 is appropriate and matches the AWS API limit for DescribeClusters
Proper Slicing: Correctly handles the last batch with end = len(clusterArns) to avoid slice bounds issues

Code Structure:

for i := 0; i < len(clusterArns); i += 100 {
    end := i + 100
    if end > len(clusterArns) {
        end = len(clusterArns)
    }
    // Process batch...
}

This is the standard pattern used throughout the ECS module.

Potential Issues ⚠️

1. Error Handling Inconsistency (Medium Priority)

Current implementation (custom/ecs/clusters/dao.go:67-70):

descOutput, err := d.client.DescribeClusters(ctx, descInput)
if err != nil {
    log.Warn("describe clusters", "error", err)
    continue
}

Issue: The error handling silently continues on failure, which differs from similar code:

tasks/dao.go:110 logs: "failed to describe tasks" with cluster context
services/dao.go:105 logs: "failed to describe services" with cluster context

Recommendation:

if err != nil {
    log.Warn("failed to describe clusters batch", "batch", i/100+1, "error", err)
    continue
}

This provides:

More descriptive message matching codebase conventions
Batch number for debugging (helps identify which batch failed)
Better alignment with existing error handling patterns

2. Partial Failure Behavior (Low Priority - Informational)

When a batch fails, the function continues and returns partial results without indicating some clusters were skipped. This matches the pattern in tasks/dao.go and services/dao.go, so it's consistent with the codebase design.

However, consider whether users should be notified when cluster information is incomplete. The current silent failure could lead to confusion if users expect to see all clusters but only receive some.

Options:

Keep current behavior (consistent with existing code) ✅
Return a wrapped error that includes partial results
Add a summary log after the loop indicating total clusters vs. successfully described

For this PR, option 1 is appropriate since you're matching established patterns.

Performance Considerations ✅

Efficient Pre-allocation: resources := make([]dao.Resource, 0, len(clusterArns)) correctly pre-allocates capacity
Optimal Batch Size: Using 100 (the API limit) minimizes API calls
No Unnecessary Copies: Direct append to resources slice is efficient

The implementation should handle large cluster counts efficiently. For 1000 clusters, this reduces API calls from potentially timing out on a single 1000-element request to 10 manageable batches.

Security Concerns ✅

No security issues identified:

No user input is directly used in AWS API calls
Context is properly threaded through
No credential handling changes
Follows AWS SDK best practices

Test Coverage ⚠️

Missing: No test file exists for custom/ecs/clusters/dao.go

Recommendation: While this PR fixes a critical bug and should be merged, consider adding tests in a follow-up PR:

// custom/ecs/clusters/dao_test.go
func TestClusterDAO_List_BatchesLargeCounts(t *testing.T) {
    // Test with >100 clusters to verify batching
    // Mock ECS client to return 150 cluster ARNs
    // Verify DescribeClusters called twice (100 + 50)
}

func TestClusterDAO_List_HandlesPartialFailure(t *testing.T) {
    // Test that partial batch failures don't fail entire operation
    // Mock second batch to fail
    // Verify first batch results are still returned
}

This would match the test coverage pattern in custom/ecs/clusters/resource_test.go.

Additional Notes

Import Addition: The log import at dao.go:13 is necessary and correct
API Limit Documentation: The comment // Describe clusters in batches of 100 (API limit) is helpful and accurate
Backward Compatibility: This change is fully backward compatible - users with <100 clusters see no behavior change

Verdict: ✅ APPROVE with Minor Suggestions

This is a well-implemented bug fix that:

Solves a real user problem (ECS service failure in some case #156)
Follows established codebase patterns
Has minimal risk
Is production-ready

Suggested improvements (non-blocking):

Enhance error logging message for consistency
Add test coverage in a follow-up PR

Great work matching the existing code patterns! The fix is clean, simple, and effective.

Review generated with attention to code quality, consistency, and production readiness

fix(ecs): batch DescribeClusters to handle >100 clusters (#156)

b4589a2

This comment was marked as outdated.

Sign in to view

Merge remote-tracking branch 'origin/main' into develop

8596a7b

yimsk merged commit 1fce913 into main Jan 26, 2026
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.14.1 #157

v0.14.1 #157

Uh oh!

yimsk commented Jan 26, 2026 •

edited

Loading

Uh oh!

This comment was marked as outdated.

claude bot commented Jan 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

v0.14.1 #157

v0.14.1 #157

Uh oh!

Conversation

yimsk commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Merge

Uh oh!

This comment was marked as outdated.

claude bot commented Jan 26, 2026

Pull Request Review: v0.14.1

Summary

Code Quality ✅

Potential Issues ⚠️

1. Error Handling Inconsistency (Medium Priority)

2. Partial Failure Behavior (Low Priority - Informational)

Performance Considerations ✅

Security Concerns ✅

Test Coverage ⚠️

Additional Notes

Verdict: ✅ APPROVE with Minor Suggestions

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yimsk commented Jan 26, 2026 •

edited

Loading