Skip to content

Conversation

@yimsk
Copy link
Contributor

@yimsk yimsk commented Jan 26, 2026

Summary

Changes

DescribeClusters API has a 100-element limit. Users with >100 clusters got InvalidParameterException. Added batching (chunks of 100) matching existing pattern in tasks/dao.go and services/dao.go.

Merge

⚠️ MERGE COMMIT only (no squash — develop↔main)

@claude

This comment was marked as outdated.

@claude
Copy link

claude bot commented Jan 26, 2026

Pull Request Review: v0.14.1

Summary

This PR fixes a critical bug where users with >100 ECS clusters would encounter InvalidParameterException due to AWS API limits. The fix implements batching for the DescribeClusters API call, matching patterns already established in the codebase.

Code Quality ✅

Strengths:

  • Consistent Pattern: The batching implementation in custom/ecs/clusters/dao.go:55-75 correctly follows the existing pattern used in tasks/dao.go:95-117 and services/dao.go:90-114
  • Clean Implementation: The batch size of 100 is appropriate and matches the AWS API limit for DescribeClusters
  • Proper Slicing: Correctly handles the last batch with end = len(clusterArns) to avoid slice bounds issues

Code Structure:

for i := 0; i < len(clusterArns); i += 100 {
    end := i + 100
    if end > len(clusterArns) {
        end = len(clusterArns)
    }
    // Process batch...
}

This is the standard pattern used throughout the ECS module.

Potential Issues ⚠️

1. Error Handling Inconsistency (Medium Priority)

Current implementation (custom/ecs/clusters/dao.go:67-70):

descOutput, err := d.client.DescribeClusters(ctx, descInput)
if err != nil {
    log.Warn("describe clusters", "error", err)
    continue
}

Issue: The error handling silently continues on failure, which differs from similar code:

  • tasks/dao.go:110 logs: "failed to describe tasks" with cluster context
  • services/dao.go:105 logs: "failed to describe services" with cluster context

Recommendation:

if err != nil {
    log.Warn("failed to describe clusters batch", "batch", i/100+1, "error", err)
    continue
}

This provides:

  • More descriptive message matching codebase conventions
  • Batch number for debugging (helps identify which batch failed)
  • Better alignment with existing error handling patterns

2. Partial Failure Behavior (Low Priority - Informational)

When a batch fails, the function continues and returns partial results without indicating some clusters were skipped. This matches the pattern in tasks/dao.go and services/dao.go, so it's consistent with the codebase design.

However, consider whether users should be notified when cluster information is incomplete. The current silent failure could lead to confusion if users expect to see all clusters but only receive some.

Options:

  1. Keep current behavior (consistent with existing code) ✅
  2. Return a wrapped error that includes partial results
  3. Add a summary log after the loop indicating total clusters vs. successfully described

For this PR, option 1 is appropriate since you're matching established patterns.

Performance Considerations ✅

  • Efficient Pre-allocation: resources := make([]dao.Resource, 0, len(clusterArns)) correctly pre-allocates capacity
  • Optimal Batch Size: Using 100 (the API limit) minimizes API calls
  • No Unnecessary Copies: Direct append to resources slice is efficient

The implementation should handle large cluster counts efficiently. For 1000 clusters, this reduces API calls from potentially timing out on a single 1000-element request to 10 manageable batches.

Security Concerns ✅

No security issues identified:

  • No user input is directly used in AWS API calls
  • Context is properly threaded through
  • No credential handling changes
  • Follows AWS SDK best practices

Test Coverage ⚠️

Missing: No test file exists for custom/ecs/clusters/dao.go

Recommendation: While this PR fixes a critical bug and should be merged, consider adding tests in a follow-up PR:

// custom/ecs/clusters/dao_test.go
func TestClusterDAO_List_BatchesLargeCounts(t *testing.T) {
    // Test with >100 clusters to verify batching
    // Mock ECS client to return 150 cluster ARNs
    // Verify DescribeClusters called twice (100 + 50)
}

func TestClusterDAO_List_HandlesPartialFailure(t *testing.T) {
    // Test that partial batch failures don't fail entire operation
    // Mock second batch to fail
    // Verify first batch results are still returned
}

This would match the test coverage pattern in custom/ecs/clusters/resource_test.go.

Additional Notes

  1. Import Addition: The log import at dao.go:13 is necessary and correct
  2. API Limit Documentation: The comment // Describe clusters in batches of 100 (API limit) is helpful and accurate
  3. Backward Compatibility: This change is fully backward compatible - users with <100 clusters see no behavior change

Verdict: ✅ APPROVE with Minor Suggestions

This is a well-implemented bug fix that:

Suggested improvements (non-blocking):

  1. Enhance error logging message for consistency
  2. Add test coverage in a follow-up PR

Great work matching the existing code patterns! The fix is clean, simple, and effective.


Review generated with attention to code quality, consistency, and production readiness

@yimsk yimsk merged commit 1fce913 into main Jan 26, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants