Skip to content

Result column order does not match SELECT clause order in aggregations #47

@fupelaqu

Description

@fupelaqu

Description

When executing aggregation queries with GROUP BY, the order of columns in the result set does not match the order specified in the SELECT clause. The behavior depends on whether the aggregation has an explicit alias or not, but in both cases the result order is wrong.

Examples

Case 1: Aggregation without alias — buckets appear first

SELECT count(*), category FROM ecommerce GROUP BY category

Expected: count_* | category
Actual: category | count_*

The GROUP BY bucket column is placed first, ignoring the SELECT order.

Case 2: Aggregation with alias — aggregation appears first

SELECT category, count(*) as cnt FROM ecommerce GROUP BY category

Expected: category | cnt
Actual: cnt | category

The aliased aggregation is placed first, ignoring the SELECT order.

Case 3: Multiple columns

SELECT country, count(*) as cnt, category FROM ecommerce GROUP BY country, category

Expected: country | cnt | category
Actual: order is unpredictable

Expected Behaviour

Result columns MUST appear in the exact order specified in the SELECT clause, regardless of whether columns are bucket columns (GROUP BY), aliased aggregations, or unaliased aggregations. This is mandated by the SQL standard.

Impact

  • JDBC/Arrow clients: consumers accessing columns by index get unexpected data
  • User expectations: the SQL standard guarantees SELECT order is preserved
  • Downstream tools: Superset, DuckDB, and other tools assume column order matches the query

Resolution

Fixed as a side-effect of Issue #41 (fieldsWithComputedAliases). The root cause was a mismatch between the names used in ES aggregations and the names returned by extractOutputFieldNames, which prevented normalizeRow from correctly reordering columns.

Root cause

normalizeRow(row, requestedFields) reorders row entries to match requestedFields (SELECT clause order). However, for unnamed expression columns (e.g., COUNT(*) without AS), the ES aggregation was named count_* (via SQLAggregation.fromField) while extractOutputFieldNames returned a different name (e.g., * via sourceField). This mismatch meant normalizeRow couldn't find the field in the row, resulting in wrong column order.

How Issue #41 fixes this

fieldsWithComputedAliases assigns consistent positional aliases (__c1, __c2, ...) to expression columns without explicit aliases. This ensures:

  1. ES aggregation name = __c1 (from SQLAggregation.fromField using fieldAlias)
  2. extractOutputFieldNames returns __c1 (from fieldsWithComputedAliases)
  3. normalizeRow matches __c1 in the row → reorders correctly to SELECT clause order

For explicitly aliased columns (Case 2), the naming was already consistent — the alias is used both in the ES aggregation and in extractOutputFieldNames. normalizeRow enforces the correct order in all cases.

Tests

  • Integration tests: ReplGatewayIntegrationSpec — 3 new tests verifying column order:
    • Aggregation first, bucket second (SELECT COUNT(*), profile.city)
    • Bucket first, aliased aggregation second (SELECT profile.city, COUNT(*) AS cnt)
    • Mixed: bucket, aliased aggregation, unnamed aggregation (SELECT profile.city, COUNT(*) AS cnt, AVG(age))

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions