-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Description
When executing aggregation queries with GROUP BY, the order of columns in the result set does not match the order specified in the SELECT clause. The behavior depends on whether the aggregation has an explicit alias or not, but in both cases the result order is wrong.
Examples
Case 1: Aggregation without alias — buckets appear first
SELECT count(*), category FROM ecommerce GROUP BY categoryExpected: count_* | category
Actual: category | count_*
The GROUP BY bucket column is placed first, ignoring the SELECT order.
Case 2: Aggregation with alias — aggregation appears first
SELECT category, count(*) as cnt FROM ecommerce GROUP BY categoryExpected: category | cnt
Actual: cnt | category
The aliased aggregation is placed first, ignoring the SELECT order.
Case 3: Multiple columns
SELECT country, count(*) as cnt, category FROM ecommerce GROUP BY country, categoryExpected: country | cnt | category
Actual: order is unpredictable
Expected Behaviour
Result columns MUST appear in the exact order specified in the SELECT clause, regardless of whether columns are bucket columns (GROUP BY), aliased aggregations, or unaliased aggregations. This is mandated by the SQL standard.
Impact
- JDBC/Arrow clients: consumers accessing columns by index get unexpected data
- User expectations: the SQL standard guarantees SELECT order is preserved
- Downstream tools: Superset, DuckDB, and other tools assume column order matches the query
Resolution
Fixed as a side-effect of Issue #41 (fieldsWithComputedAliases). The root cause was a mismatch between the names used in ES aggregations and the names returned by extractOutputFieldNames, which prevented normalizeRow from correctly reordering columns.
Root cause
normalizeRow(row, requestedFields) reorders row entries to match requestedFields (SELECT clause order). However, for unnamed expression columns (e.g., COUNT(*) without AS), the ES aggregation was named count_* (via SQLAggregation.fromField) while extractOutputFieldNames returned a different name (e.g., * via sourceField). This mismatch meant normalizeRow couldn't find the field in the row, resulting in wrong column order.
How Issue #41 fixes this
fieldsWithComputedAliases assigns consistent positional aliases (__c1, __c2, ...) to expression columns without explicit aliases. This ensures:
- ES aggregation name =
__c1(fromSQLAggregation.fromFieldusingfieldAlias) extractOutputFieldNamesreturns__c1(fromfieldsWithComputedAliases)normalizeRowmatches__c1in the row → reorders correctly to SELECT clause order
For explicitly aliased columns (Case 2), the naming was already consistent — the alias is used both in the ES aggregation and in extractOutputFieldNames. normalizeRow enforces the correct order in all cases.
Tests
- Integration tests:
ReplGatewayIntegrationSpec— 3 new tests verifying column order:- Aggregation first, bucket second (
SELECT COUNT(*), profile.city) - Bucket first, aliased aggregation second (
SELECT profile.city, COUNT(*) AS cnt) - Mixed: bucket, aliased aggregation, unnamed aggregation (
SELECT profile.city, COUNT(*) AS cnt, AVG(age))
- Aggregation first, bucket second (