-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-54854][SQL] Add a UUIDv7 queryId to SQLExecution Events #53625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
Outdated
Show resolved
Hide resolved
| val outputMode: OutputMode, | ||
| val checkpointLocation: String, | ||
| val queryId: UUID, | ||
| override val queryId: UUID, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this use UUIDv7?
sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a UUIDv7-based queryId to SQL execution events to provide a globally unique, time-ordered identifier for tracking SQL queries across systems. The key changes include:
- Introduction of a UUIDv7 generator for creating time-ordered unique identifiers
- Addition of
queryIdfield toQueryExecutionand propagation through the SQL execution lifecycle - Modification of
SparkListenerSQLExecutionStartto include thequeryId
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| sql/core/src/main/scala/org/apache/spark/sql/util/UUIDv7Generator.scala | New UUIDv7 generator implementation following RFC draft specification |
| sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala | Adds queryId field and executionCount tracking to QueryExecution |
| sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala | Implements queryId propagation via SparkContext local properties and generation logic |
| sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala | Adds queryId parameter to SparkListenerSQLExecutionStart event |
| sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala | Updates event handler to extract queryId from events |
| sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/runtime/IncrementalExecution.scala | Declares queryId as override since it's now inherited from QueryExecution |
| sql/core/src/test/scala/org/apache/spark/sql/util/UUIDv7GeneratorSuite.scala | Comprehensive test suite for UUIDv7 generator covering format, uniqueness, monotonicity, and timestamp accuracy |
| sql/core/src/test/scala/org/apache/spark/sql/execution/SQLExecutionSuite.scala | Tests for queryId propagation in concurrent and sequential execution scenarios |
| sql/core/src/test/scala/org/apache/spark/sql/execution/ui/*.scala | Updates test event constructors to include None for queryId parameter |
| sql/core/src/test/scala/org/apache/spark/sql/execution/history/*.scala | Updates test event constructors to include None for queryId parameter |
| sql/connect/server/src/test/scala/org/apache/spark/sql/connect/ui/SparkConnectServerListenerSuite.scala | Updates test event constructors to include None for queryId parameter |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format#section-5.2 | ||
| */ | ||
|
|
||
| private val random = new Random() |
Copilot
AI
Dec 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shared Random instance in the object is not thread-safe. Since generateFrom can be called from multiple threads concurrently (via SQLExecution.withNewExecutionId0), the shared Random instance may produce non-unique UUIDs due to race conditions in Random.nextLong().
Consider using ThreadLocalRandom.current() instead of a shared Random instance to ensure thread-safety. ThreadLocalRandom is the standard approach for concurrent random number generation in Java/Scala and is used elsewhere in the Spark codebase.
sql/core/src/main/scala/org/apache/spark/sql/util/UUIDv7Generator.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
Outdated
Show resolved
Hide resolved
| * Deterministic UUIDv7 generation from epochMilli and nanos. | ||
| * Called by generate() and used for testing. | ||
| */ | ||
| def generateFrom(epochMilli: Long, nano: Int): UUID = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nano is not mentioned in the RFC, does this implementation have a reference version?
BTW, OpenJDK 26 starts to provide a built-in UUIDv7 implementation openjdk/jdk@642ba4c, which could be a good reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this implementation follows the spec here
|
could the
|
|
|
||
| // Tracks how many times this QueryExecution has been executed. | ||
| // Used by SQLExecution to determine whether to use the existing queryId or generate a new one. | ||
| val executionCount = new AtomicInteger(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a need to expose executionCount? Should we make it private[sql]? Also, it seems that an AtomicBoolean is sufficient to check whether this is the first execution. Theoretically, executionCount can go back to 0 after many executions.
What changes were proposed in this pull request?
Add a new UUIDv7
queryIdobject to SparkListenerSQLExecutionStart and propagate it through the SQL execution lifecycle via SparkContext local properties.Currently, Spark uses
executionIdto connect jobs, stages, and tasks with SQL executions. However, this field is not globally unique, as multiple Spark applications can include the sameexecutionIds. UUIDv7 allows for a time-ordered, globally unique identifier for improved telemetry across systems.In a separate PR, plan to add
queryIdas a new field to SparkUI.Why are the changes needed?
Add a globally unique, time-ordered identifier for Spark SQL query execution events.
Does this PR introduce any user-facing change?
No, this PR simply adds the internal queryId which is not yet surfaced.
How was this patch tested?
Added tests for UUIDv7 generator and SQLExecution queryId propagation.
Was this patch authored or co-authored using generative AI tooling?
UUIDv7Generator was written with help of
claude-4.5-sonnetaccording to the specification in https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format#section-5.2