[data] Sample schemas in unification #55845

Closed
iamjustinhsu wants to merge 4 commits into ray-project:master from iamjustinhsu:jhsu/unify-schemas-samples

@iamjustinhsu
Contributor

Why are these changes needed?

Schema unification is slow. This PR adds a sample_size parameter that limits how many schemas are sampled during unification. When sample_size=1, unification is skipped and the first non-empty schema is returned.
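
A minimal sketch of the idea, not the code in this PR: schemas are modeled as plain dicts mapping column name to type name, and "sampling" is assumed to mean taking the first sample_size non-empty schemas. The function name `unify_schemas_sampled` and the `Schema` alias are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a real schema object: column name -> type name.
Schema = Dict[str, str]


def unify_schemas_sampled(
    schemas: List[Optional[Schema]], sample_size: Optional[int] = None
) -> Optional[Schema]:
    """Merge up to `sample_size` non-empty schemas; None means use all of them."""
    if sample_size is not None and sample_size < 1:
        raise ValueError("sample_size must be a positive integer or None")

    # Drop missing (None) and empty schemas before sampling.
    non_empty = [s for s in schemas if s]
    if not non_empty:
        return None

    # Fast path: skip unification and return the first non-empty schema.
    if sample_size == 1:
        return dict(non_empty[0])

    # Assumption: the sample is simply the first `sample_size` schemas.
    sampled = non_empty if sample_size is None else non_empty[:sample_size]

    unified: Schema = {}
    for schema in sampled:
        for name, dtype in schema.items():
            existing = unified.setdefault(name, dtype)
            if existing != dtype:
                raise TypeError(
                    f"column {name!r} has conflicting types: {existing} vs {dtype}"
                )
    return unified


schemas = [None, {}, {"a": "int64"}, {"a": "int64", "b": "string"}, {"c": "float64"}]
first_only = unify_schemas_sampled(schemas, sample_size=1)
first_two = unify_schemas_sampled(schemas, sample_size=2)
everything = unify_schemas_sampled(schemas)
```

The trade-off in this sketch is that any column appearing only in schemas outside the sample (here, `c` when sample_size=2) is missing from the result, so a small sample is only safe where a partial schema is acceptable.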

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner August 22, 2025 18:46
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a sample_size parameter to schema unification functions to improve performance by limiting the number of schemas processed. The changes are applied across various parts of the codebase, using sample_size=1 for performance-sensitive intermediate steps and sample_size=None (for full unification) where a complete schema is necessary.

I've found a critical bug in the implementation that could lead to a TypeError when sample_size is None, which is a common case in the new code. I've also identified an opportunity to refactor duplicated code to improve maintainability. Please see my detailed comments.
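
The review does not quote the offending line, so the following is only one plausible shape of such a bug, with hypothetical helper names: in Python 3, comparing an int against None raises a TypeError, so any size check has to handle sample_size=None before comparing.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def take_sample_unguarded(items: Sequence[T], sample_size: Optional[int]) -> List[T]:
    # Hypothetical buggy pattern: `len(items) > None` raises TypeError in Python 3.
    if len(items) > sample_size:
        return list(items[:sample_size])
    return list(items)


def take_sample(items: Sequence[T], sample_size: Optional[int]) -> List[T]:
    # Guarded version: treat None as "no limit" before any comparison.
    if sample_size is None:
        return list(items)
    return list(items[:sample_size])


try:
    take_sample_unguarded([1, 2, 3], None)
    unguarded_failed = False
except TypeError:
    unguarded_failed = True

all_items = take_sample([1, 2, 3], None)
two_items = take_sample([1, 2, 3], 2)
```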

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Aug 22, 2025
@iamjustinhsu
Contributor Author

No longer needed; closing in favor of #55926 and #55854.

@iamjustinhsu iamjustinhsu deleted the jhsu/unify-schemas-samples branch August 28, 2025 20:52
Labels

data Ray Data-related issues
