[Variant] Improve getter API for `VariantList` and `VariantObject` #7757

friendlymatthew · 2025-06-23T23:48:33Z

Which issue does this PR close?

Closes [Variant] Improved API for accessing Variant Objects and lists #7756

Rationale for this change

Updates VariantList and VariantObject getter methods to follow similar conventions to std::Vec and std::HashMap:

Existing method	Proposed method
`VariantList::get(index) -> Result<Variant>`	`VariantList::get(index) -> Option<Variant>`
`VariantObject::field_by_name(name) -> Result<Option<Variant>>`	`VariantObject::get(name) -> Option<Variant>`
`VariantObject::field(i) -> Result<Variant>`	`VariantObject::field(i) -> Option<Variant>`
`VariantObject::field_name(i) -> Result<Variant>`	`VariantObject::field_name(i) -> Option<Variant>`

One thing to note, however, the existing methods all returned Result<T> since these getters are not only exposed as public API, but also used for validation inside the constructor try_new.

Since the latter usage requires an error message, I chose to rename the original fallible methods with an _err suffix and use them internally. The new public methods now act as wrappers over these _err variants.

Are there any user-facing changes?

New API (and docs with tests)

friendlymatthew · 2025-06-23T23:50:48Z

parquet-variant/src/variant/list.rs

I chose this path since VariantList::get was used in two separate code paths-- 1) to retrieve the element at that index and 2) to perform validation in VariantList::try_new.

Since the latter path requires an error message, I chose to private and rename the fallible get for internal usage

#7711 is an interesting/related issue to think about..

Agree. And if we do make the constructor less thorough then we have to be more careful in our implementations of get et al, because they could be encountering bad bytes for the first time.

friendlymatthew · 2025-06-24T02:05:31Z

parquet-variant/src/variant/object.rs

A surprising bug arose! This assertion will also fail on main 👀

I recommend we file a ticket

The assertion on L290 fails? Or the field_name(3) call on L289 panics?

Looking at the code, it seems like a missing bounds check in field_name which reads garbage from the wrong part of the slice, which I guess must somehow cause a panic inside the metadata dictionary lookup?

... but this PR adds the missing bounds checks, so I would expect the panic went away? If so, it would be good to triage the actual panic (in metadata dictionary code), since the bounds check here only masks the real problem.

Here is the issue for this bug: #7784

This behavior doesn't occur when using a VariantBuilder (since the underlying metadata is slightly different), but it's still important to figure out why this test case is failing.

alamb

Thank you @friendlymatthew -- this looks great. I have a few suggestions

@scovich do you perhaps have time to review this PR as well?

also FYI @PinkCrow007 @mkarbo @superserious-dev @Weijun-H and @carpecodeum (the list of people who have already contributed code to Variant is getting quite nice)

alamb · 2025-06-24T09:36:36Z

parquet-variant/src/builder.rs

parquet-variant/src/variant/object.rs

alamb · 2025-06-24T09:42:13Z

parquet-variant/src/variant/object.rs

I worry that the ok() discards errors and will make it hard to debug

Also creating an ArrowError requires allocating a String so it is non trivial in terms of cost to create an Err and then discard it via ok()

What would you think about making errors (which can happen with invalid variant values) panics ?

Something like

Suggested change

self.field_err(i).ok()

self.field_err(i).expect("Invalid variant value")

We may want to special case the check for out of bounds access to return None explicitly for performance reasons.

I think the validation strategy @scovich and I have been discussing is that the Variant would be validated during construction and then the get and other APIs would then assume that validation has already been done.

Eventually we envision there may be a way to disable the validation on construction for usecases where we know the variant is valid.

alamb · 2025-06-24T09:43:17Z

parquet-variant/src/variant/object.rs

What would you think about naming this try_field instead of field_err (and similarly for the other error variants?)

I think the try_ prefix is somewhat more Rust "standard" though I do agree it is somewhat of an opinion

parquet-variant/src/variant/object.rs

alamb · 2025-06-24T09:44:33Z

parquet-variant/src/variant/object.rs

I recommend we file a ticket

alamb · 2025-06-24T09:47:06Z

parquet-variant/src/variant/list.rs

Same comment as below about documenting error behavior and not ignoring that error

However, for the case where index is out of bounds specifically, it might make sense to return None (but not ignore other errors)

friendlymatthew · 2025-06-24T12:55:15Z

I'm going to document these changes after work. Will revert to a draft for now!

scovich · 2025-06-24T21:33:31Z

parquet-variant/src/variant/object.rs

I don't love panics... is there a reason out of bounds is None but any other problem is a panic? If anything, this is backward -- the caller could prevent an out of bounds panic by checking length before making this call. But they can't prevent a panic caused by invalid bytes in the buffer.

I guess the constructor currently validates the bytes, so this should actually be unreachable code.
Still, ok() seems cleaner?

Heh. And now I see outdated comments that recommended moving away from ok()...

Given that this panic is ostensibly unreachable, would it be a good place for something like this?

let result = self.try_field_name(i); debug_assert!(matches!(result, Ok(_)), "Unexpected validation failure: {result:?}"); result.ok()

maybe expect would be cleaner?

self.try_field_name(i) .expect("Unexpected validation failure")

The expect will cause a panic but will include the message and the underlying error message

I used expect and I agree with @scovich's comments about the err case being basically unreachable. But I imagine this code will change with #7711

scovich · 2025-06-24T21:36:02Z

parquet-variant/src/variant/list.rs

Agree. And if we do make the constructor less thorough then we have to be more careful in our implementations of get et al, because they could be encountering bad bytes for the first time.

scovich · 2025-06-24T21:41:57Z

parquet-variant/src/utils.rs

This really needs try_trait_v2 feature to stabilize... then the passed in key extractor could decide the return type (Option vs. Result)

scovich · 2025-06-24T21:47:18Z

parquet-variant/src/variant/object.rs

The assertion on L290 fails? Or the field_name(3) call on L289 panics?

Looking at the code, it seems like a missing bounds check in field_name which reads garbage from the wrong part of the slice, which I guess must somehow cause a panic inside the metadata dictionary lookup?

... but this PR adds the missing bounds checks, so I would expect the panic went away? If so, it would be good to triage the actual panic (in metadata dictionary code), since the bounds check here only masks the real problem.

alamb · 2025-06-25T00:50:38Z

parquet-variant/src/variant/object.rs

Suggested change

match self.try_field(i) {

Ok(field) => Some(field),

Err(err) => panic!("validation error: {}", err),

}

self.try_field(i).expect("validation error after construction")

friendlymatthew · 2025-06-26T00:27:46Z

Hi @alamb and @scovich,

This PR should be ready for review.

Some high level details about this PR:

I commented out the failing tests and filed an issue [Variant] Field lookup with out of bounds index causes unwanted behavior #7784
I added some comments to our getter methods about the panics. Since these panics are basically unreachable, I want to rethink validation: [Variant] Revisit validation cost of infallible iterators #7711

alamb · 2025-06-26T21:36:46Z

Agree. And if we do make the constructor less thorough then we have to be more careful in our implementations of get et al, because they could be encountering bad bytes for the first time.

I added some comments to our getter methods about the panics. Since these panics are basically unreachable, I want to rethink validation: #7711

I think there are two distinct usecases and the validation behavior will likely differ between them:

Code that wants to programatically traverse Variants in some arbitrary order (e.g. tests, converting to json, etc) that will likely touch most/all of the Variant values
Code that wants to try to select a field from the Variant (eg. a query like variant_field(variant_column, "my.nested.field") or whatever.

I think our current behavior of Validate on construction works well for the first usecase

I think the second usecase will most likely prefer only validating the part of the variant on the path to retrieving the requested fied

Something like this:

// create a variant, validate on read
let variant = Variant::new_no_validation(&metadata, &value);
let field = variant.try_field("foo")?.try_field("bar")?

It would be on the user to choose the API that was appropriate (e.g. use of field() without first validating the variant can panic).

I think there is precident for this with rust slices:

calling arr[100] when the 100 is out of bounds will panic
Calling arr.get(100) does the bounds check

THoughts?

alamb

I think this looks great to me -- thank you @friendlymatthew

parquet-variant/src/variant/object.rs

scovich · 2025-06-26T22:28:58Z

Something like this:

// create a variant, validate on read
let variant = Variant::new_no_validation(&metadata, &value);
let field = variant.try_field("foo")?.try_field("bar")?

That could work, and there is certainly precedent as you say. We would need to surface the fallible versions of the methods, and the (currently private) fallible iterators as well, tho. And probably a validate method that does deep and/or shallow validation. And potentially we would want the underlying fallible iterator to use the no-validation constructor, and the validating constructor would then invoke that validation method on every returned child.

I'll try to take a stab at that, if I can scrape together some time.

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

scovich

Forgot to stamp it before

alamb · 2025-06-27T10:34:03Z

I'll try to take a stab at that, if I can scrape together some time.

Thanks @scovich

I think improving the validation is tracked by

[Variant] Revisit validation cost of infallible iterators #7711

alamb · 2025-06-27T10:39:39Z

Thanks again @scovich and @friendlymatthew -- Variants are looking pretty good

# Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Follow on to #7757 # Rationale for this change Clippy is failing after merging #7757 due to a logical conflict and a new clippy release: https://github.com/apache/arrow-rs/actions/runs/15924256410/job/44917809525 ``` error: variables can be used directly in the `format!` string --> parquet-variant/src/variant/list.rs:130:25 | 130 | Err(err) => panic!("validation error: {}", err), | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#uninlined_format_args = note: `-D clippy::uninlined-format-args` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(clippy::uninlined_format_args)]` help: change this to | 1[30](https://github.com/apache/arrow-rs/actions/runs/15924256410/job/44917809525#step:6:31) - Err(err) => panic!("validation error: {}", err), 130 + Err(err) => panic!("validation error: {err}"), | ``` # What changes are included in this PR? Fix clippy to get CI clean on main # Are these changes tested? By CI # Are there any user-facing changes? No

github-actions bot added the parquet Changes to the parquet crate label Jun 23, 2025

friendlymatthew commented Jun 23, 2025

View reviewed changes

friendlymatthew marked this pull request as draft June 23, 2025 23:56

friendlymatthew force-pushed the friendlymatthew/friendlier-getters-for-list-and-objects branch 2 times, most recently from ca77a33 to f22466a Compare June 24, 2025 02:04

friendlymatthew commented Jun 24, 2025

View reviewed changes

friendlymatthew marked this pull request as ready for review June 24, 2025 02:05

alamb reviewed Jun 24, 2025

View reviewed changes

friendlymatthew force-pushed the friendlymatthew/friendlier-getters-for-list-and-objects branch from f22466a to 17a240a Compare June 24, 2025 12:16

friendlymatthew marked this pull request as draft June 24, 2025 12:55

friendlymatthew force-pushed the friendlymatthew/friendlier-getters-for-list-and-objects branch from d04117f to 270b5de Compare June 24, 2025 18:20

scovich reviewed Jun 24, 2025

View reviewed changes

alamb reviewed Jun 25, 2025

View reviewed changes

friendlymatthew force-pushed the friendlymatthew/friendlier-getters-for-list-and-objects branch from 2e9e06f to d0cc4f1 Compare June 25, 2025 17:28

friendlymatthew mentioned this pull request Jun 25, 2025

[Variant] Field lookup with out of bounds index causes unwanted behavior #7784

Closed

friendlymatthew added 5 commits June 25, 2025 20:12

Friendlier getters for List and Objects

973cb8f

Address comments - pass 1

e1a7880

Use expect

020a8c9

Comment known panic test case

2a619c3

Tests

c98d636

friendlymatthew force-pushed the friendlymatthew/friendlier-getters-for-list-and-objects branch from b74fc78 to c98d636 Compare June 26, 2025 00:12

friendlymatthew added 2 commits June 25, 2025 20:17

Add comment about panics

a26df80

Reword test

cd8492c

friendlymatthew marked this pull request as ready for review June 26, 2025 00:23

alamb approved these changes Jun 26, 2025

View reviewed changes

parquet-variant/src/variant/object.rs Show resolved Hide resolved

Update parquet-variant/src/variant/object.rs

1e55623

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

scovich approved these changes Jun 27, 2025

View reviewed changes

alamb mentioned this pull request Jun 27, 2025

[Variant] Revisit validation cost of infallible iterators #7711

Open

alamb merged commit 1fdb318 into apache:main Jun 27, 2025
11 of 12 checks passed

alamb mentioned this pull request Jun 27, 2025

Minor: fix clippy in parquet-variant after logical conflict #7803

Merged

alamb mentioned this pull request Jul 28, 2025

[Variant] Improved API for accessing Variant Objects and lists #7756

Closed

	self.field_err(i).ok()
	self.field_err(i).expect("Invalid variant value")

-        match self.try_field(i) {
-            Ok(field) => Some(field),
-            Err(err) => panic!("validation error: {}", err),
-        }
+       self.try_field(i).expect("validation error after construction")

[Variant] Improve getter API for VariantList and VariantObject #7757

[Variant] Improve getter API for VariantList and VariantObject #7757

Uh oh!

Conversation

friendlymatthew commented Jun 23, 2025 • edited by alamb Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

Are there any user-facing changes?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

friendlymatthew commented Jun 24, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

friendlymatthew Jun 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

friendlymatthew commented Jun 26, 2025

Uh oh!

alamb commented Jun 26, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

scovich commented Jun 26, 2025

Uh oh!

scovich left a comment

Choose a reason for hiding this comment

Uh oh!

alamb commented Jun 27, 2025

Uh oh!

Uh oh!

alamb commented Jun 27, 2025

[Variant] Improve getter API for `VariantList` and `VariantObject` #7757

[Variant] Improve getter API for `VariantList` and `VariantObject` #7757

friendlymatthew commented Jun 23, 2025 •

edited by alamb

Loading

friendlymatthew Jun 25, 2025 •

edited

Loading