-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] Improve getter API for VariantList and VariantObject
#7757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Variant] Improve getter API for VariantList and VariantObject
#7757
Conversation
parquet-variant/src/variant/list.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose this path since VariantList::get was used in two separate code paths-- 1) to retrieve the element at that index and 2) to perform validation in VariantList::try_new.
Since the latter path requires an error message, I chose to private and rename the fallible get for internal usage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#7711 is an interesting/related issue to think about..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. And if we do make the constructor less thorough then we have to be more careful in our implementations of get et al, because they could be encountering bad bytes for the first time.
ca77a33 to
f22466a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A surprising bug arose! This assertion will also fail on main 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend we file a ticket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion on L290 fails? Or the field_name(3) call on L289 panics?
Looking at the code, it seems like a missing bounds check in field_name which reads garbage from the wrong part of the slice, which I guess must somehow cause a panic inside the metadata dictionary lookup?
... but this PR adds the missing bounds checks, so I would expect the panic went away? If so, it would be good to triage the actual panic (in metadata dictionary code), since the bounds check here only masks the real problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the issue for this bug: #7784
This behavior doesn't occur when using a VariantBuilder (since the underlying metadata is slightly different), but it's still important to figure out why this test case is failing.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @friendlymatthew -- this looks great. I have a few suggestions
@scovich do you perhaps have time to review this PR as well?
also FYI @PinkCrow007 @mkarbo @superserious-dev @Weijun-H and @carpecodeum (the list of people who have already contributed code to Variant is getting quite nice)
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry that the ok() discards errors and will make it hard to debug
Also creating an ArrowError requires allocating a String so it is non trivial in terms of cost to create an Err and then discard it via ok()
What would you think about making errors (which can happen with invalid variant values) panics ?
Something like
| self.field_err(i).ok() | |
| self.field_err(i).expect("Invalid variant value") |
We may want to special case the check for out of bounds access to return None explicitly for performance reasons.
I think the validation strategy @scovich and I have been discussing is that the Variant would be validated during construction and then the get and other APIs would then assume that validation has already been done.
Eventually we envision there may be a way to disable the validation on construction for usecases where we know the variant is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you think about naming this try_field instead of field_err (and similarly for the other error variants?)
I think the try_ prefix is somewhat more Rust "standard" though I do agree it is somewhat of an opinion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend we file a ticket
parquet-variant/src/variant/list.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as below about documenting error behavior and not ignoring that error
However, for the case where index is out of bounds specifically, it might make sense to return None (but not ignore other errors)
f22466a to
17a240a
Compare
|
I'm going to document these changes after work. Will revert to a draft for now! |
d04117f to
270b5de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love panics... is there a reason out of bounds is None but any other problem is a panic? If anything, this is backward -- the caller could prevent an out of bounds panic by checking length before making this call. But they can't prevent a panic caused by invalid bytes in the buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the constructor currently validates the bytes, so this should actually be unreachable code.
Still, ok() seems cleaner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh. And now I see outdated comments that recommended moving away from ok()...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this panic is ostensibly unreachable, would it be a good place for something like this?
let result = self.try_field_name(i);
debug_assert!(matches!(result, Ok(_)), "Unexpected validation failure: {result:?}");
result.ok()There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe expect would be cleaner?
self.try_field_name(i)
.expect("Unexpected validation failure")The expect will cause a panic but will include the message and the underlying error message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet-variant/src/variant/list.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. And if we do make the constructor less thorough then we have to be more careful in our implementations of get et al, because they could be encountering bad bytes for the first time.
parquet-variant/src/utils.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really needs try_trait_v2 feature to stabilize... then the passed in key extractor could decide the return type (Option vs. Result)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion on L290 fails? Or the field_name(3) call on L289 panics?
Looking at the code, it seems like a missing bounds check in field_name which reads garbage from the wrong part of the slice, which I guess must somehow cause a panic inside the metadata dictionary lookup?
... but this PR adds the missing bounds checks, so I would expect the panic went away? If so, it would be good to triage the actual panic (in metadata dictionary code), since the bounds check here only masks the real problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| match self.try_field(i) { | |
| Ok(field) => Some(field), | |
| Err(err) => panic!("validation error: {}", err), | |
| } | |
| self.try_field(i).expect("validation error after construction") |
2e9e06f to
d0cc4f1
Compare
b74fc78 to
c98d636
Compare
|
This PR should be ready for review. Some high level details about this PR:
|
I think there are two distinct usecases and the validation behavior will likely differ between them:
I think our current behavior of Validate on construction works well for the first usecase I think the second usecase will most likely prefer only validating the part of the variant on the path to retrieving the requested fied Something like this: // create a variant, validate on read
let variant = Variant::new_no_validation(&metadata, &value);
let field = variant.try_field("foo")?.try_field("bar")?It would be on the user to choose the API that was appropriate (e.g. use of I think there is precident for this with rust slices:
THoughts? |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great to me -- thank you @friendlymatthew
That could work, and there is certainly precedent as you say. We would need to surface the fallible versions of the methods, and the (currently private) fallible iterators as well, tho. And probably a I'll try to take a stab at that, if I can scrape together some time. |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to stamp it before
Thanks @scovich I think improving the validation is tracked by |
|
Thanks again @scovich and @friendlymatthew -- Variants are looking pretty good |
# Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Follow on to #7757 # Rationale for this change Clippy is failing after merging #7757 due to a logical conflict and a new clippy release: https://github.com/apache/arrow-rs/actions/runs/15924256410/job/44917809525 ``` error: variables can be used directly in the `format!` string --> parquet-variant/src/variant/list.rs:130:25 | 130 | Err(err) => panic!("validation error: {}", err), | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#uninlined_format_args = note: `-D clippy::uninlined-format-args` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(clippy::uninlined_format_args)]` help: change this to | 1[30](https://github.com/apache/arrow-rs/actions/runs/15924256410/job/44917809525#step:6:31) - Err(err) => panic!("validation error: {}", err), 130 + Err(err) => panic!("validation error: {err}"), | ``` # What changes are included in this PR? Fix clippy to get CI clean on main # Are these changes tested? By CI # Are there any user-facing changes? No
Which issue does this PR close?
Rationale for this change
Updates
VariantListandVariantObjectgetter methods to follow similar conventions tostd::Vecandstd::HashMap:VariantList::get(index) -> Result<Variant>VariantList::get(index) -> Option<Variant>VariantObject::field_by_name(name) -> Result<Option<Variant>>VariantObject::get(name) -> Option<Variant>VariantObject::field(i) -> Result<Variant>VariantObject::field(i) -> Option<Variant>VariantObject::field_name(i) -> Result<Variant>VariantObject::field_name(i) -> Option<Variant>One thing to note, however, the existing methods all returned
Result<T>since these getters are not only exposed as public API, but also used for validation inside the constructortry_new.Since the latter usage requires an error message, I chose to rename the original fallible methods with an
_errsuffix and use them internally. The new public methods now act as wrappers over these_errvariants.Are there any user-facing changes?
New API (and docs with tests)