Skip to content

Conversation

@friendlymatthew
Copy link
Contributor

@friendlymatthew friendlymatthew commented Jun 23, 2025

Which issue does this PR close?

Rationale for this change

Updates VariantList and VariantObject getter methods to follow similar conventions to std::Vec and std::HashMap:

Existing method Proposed method
VariantList::get(index) -> Result<Variant> VariantList::get(index) -> Option<Variant>
VariantObject::field_by_name(name) -> Result<Option<Variant>> VariantObject::get(name) -> Option<Variant>
VariantObject::field(i) -> Result<Variant> VariantObject::field(i) -> Option<Variant>
VariantObject::field_name(i) -> Result<Variant> VariantObject::field_name(i) -> Option<Variant>

One thing to note, however, the existing methods all returned Result<T> since these getters are not only exposed as public API, but also used for validation inside the constructor try_new.

Since the latter usage requires an error message, I chose to rename the original fallible methods with an _err suffix and use them internally. The new public methods now act as wrappers over these _err variants.

Are there any user-facing changes?

New API (and docs with tests)

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 23, 2025
Comment on lines 126 to 132
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose this path since VariantList::get was used in two separate code paths-- 1) to retrieve the element at that index and 2) to perform validation in VariantList::try_new.

Since the latter path requires an error message, I chose to private and rename the fallible get for internal usage

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#7711 is an interesting/related issue to think about..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. And if we do make the constructor less thorough then we have to be more careful in our implementations of get et al, because they could be encountering bad bytes for the first time.

@friendlymatthew friendlymatthew marked this pull request as draft June 23, 2025 23:56
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/friendlier-getters-for-list-and-objects branch 2 times, most recently from ca77a33 to f22466a Compare June 24, 2025 02:04
Comment on lines 274 to 301
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A surprising bug arose! This assertion will also fail on main 👀

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we file a ticket

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion on L290 fails? Or the field_name(3) call on L289 panics?

Looking at the code, it seems like a missing bounds check in field_name which reads garbage from the wrong part of the slice, which I guess must somehow cause a panic inside the metadata dictionary lookup?

... but this PR adds the missing bounds checks, so I would expect the panic went away? If so, it would be good to triage the actual panic (in metadata dictionary code), since the bounds check here only masks the real problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the issue for this bug: #7784

This behavior doesn't occur when using a VariantBuilder (since the underlying metadata is slightly different), but it's still important to figure out why this test case is failing.

@friendlymatthew friendlymatthew marked this pull request as ready for review June 24, 2025 02:05
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @friendlymatthew -- this looks great. I have a few suggestions

@scovich do you perhaps have time to review this PR as well?

also FYI @PinkCrow007 @mkarbo @superserious-dev @Weijun-H and @carpecodeum (the list of people who have already contributed code to Variant is getting quite nice)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that the ok() discards errors and will make it hard to debug

Also creating an ArrowError requires allocating a String so it is non trivial in terms of cost to create an Err and then discard it via ok()

What would you think about making errors (which can happen with invalid variant values) panics ?

Something like

Suggested change
self.field_err(i).ok()
self.field_err(i).expect("Invalid variant value")

We may want to special case the check for out of bounds access to return None explicitly for performance reasons.

I think the validation strategy @scovich and I have been discussing is that the Variant would be validated during construction and then the get and other APIs would then assume that validation has already been done.

Eventually we envision there may be a way to disable the validation on construction for usecases where we know the variant is valid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about naming this try_field instead of field_err (and similarly for the other error variants?)

I think the try_ prefix is somewhat more Rust "standard" though I do agree it is somewhat of an opinion

Comment on lines 274 to 301
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we file a ticket

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as below about documenting error behavior and not ignoring that error

However, for the case where index is out of bounds specifically, it might make sense to return None (but not ignore other errors)

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/friendlier-getters-for-list-and-objects branch from f22466a to 17a240a Compare June 24, 2025 12:16
@friendlymatthew
Copy link
Contributor Author

I'm going to document these changes after work. Will revert to a draft for now!

@friendlymatthew friendlymatthew marked this pull request as draft June 24, 2025 12:55
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/friendlier-getters-for-list-and-objects branch from d04117f to 270b5de Compare June 24, 2025 18:20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love panics... is there a reason out of bounds is None but any other problem is a panic? If anything, this is backward -- the caller could prevent an out of bounds panic by checking length before making this call. But they can't prevent a panic caused by invalid bytes in the buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the constructor currently validates the bytes, so this should actually be unreachable code.
Still, ok() seems cleaner?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh. And now I see outdated comments that recommended moving away from ok()...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this panic is ostensibly unreachable, would it be a good place for something like this?

let result = self.try_field_name(i);
debug_assert!(matches!(result, Ok(_)), "Unexpected validation failure: {result:?}");
result.ok()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe expect would be cleaner?

self.try_field_name(i)
  .expect("Unexpected validation failure")

The expect will cause a panic but will include the message and the underlying error message

Copy link
Contributor Author

@friendlymatthew friendlymatthew Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used expect and I agree with @scovich's comments about the err case being basically unreachable. But I imagine this code will change with #7711

Comment on lines 126 to 132
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. And if we do make the constructor less thorough then we have to be more careful in our implementations of get et al, because they could be encountering bad bytes for the first time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really needs try_trait_v2 feature to stabilize... then the passed in key extractor could decide the return type (Option vs. Result)

Comment on lines 274 to 301
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion on L290 fails? Or the field_name(3) call on L289 panics?

Looking at the code, it seems like a missing bounds check in field_name which reads garbage from the wrong part of the slice, which I guess must somehow cause a panic inside the metadata dictionary lookup?

... but this PR adds the missing bounds checks, so I would expect the panic went away? If so, it would be good to triage the actual panic (in metadata dictionary code), since the bounds check here only masks the real problem.

Comment on lines 142 to 145
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
match self.try_field(i) {
Ok(field) => Some(field),
Err(err) => panic!("validation error: {}", err),
}
self.try_field(i).expect("validation error after construction")

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/friendlier-getters-for-list-and-objects branch from b74fc78 to c98d636 Compare June 26, 2025 00:12
@friendlymatthew friendlymatthew marked this pull request as ready for review June 26, 2025 00:23
@friendlymatthew
Copy link
Contributor Author

Hi @alamb and @scovich,

This PR should be ready for review.

Some high level details about this PR:

@alamb
Copy link
Contributor

alamb commented Jun 26, 2025

Agree. And if we do make the constructor less thorough then we have to be more careful in our implementations of get et al, because they could be encountering bad bytes for the first time.

I added some comments to our getter methods about the panics. Since these panics are basically unreachable, I want to rethink validation: #7711

I think there are two distinct usecases and the validation behavior will likely differ between them:

  1. Code that wants to programatically traverse Variants in some arbitrary order (e.g. tests, converting to json, etc) that will likely touch most/all of the Variant values
  2. Code that wants to try to select a field from the Variant (eg. a query like variant_field(variant_column, "my.nested.field") or whatever.

I think our current behavior of Validate on construction works well for the first usecase

I think the second usecase will most likely prefer only validating the part of the variant on the path to retrieving the requested fied

Something like this:

// create a variant, validate on read
let variant = Variant::new_no_validation(&metadata, &value);
let field = variant.try_field("foo")?.try_field("bar")?

It would be on the user to choose the API that was appropriate (e.g. use of field() without first validating the variant can panic).

I think there is precident for this with rust slices:

  • calling arr[100] when the 100 is out of bounds will panic
  • Calling arr.get(100) does the bounds check

THoughts?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great to me -- thank you @friendlymatthew

@scovich
Copy link
Contributor

scovich commented Jun 26, 2025

Something like this:

// create a variant, validate on read
let variant = Variant::new_no_validation(&metadata, &value);
let field = variant.try_field("foo")?.try_field("bar")?

That could work, and there is certainly precedent as you say. We would need to surface the fallible versions of the methods, and the (currently private) fallible iterators as well, tho. And probably a validate method that does deep and/or shallow validation. And potentially we would want the underlying fallible iterator to use the no-validation constructor, and the validating constructor would then invoke that validation method on every returned child.

I'll try to take a stab at that, if I can scrape together some time.

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to stamp it before

@alamb
Copy link
Contributor

alamb commented Jun 27, 2025

I'll try to take a stab at that, if I can scrape together some time.

Thanks @scovich

I think improving the validation is tracked by

@alamb alamb merged commit 1fdb318 into apache:main Jun 27, 2025
11 of 12 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 27, 2025

Thanks again @scovich and @friendlymatthew -- Variants are looking pretty good

alamb added a commit that referenced this pull request Jun 27, 2025
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Follow on to #7757

# Rationale for this change

Clippy is failing after merging
#7757 due to a logical conflict
and a new clippy release:


https://github.com/apache/arrow-rs/actions/runs/15924256410/job/44917809525

```
error: variables can be used directly in the `format!` string
   --> parquet-variant/src/variant/list.rs:130:25
    |
130 |             Err(err) => panic!("validation error: {}", err),
    |                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#uninlined_format_args
    = note: `-D clippy::uninlined-format-args` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::uninlined_format_args)]`
help: change this to
    |
1[30](https://github.com/apache/arrow-rs/actions/runs/15924256410/job/44917809525#step:6:31) -             Err(err) => panic!("validation error: {}", err),
130 +             Err(err) => panic!("validation error: {err}"),
    |
```

# What changes are included in this PR?

Fix clippy to get CI clean on main

# Are these changes tested?
By CI

# Are there any user-facing changes?

No
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Improved API for accessing Variant Objects and lists

3 participants