-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-35160: [Go] Add arrayApproxEqualString to ignore null chars #35161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
| return err | ||
| data, err = hex.DecodeString(v) | ||
| if err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems odd to add here. Should we also add this to AppendValueFromString and update the docs that you can also use hex to specify binary strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm....Prob it was in another branch and ended up here. will remove
| func stripNulls(s string) string { | ||
| return strings.ReplaceAll(s, "\x00", "") | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the Arrow spec a String array should be all valid utf8. So shouldn't this only be for trimming trailing Nulls? (ie: use strings.TrimRight instead of strings.ReplaceAll )?
To this end, can you have a test with a string that contains a null inside it that isn't at the end and ensure that this approxequal only trims nulls, and doesn't strip them from within the string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, wasn't aware of the valid utf8 specification that can't contain NULLs in the middle of the string. In that case there is no check in string.Append method to make sure there are no nulls inside the string. Should we add this check or string nulls inside the string? wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for performance reasons we don't validate the utf8 strings on Append currently and leave it up to the producer to ensure that they are passing valid utf-8 strings when constructing the array (if it's not valid utf-8 they should be using a Binary array).
On my list of things to do eventually is a "Validate" method for each array type like the C++ library has. That "Validate" method would do the UTF-8 validity check on the buffer so that a consumer can choose when they take the performance hit for validating the utf-8.
This should solve bunch of issues that we have with NullValues and StripNulls logic propagated to all our destination plugins. Apparently [Valid](apache/arrow#35161 (comment)) utf8 string can't have null values in the middle of the string so we should just strip those in our type system (it can have at the end or the start but I suggest actually stripping those down as well as it's really unnecessary as far as I know and it exist prob for some archaic reason like sending a string over the wire without any additional encoding/protocol, which we have via arrow). Once we move sources to arrow we will add the validation/strip on the source side.
|
@yevgenypats are you still planning on working on this? It still needs the that hex change removed etc. |
|
Let's close this for now and I might open this once again a bit later, Im not sure if this is needed after your comment regarding nulls. |
|
|
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?