Skip to content

Conversation

@yevgenypats
Copy link
Contributor

@yevgenypats yevgenypats commented Apr 16, 2023

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@yevgenypats yevgenypats requested a review from zeroshade as a code owner April 16, 2023 14:04
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue apache/arrow-go#56 has been automatically assigned in GitHub to PR creator.

Comment on lines -322 to +326
return err
data, err = hex.DecodeString(v)
if err != nil {
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems odd to add here. Should we also add this to AppendValueFromString and update the docs that you can also use hex to specify binary strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm....Prob it was in another branch and ended up here. will remove

Comment on lines +634 to +637
func stripNulls(s string) string {
return strings.ReplaceAll(s, "\x00", "")
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Arrow spec a String array should be all valid utf8. So shouldn't this only be for trimming trailing Nulls? (ie: use strings.TrimRight instead of strings.ReplaceAll )?

To this end, can you have a test with a string that contains a null inside it that isn't at the end and ensure that this approxequal only trims nulls, and doesn't strip them from within the string?

Copy link
Contributor Author

@yevgenypats yevgenypats Apr 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, wasn't aware of the valid utf8 specification that can't contain NULLs in the middle of the string. In that case there is no check in string.Append method to make sure there are no nulls inside the string. Should we add this check or string nulls inside the string? wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for performance reasons we don't validate the utf8 strings on Append currently and leave it up to the producer to ensure that they are passing valid utf-8 strings when constructing the array (if it's not valid utf-8 they should be using a Binary array).

On my list of things to do eventually is a "Validate" method for each array type like the C++ library has. That "Validate" method would do the UTF-8 validity check on the buffer so that a consumer can choose when they take the performance hit for validating the utf-8.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Apr 17, 2023
kodiakhq bot pushed a commit to cloudquery/plugin-sdk that referenced this pull request Apr 19, 2023
This should solve bunch of issues that we have with NullValues and StripNulls logic propagated to all our destination plugins. Apparently [Valid](apache/arrow#35161 (comment)) utf8 string can't have null values in the middle of the string so we should just strip those in our type system (it can have at the end or the start but I suggest actually stripping those down as well as it's really unnecessary as far as I know and it exist prob for some archaic reason like sending a string over the wire without any additional encoding/protocol, which we have via arrow).

Once we move sources to arrow we will add the validation/strip on the source side.
@zeroshade
Copy link
Member

@yevgenypats are you still planning on working on this? It still needs the that hex change removed etc.

@yevgenypats
Copy link
Contributor Author

Let's close this for now and I might open this once again a bit later, Im not sure if this is needed after your comment regarding nulls.

@yevgenypats yevgenypats closed this May 4, 2023
@candiduslynx candiduslynx deleted the feat/add_approx_equal_string branch June 2, 2023 11:57
@github-actions
Copy link

⚠️ GitHub issue #35160 has no components, please add labels for components.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ApproxEqual for strings

2 participants