@dust1 dust1 commented Feb 26, 2024

Rationale

Close #1300

Detailed Changes

Check whether a string is valid UTF-8 when inserting data.

Test Plan

pass

@dust1 dust1 changed the title Datum struct string type added utf8 check edit: ddtum struct string type added utf8 check Feb 26, 2024
@dust1 dust1 changed the title edit: ddtum struct string type added utf8 check edit: datum struct string type added utf8 check Feb 26, 2024

dust1 commented Feb 27, 2024

I forgot the tests. I'll try to add a few more unit tests later.

}
}

fn valid_is_utf8(s: &str) -> Result<()> {
Contributor

This check may be expensive for long strings. It would be better to add an option that decides whether to run it.
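A minimal sketch of such an option, for illustration only. `InsertOptions`, `validate_utf8`, and `accept_string_payload` are hypothetical names that do not come from the PR, and the sketch assumes the payload arrives as raw bytes.

```rust
use std::str;

/// Hypothetical options struct; the name and field are assumptions,
/// not taken from the project.
#[derive(Debug, Clone, Copy)]
pub struct InsertOptions {
    /// When true, string payloads are validated as UTF-8 on insert.
    pub validate_utf8: bool,
}

/// Returns true if the payload may be inserted under `opts`.
/// Validation is a linear scan over the bytes, so it is skipped
/// entirely when the option is turned off.
pub fn accept_string_payload(bytes: &[u8], opts: InsertOptions) -> bool {
    if !opts.validate_utf8 {
        return true;
    }
    str::from_utf8(bytes).is_ok()
}
```

With the flag off, the cost of the scan is avoided, at the price of letting invalid bytes through to later stages.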

}

fn valid_is_utf8(s: &str) -> Result<()> {
    from_utf8(s.as_bytes()).context(InvalidStringEncoding { msg: s })?;
Contributor

IMO this function should return a bool, not a Result.

Contributor Author

ok
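A rough sketch of the bool-returning shape suggested in this thread. The names `is_valid_utf8` and `check_string_datum` are hypothetical, and a plain `String` error stands in for the project's own error type. The sketch takes `&[u8]` rather than `&str`, since a `&str` is already valid UTF-8 by construction.

```rust
use std::str::from_utf8;

/// Bool-returning variant of the check. It takes raw bytes (an assumption
/// of this sketch) because a `&str` is already valid UTF-8 by construction.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    from_utf8(bytes).is_ok()
}

/// The caller decides how to report a failure; here a plain `String`
/// error stands in for the project's own error type.
fn check_string_datum(bytes: &[u8]) -> Result<(), String> {
    if is_valid_utf8(bytes) {
        Ok(())
    } else {
        Err(format!(
            "invalid string encoding: {} bytes are not UTF-8",
            bytes.len()
        ))
    }
}
```

Keeping the predicate separate from the error construction lets each call site choose its own error context.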

dust1 commented Feb 27, 2024

I checked the official Rust documentation. Datum objects in datum.rs are built from String, and Rust guarantees that a String is valid UTF-8, so I might need to make the change elsewhere. 😢
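The guarantee mentioned here can be seen with the standard library alone. The helper names below are hypothetical; `String::from_utf8` and `String::from_utf8_lossy` are the standard safe constructors from bytes.

```rust
/// Safe construction: invalid bytes are rejected, so any `String` that
/// exists in safe code holds valid UTF-8.
fn string_from_untrusted(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

/// Lossy construction: invalid sequences are replaced with U+FFFD,
/// so the result is valid UTF-8 but may differ from the input.
fn string_from_untrusted_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

This suggests that a check placed after a `String` has already been built cannot fail in safe code, so invalid data would have to enter through a byte-level or unchecked path.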

@jiacai2050
Contributor

> Rust guarantees that a String is valid UTF-8, so I might need to make the change elsewhere. 😢

Yes, I grepped the code and found several places containing from_bytes_unchecked(bytes: Bytes).

To debug this issue, you can construct a GBK-encoded string using the SDK and trace why no error is raised for it.
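As an illustration of why GBK input is a useful probe, the sketch below uses the GBK encoding of "中文", which is not valid UTF-8. `RawString` and its methods are hypothetical stand-ins, not the project's types. In particular, this `from_bytes_unchecked` only mirrors the name mentioned above and is not the project's implementation.

```rust
use std::str::{from_utf8, Utf8Error};

/// "中文" encoded as GBK. These four bytes are not a valid UTF-8 sequence.
const GBK_SAMPLE: [u8; 4] = [0xD6, 0xD0, 0xCE, 0xC4];

/// Hypothetical stand-in for a byte-backed string type.
struct RawString(Vec<u8>);

impl RawString {
    /// Unchecked path: stores the bytes as given, so bad input is not
    /// noticed here and only surfaces when something later validates it.
    fn from_bytes_unchecked(bytes: Vec<u8>) -> Self {
        RawString(bytes)
    }

    /// Checked path: rejects invalid UTF-8 at construction time.
    fn try_from_bytes(bytes: Vec<u8>) -> Result<Self, Utf8Error> {
        from_utf8(&bytes)?;
        Ok(RawString(bytes))
    }

    /// Validates lazily, which is where a late error would appear.
    fn as_str(&self) -> Result<&str, Utf8Error> {
        from_utf8(&self.0)
    }
}
```

Feeding `GBK_SAMPLE` through the unchecked constructor succeeds silently, while the checked constructor rejects it immediately. That difference is one plausible reason an error could show up far from the place where the bad bytes entered.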

dust1 commented Mar 4, 2024

> Rust guarantees that a String is valid UTF-8, so I might need to make the change elsewhere. 😢
>
> Yes, I grepped the code and found several places containing from_bytes_unchecked(bytes: Bytes).
>
> To debug this issue, you can construct a GBK-encoded string using the SDK and trace why no error is raised for it.

Ok, I'll try

dust1 commented Mar 14, 2024

The from_bytes_unchecked function is only called when decoding. I think what I should be looking for is why non-UTF-8 bytes are saved when encoding. I'll find out later 😵
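One way to picture validating on the encode side is the sketch below. The length-prefixed layout and the names `encode_string` and `decode_string` are assumptions made for illustration and do not reflect the project's actual encoding.

```rust
use std::str::{from_utf8, Utf8Error};

/// Appends a length-prefixed string payload to `buf`, refusing bytes that
/// are not valid UTF-8 so the decoder never sees them.
fn encode_string(buf: &mut Vec<u8>, payload: &[u8]) -> Result<(), Utf8Error> {
    // Validate before writing anything, so a failure leaves `buf` untouched.
    from_utf8(payload)?;
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    Ok(())
}

/// Reads back one payload written by `encode_string`.
/// Returns `None` if the buffer is truncated or the payload is not valid UTF-8.
fn decode_string(buf: &[u8]) -> Option<&str> {
    let header = buf.get(..4)?;
    let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    let end = 4usize.checked_add(len)?;
    from_utf8(buf.get(4..end)?).ok()
}
```

If every write path validates like this, an unchecked decode would only ever see bytes that already passed the check. Whether the project's encoder has such a gap is exactly what remains to be traced.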

Development

Successfully merging this pull request may close these issues.

Parquet argument error: Parquet error: encountered non UTF-8 data

2 participants