@s3ich4n s3ich4n commented Jan 21, 2026

When charset detection samples only the first 4096 bytes and detects ascii, but the file contains UTF-8 characters beyond that point, decoding fails with UnicodeDecodeError.

Added a fallback to charset_normalizer when UnicodeDecodeError occurs, so that files with non-ASCII characters (Spanish, Korean, Japanese, Chinese, etc.) that appear after the 4096-byte sample are decoded correctly.
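
A minimal sketch of the fallback idea, not the actual diff in this PR: `decode_with_fallback` and its parameters are hypothetical names chosen for illustration. It assumes the detected charset comes from sampling only the start of the file, tries that charset first, and only consults `charset_normalizer.from_bytes` on the full content if decoding raises `UnicodeDecodeError`. If charset_normalizer is not installed, the sketch falls back to UTF-8 with replacement characters, which is an assumption of this example rather than behavior described in the PR.

```python
try:
    # Third-party library named in the PR description.
    from charset_normalizer import from_bytes
except ImportError:  # assumption: degrade gracefully when it is missing
    from_bytes = None


def decode_with_fallback(data: bytes, detected_charset: str = "ascii") -> str:
    """Decode `data` with the sampled charset, re-detecting on failure.

    Hypothetical helper for illustration; not MarkItDown's real API.
    """
    try:
        # Fast path: trust the charset guessed from the leading sample.
        return data.decode(detected_charset)
    except UnicodeDecodeError:
        # The sample was misleading, so inspect the whole payload instead.
        if from_bytes is not None:
            best = from_bytes(data).best()
            if best is not None:
                return str(best)  # str() of a match is the decoded text
        # Last resort (assumption): UTF-8, replacing undecodable bytes.
        return data.decode("utf-8", errors="replace")


# Pure-ASCII prefix longer than 4096 bytes, followed by Spanish text.
ascii_prefix = b"The quick brown fox jumps over the lazy dog. " * 120
payload = ascii_prefix + "El niño comió en España.".encode("utf-8")

text = decode_with_fallback(payload, detected_charset="ascii")
```

Here the first 4096 bytes of `payload` are valid ASCII, so a sample-based detector can reasonably report `ascii`, while decoding the full payload with that codec fails and triggers the fallback path.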

fixes #1505


s3ich4n commented Jan 21, 2026

@microsoft-github-policy-service agree

Development

Successfully merging this pull request may close these issues.

Bug with Spanish symbols: PlainTextConverter threw UnicodeDecodeError with message: 'ascii' codec can't decode byte