feat: add js_trim() and mb_trim() compat #9519
USERSATOSHI wants to merge 10 commits into WordPress:trunk
src/wp-includes/compat.php
Outdated
    }

    if ( 'UTF-8' !== $encoding ) {
        $characters = mb_convert_encoding( $characters, 'UTF-8', $encoding );
I believe this will unintentionally corrupt the list of characters in every case that the code runs. is the $characters string not already UTF-8 by construction in the PHP source code?
so if we convert it from anything else we’ll be telling PHP to misunderstand the string and double-convert it?
I would imagine that if the $encoding is ISO-8859-1, for instance, that we would get something like � instead of NARROW NO-BREAK SPACE U+202F.
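The double-conversion concern can be reproduced in isolation. The snippet below is only an illustrative sketch, not code from the PR; it assumes the mbstring extension is loaded, and the variable names are made up for the example. It takes a string that is already UTF-8 and converts it again as if it were ISO-8859-1.

```php
<?php
// NARROW NO-BREAK SPACE, already encoded as UTF-8 (bytes E2 80 AF).
$nnbsp = "\u{202F}";

// Telling PHP the bytes are ISO-8859-1 re-encodes each byte as its own code point.
$double_converted = mb_convert_encoding( $nnbsp, 'UTF-8', 'ISO-8859-1' );

$original_hex  = bin2hex( $nnbsp );            // "e280af"
$corrupted_hex = bin2hex( $double_converted ); // "c3a2c280c2af"

echo $original_hex, "\n", $corrupted_hex, "\n";
```

The three original bytes become three separate code points (U+00E2, U+0080, U+00AF), so the converted character list no longer contains U+202F at all.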
dmsnell left a comment
@USERSATOSHI although this looks sound from the function-call arguments, I would like to hear your thoughts on some of the ways it could interact with actual site data and the encodings of strings coming into it.
there could be an argument for requiring that all incoming strings be converted into UTF-8 before reaching this function.
src/wp-includes/compat.php
Outdated
    if ( 'UTF-8' !== $encoding ) {
        $characters = mb_convert_encoding( $characters, 'UTF-8', $encoding );
        $str = mb_convert_encoding( $str, 'UTF-8', $encoding );
this line is a heavy lifter, and I generally encourage folks to disregard content if it’s not UTF-8 because the conversion here is more than likely to introduce corruption.
it may be less risky to check if the string is valid in its own encoding first…
    if (
        ! is_utf8_charset( $encoding ) &&
        mb_check_encoding( $str, $encoding )
    ) {
        $str = mb_convert_encoding( $str, 'UTF-8', $encoding );
    } else {
        // REJECT!
    }

but even in this case we run a large risk because most strings will validate as any of the single-byte encodings likely to be set on a real site, if not UTF-8.
the primary source of non-UTF-8 is from legacy database tables, and it’s best to convert encodings at the point of demarcation when reading from the database. any other string sent here is almost certainly going to be in a different encoding than what is set for $encoding
also, I would guess that there is an extremely low likelihood that mb_internal_encoding() matches a site’s blog_charset or the encoding of the incoming text unless they are all UTF-8.
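To illustrate why validating against a single-byte encoding offers little protection, here is a small sketch (not from the PR; it assumes mbstring is available, and the sample text is invented). A UTF-8 string passes the ISO-8859-1 check because every byte value is a valid ISO-8859-1 character, and the conversion then silently changes it.

```php
<?php
// UTF-8 text containing a two-byte and a three-byte character.
$utf8_text = "caf\u{00E9}\u{202F}2025";

// Every possible byte is valid ISO-8859-1, so this check cannot catch the mismatch.
$passes_check = mb_check_encoding( $utf8_text, 'ISO-8859-1' );

// The "validated" conversion therefore proceeds and alters the bytes.
$converted   = mb_convert_encoding( $utf8_text, 'UTF-8', 'ISO-8859-1' );
$was_altered = ( $converted !== $utf8_text );

var_dump( $passes_check, $was_altered );
```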
I see. That does make sense.
If I am not wrong, this should fix that?
I will also add a note on top of this on why we did this.
    $str_utf8 = $str;
    if ( ! is_null( $encoding ) && ! is_utf8_charset( $encoding ) ) {
        if ( ! mb_check_encoding( '', $encoding ) ) {
            // Unrecognised encoding: return unchanged.
            return $str;
        }
        if ( ! mb_check_encoding( $str, $encoding ) ) {
            // String does not validate in the given encoding: return unchanged.
            return $str;
        }
        $str_utf8 = mb_convert_encoding( $str, 'UTF-8', $encoding );
    }

    // Use preg_replace to trim the characters from both ends of the string.
    // Both $characters and $str_utf8 are UTF-8 at this point.
    $pattern = '/^[' . preg_quote( $characters, '/' ) . ']+|[' . preg_quote( $characters, '/' ) . ']+$/uD';
    $trimmed_string = preg_replace( $pattern, '', $str_utf8 );
    if ( false === $trimmed_string || null === $trimmed_string ) {
        return $str; // If preg_replace fails, return the original string.
    }

    // Convert back to the original encoding if an explicit non-UTF-8 encoding was given.
    if ( ! is_null( $encoding ) && ! is_utf8_charset( $encoding ) ) {
        $trimmed_string = mb_convert_encoding( $trimmed_string, $encoding, 'UTF-8' );
    }
    return $trimmed_string;
to be clearer, I was suggesting aborting if $encoding is not UTF-8 and avoiding the façade of re-encoding. PHP’s mb_trim() does support some re-encoding, but it’s also a minefield.
the $encoding parameter is only being used to segment code points in both strings, not to perform re-encoding. for example…
    var_dump( mb_trim( "\xA92025\xA9", "\xEF\xBF\xC0\xA9", 'UTF-8' ) );
    string(4) "2025"

    var_dump( mb_trim( "\xC2\xA92025\xA9", "\xEF\xBF\xC0\xA9", 'UTF-8' ) );
    string(6) "©2025"

here, the $characters array is split into three separate “maximal subpart” spans of invalid UTF-8, and the input string is iterated on maximal sub-part bounds, which is why a raw \xA9 is removed but the valid-UTF-8 copyright sign \xC2\xA9 isn’t.
JavaScript doesn’t have these same issues because every string in JavaScript is a sequence of UTF-16 code units. it’s not possible to represent the invalid bytes.
so I think if we attempt to “fix” or re-encode the strings we’re opening a door that only leads to even more corruption than if we had left them alone.
if we really want to polyfill mb_trim() then we need to remove mb_convert_encoding() and replace it with code that iterates code points. note: if mb_trim() is unavailable then it’s almost certain that all mb_ functions are missing.
WordPress can iterate over the UTF-8 code points, and over bytes in the ISO-8859 family, but I doubt it’s worth reproducing. we’d need a large set of functions to iterate over encodings we don’t want to support going forward — best to simply bail and inform developers with something like wp_trigger_error() than to invite more problems.
I see. In that case, I will just inform the user about the non-UTF-8 encoding.
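For reference, the "bail and inform" approach discussed in this thread might look roughly like the sketch below. It is not the PR's implementation: `example_utf8_only_trim()` is a hypothetical name, the accepted encoding spellings are an assumption, and plain `trigger_error()` stands in for `wp_trigger_error()` so the snippet runs outside WordPress. It uses `preg_replace()` with the `u` flag rather than any mb_ function.

```php
<?php
/**
 * Hypothetical helper: trims $characters from both ends of $text,
 * but only when the declared encoding is UTF-8.
 */
function example_utf8_only_trim( string $text, string $characters, string $encoding = 'UTF-8' ): string {
	// Assumption: only these spellings count as UTF-8; anything else is rejected.
	if ( ! in_array( strtolower( $encoding ), array( 'utf-8', 'utf8' ), true ) ) {
		// Inside WordPress, wp_trigger_error() would be the natural call here.
		trigger_error( 'example_utf8_only_trim() only supports UTF-8.', E_USER_NOTICE );
		return $text;
	}

	if ( '' === $characters ) {
		return $text;
	}

	$class   = '[' . preg_quote( $characters, '/' ) . ']';
	$pattern = '/^' . $class . '+|' . $class . '+$/uD';
	$trimmed = preg_replace( $pattern, '', $text );

	// preg_replace() returns null if the pattern or subject is not valid UTF-8;
	// in that case the input is handed back untouched instead of being "repaired".
	return is_string( $trimmed ) ? $trimmed : $text;
}
```

With this shape, no re-encoding ever happens: valid UTF-8 input is trimmed, and anything else is returned unchanged.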
PHP’s trim() function, by default, only strips a limited set of ASCII whitespace characters, and mb_trim(), introduced in PHP 8.4, does not behave identically to JavaScript’s String.prototype.trim().
This PR implements js_trim(), a PHP function that replicates JavaScript’s String.prototype.trim() behavior.
It works by defining a set of $js_trimmables characters, which are passed to mb_trim() with UTF-8 encoding.
In addition, this PR adds a polyfill for mb_trim() in compat.php to support PHP versions below 8.4, with unit tests for both js_trim() and mb_trim().
Trac ticket: https://core.trac.wordpress.org/ticket/63804
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.
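As a rough illustration of the behavior js_trim() is meant to replicate, the sketch below strips the characters that String.prototype.trim() removes (ECMAScript WhiteSpace plus LineTerminator code points). It is not the PR's code: `example_js_trim()` is a hypothetical name, and it uses `preg_replace()` instead of `mb_trim()` so that it runs on PHP versions below 8.4.

```php
<?php
/**
 * Hypothetical sketch of JavaScript-style trimming for UTF-8 strings.
 */
function example_js_trim( string $text ): string {
	// TAB, LF, VT, FF, CR, SPACE, NBSP, the Unicode space separators,
	// LINE/PARAGRAPH SEPARATOR, and the BOM (U+FEFF).
	$js_whitespace = '\x{0009}-\x{000D}\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}'
		. '\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}\x{FEFF}';

	$pattern = '/^[' . $js_whitespace . ']+|[' . $js_whitespace . ']+$/uD';
	$trimmed = preg_replace( $pattern, '', $text );

	// Invalid UTF-8 makes preg_replace() return null; return the input unchanged.
	return is_string( $trimmed ) ? $trimmed : $text;
}

// PHP's trim() leaves a leading NO-BREAK SPACE in place; the sketch removes it.
$php_result = trim( "\u{00A0}hi" );
$js_result  = example_js_trim( "\u{00A0}hi" );
```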