feat: add js_trim() and mb_trim() compat by USERSATOSHI · Pull Request #9519 · WordPress/wordpress-develop

USERSATOSHI · 2025-08-19T07:26:49Z

PHP’s trim() function, by default, only strips a limited set of ASCII whitespace characters, and mb_trim(), introduced in PHP 8.4, does not behave identically to JavaScript’s String.prototype.trim().

This PR implements js_trim(), a PHP function that replicates JavaScript’s String.prototype.trim() behavior.

It works by defining a set of $js_trimmables characters, which are passed to mb_trim() with UTF-8 encoding.
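As a rough illustration of the approach described above, here is a minimal sketch. The PR's actual `$js_trimmables` list is not shown in this thread, so the character set below is an assumption taken from the ECMAScript WhiteSpace and LineTerminator definitions, and `sketch_js_trim()` is a hypothetical name. It uses `preg_replace()` with the `u` flag instead of `mb_trim()` so that it runs on PHP versions before 8.4.

```php
<?php
/**
 * Hypothetical sketch of a JavaScript-style trim for UTF-8 strings.
 * Not the PR's implementation.
 */
function sketch_js_trim( string $text ): string {
	// Assumed set: ECMAScript WhiteSpace plus LineTerminator code points.
	$js_whitespace = '\x{0009}-\x{000D}\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}'
		. '\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}\x{FEFF}';

	$pattern = '/^[' . $js_whitespace . ']+|[' . $js_whitespace . ']+$/uD';
	$trimmed = preg_replace( $pattern, '', $text );

	// With the "u" flag, preg_replace() returns null for invalid UTF-8 input.
	return null === $trimmed ? $text : $trimmed;
}

// PHP's default trim() leaves the NO-BREAK SPACE (U+00A0) in place.
var_dump( trim( "\u{00A0}hi" ) === "\u{00A0}hi" );
// The sketch removes it, along with the other code points in the set.
var_dump( sketch_js_trim( "\u{00A0}\u{FEFF} hi \u{2003}\n" ) === 'hi' );
```

Returning the input unchanged on invalid UTF-8 is a design choice made for this sketch only.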

In addition, this PR adds a polyfill for mb_trim() in compat.php to support PHP versions below 8.4, along with unit tests for both js_trim() and mb_trim().

Trac ticket: https://core.trac.wordpress.org/ticket/63804

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

github-actions · 2025-08-19T07:26:58Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props tusharbharti, dmsnell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

github-actions · 2025-08-19T07:42:37Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

dmsnell · 2026-04-02T14:07:41Z

src/wp-includes/compat.php

+		}
+
+		if ( 'UTF-8' !== $encoding ) {
+			$characters = mb_convert_encoding( $characters, 'UTF-8', $encoding );


I believe this will unintentionally corrupt the list of characters in every case that the code runs. is the $characters string not already UTF-8 by construction in the PHP source code?

so if we convert it from anything else we’ll be telling PHP to misunderstand the string and double-convert it?

I would imagine that if the $encoding is ISO-8859-1, for instance, that we would get something like â�¯ instead of NARROW NO-BREAK SPACE U+202F.
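The double conversion described above can be reproduced with a small sketch. `sketch_latin1_to_utf8()` is a hypothetical stand-in for `mb_convert_encoding( $s, 'UTF-8', 'ISO-8859-1' )`, written by hand so the example runs without the mbstring extension; it is not code from the PR.

```php
<?php
/**
 * Hypothetical stand-in for an ISO-8859-1 to UTF-8 conversion:
 * every input byte is treated as the code point of the same value.
 */
function sketch_latin1_to_utf8( string $bytes ): string {
	$out    = '';
	$length = strlen( $bytes );

	for ( $i = 0; $i < $length; $i++ ) {
		$byte = ord( $bytes[ $i ] );
		$out .= $byte < 0x80
			? chr( $byte )
			: chr( 0xC0 | ( $byte >> 6 ) ) . chr( 0x80 | ( $byte & 0x3F ) );
	}

	return $out;
}

// NARROW NO-BREAK SPACE U+202F, already encoded as UTF-8 in the source file.
$nnbsp = "\xE2\x80\xAF";

// Telling the converter that these bytes are ISO-8859-1 encodes them again.
$double_encoded = sketch_latin1_to_utf8( $nnbsp );

echo bin2hex( $nnbsp ), ' => ', bin2hex( $double_encoded ), "\n";
// prints "e280af => c3a2c280c2af"
```

The three original bytes become three separate code points (U+00E2, U+0080, U+00AF), so the converted list no longer contains U+202F at all.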

dmsnell

@USERSATOSHI although this looks sound from the function-call arguments, I would like to hear your thoughts on some of the ways it could interact with actual site data and the encodings of strings coming into it.

there could be an argument for requiring that all incoming strings be converted into UTF-8 before reaching this function.

dmsnell · 2026-04-02T14:12:05Z

src/wp-includes/compat.php

+
+		if ( 'UTF-8' !== $encoding ) {
+			$characters = mb_convert_encoding( $characters, 'UTF-8', $encoding );
+			$str        = mb_convert_encoding( $str, 'UTF-8', $encoding );


this line is a heavy lifter, and I generally encourage folks to disregard content if it’s not UTF-8 because the conversion here is more than likely to introduce corruption.

it may be less risky to check if the string is valid in its own encoding first…

if ( ! is_utf8_charset( $encoding ) && mb_check_encoding( $str, $encoding ) ) {
	$str = mb_convert_encoding( $str, 'UTF-8', $encoding );
} else {
	// REJECT!
}

but even in this case we run a large risk because most strings will validate as any of the single-byte encodings likely to be set on a real site, if not UTF-8.
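To make the risk concrete, the sketch below checks the same arbitrary bytes against UTF-8 and against ISO-8859-1. It assumes the mbstring extension is loaded and returns null otherwise; `sketch_validation_report()` is a hypothetical helper, not part of the PR.

```php
<?php
/**
 * Hypothetical helper: reports whether a byte string validates as
 * UTF-8 and as ISO-8859-1. Returns null when mbstring is unavailable.
 */
function sketch_validation_report( string $bytes ): ?array {
	if ( ! function_exists( 'mb_check_encoding' ) ) {
		return null;
	}

	return array(
		'UTF-8'      => mb_check_encoding( $bytes, 'UTF-8' ),
		'ISO-8859-1' => mb_check_encoding( $bytes, 'ISO-8859-1' ),
	);
}

// The UTF-8 encoding of U+202F followed by a stray 0xFF byte.
var_dump( sketch_validation_report( "\xE2\x80\xAF\xFF" ) );
```

Because ISO-8859-1 assigns a character to every byte value, the check passes for it even though the bytes were never Latin-1 text, which is why a passing validation says little about the string's real encoding.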

the primary source of non-UTF-8 is from legacy database tables, and it’s best to convert encodings at the point of demarcation when reading from the database. any other string sent here is almost certainly going to be in a different encoding than what is set for $encoding

also, I would guess that there is an extremely low likelihood that mb_internal_encoding() matches a site’s blog_charset or the encoding of the incoming text unless they are all UTF-8.

I see. That does make sense.

If I am not wrong, this should fix that?
I will also add a note on top of this on why we did this.

$str_utf8 = $str;

if ( ! is_null( $encoding ) && ! is_utf8_charset( $encoding ) ) {
	if ( ! mb_check_encoding( '', $encoding ) ) {
		// Unrecognised encoding: return unchanged.
		return $str;
	}

	if ( ! mb_check_encoding( $str, $encoding ) ) {
		// String does not validate in the given encoding: return unchanged.
		return $str;
	}

	$str_utf8 = mb_convert_encoding( $str, 'UTF-8', $encoding );
}

// Use preg_replace to trim the characters from both ends of the string.
// Both $characters and $str_utf8 are UTF-8 at this point.
$pattern        = '/^[' . preg_quote( $characters, '/' ) . ']+|[' . preg_quote( $characters, '/' ) . ']+$/uD';
$trimmed_string = preg_replace( $pattern, '', $str_utf8 );

if ( false === $trimmed_string || null === $trimmed_string ) {
	return $str; // If preg_replace fails, return the original string.
}

// Convert back to the original encoding if an explicit non-UTF-8 encoding was given.
if ( ! is_null( $encoding ) && ! is_utf8_charset( $encoding ) ) {
	$trimmed_string = mb_convert_encoding( $trimmed_string, $encoding, 'UTF-8' );
}

return $trimmed_string;

to be clearer, I was suggesting aborting if $encoding is not UTF-8 and avoiding the façade of re-encoding. PHP’s mb_trim() does support some re-encoding, but it’s also a minefield.

the $encoding parameter is only being used to segment code points in both strings, not to perform re-encoding. for example…

var_dump( mb_trim( "\xA92025\xA9", "\xEF\xBF\xC0\xA9", 'UTF-8' ) );
string(4) "2025"

var_dump( mb_trim( "\xC2\xA92025\xA9", "\xEF\xBF\xC0\xA9", 'UTF-8' ) );
string(6) "©2025"

here, the $characters string is split into three separate “maximal subpart” spans of invalid UTF-8 (\xEF\xBF, \xC0, and \xA9), and the input string is iterated on maximal subpart bounds, which is why a raw \xA9 is removed but the valid-UTF-8 copyright sign \xC2\xA9 isn’t.
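The segmentation described above can be sketched as a byte scanner that follows the well-formed UTF-8 byte ranges from the Unicode Standard. `sketch_utf8_spans()` is a hypothetical helper written for illustration; it is not taken from the PR or from PHP's own implementation, and it is only meant to show where the span boundaries fall.

```php
<?php
/**
 * Hypothetical helper: splits bytes into valid UTF-8 code points and
 * maximal subparts of ill-formed sequences.
 *
 * @return array[] Pairs of [ hex bytes, is valid code point ].
 */
function sketch_utf8_spans( string $bytes ): array {
	$spans = array();
	$end   = strlen( $bytes );
	$at    = 0;

	while ( $at < $end ) {
		$lead = ord( $bytes[ $at ] );
		$need = -1;   // Continuation bytes required; -1 marks an invalid lead byte.
		$low  = 0x80; // Allowed range for the next continuation byte.
		$high = 0xBF;

		if ( $lead <= 0x7F ) {
			$need = 0;
		} elseif ( $lead >= 0xC2 && $lead <= 0xDF ) {
			$need = 1;
		} elseif ( $lead >= 0xE0 && $lead <= 0xEF ) {
			$need = 2;
			if ( 0xE0 === $lead ) {
				$low = 0xA0;  // Rejects overlong three-byte forms.
			} elseif ( 0xED === $lead ) {
				$high = 0x9F; // Rejects surrogate code points.
			}
		} elseif ( $lead >= 0xF0 && $lead <= 0xF4 ) {
			$need = 3;
			if ( 0xF0 === $lead ) {
				$low = 0x90;  // Rejects overlong four-byte forms.
			} elseif ( 0xF4 === $lead ) {
				$high = 0x8F; // Rejects values above U+10FFFF.
			}
		}

		// Consume continuation bytes for as long as they stay well-formed.
		$length = 1;
		while ( $length <= $need && $at + $length < $end ) {
			$byte = ord( $bytes[ $at + $length ] );
			if ( $byte < $low || $byte > $high ) {
				break;
			}
			$low  = 0x80; // Only the first continuation byte has a narrowed range.
			$high = 0xBF;
			++$length;
		}

		$is_valid = $need >= 0 && $length === $need + 1;
		$spans[]  = array( bin2hex( substr( $bytes, $at, $length ) ), $is_valid );
		$at      += $length;
	}

	return $spans;
}

print_r( sketch_utf8_spans( "\xEF\xBF\xC0\xA9" ) );
```

For the `$characters` value from the example, this yields three invalid spans (`efbf`, `c0`, `a9`). For the input `"\xC2\xA92025\xA9"` it yields the valid code point `c2a9`, four ASCII digits, and a trailing invalid `a9`, which is the only span that matches an entry in the character list.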

JavaScript doesn’t have these same issues because every string in JavaScript is a sequence of UTF-16 code units. it’s not possible to represent the invalid bytes.

so I think if we attempt to “fix” or re-encode the strings we’re opening a door that only leads to even-more corruption than had we left it to be.

if we really want to polyfill mb_trim() then we need to remove mb_convert_encoding() and replace it with code that iterates code points. note: if mb_trim() is unavailable then it’s almost certain that all mb_ functions are missing.

WordPress can iterate over the UTF-8 code points, and over bytes in the ISO-8859 family, but I doubt it’s worth reproducing. we’d need a large set of functions to iterate over encodings we don’t want to support going forward — best to simply bail and inform developers with something like wp_trigger_error() than to invite more problems.

I see then, I will just inform the user about the non UTF-8 encoding.
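A minimal sketch of the "bail and inform" behaviour agreed on above, under stated assumptions: `sketch_mb_trim()` is a hypothetical name, `$characters` is required here rather than defaulted, a null `$encoding` is treated as UTF-8, and PHP's `trigger_error()` stands in for WordPress's `wp_trigger_error()` so that the example runs outside WordPress. It does not reproduce the code in the PR.

```php
<?php
/**
 * Hypothetical UTF-8-only trim that warns instead of re-encoding.
 */
function sketch_mb_trim( string $text, string $characters, ?string $encoding = null ): string {
	$is_utf8 = null === $encoding
		|| in_array( strtolower( $encoding ), array( 'utf-8', 'utf8' ), true );

	if ( ! $is_utf8 ) {
		// Stand-in for wp_trigger_error(); no conversion is attempted.
		trigger_error(
			'sketch_mb_trim() only supports UTF-8; returning the input unchanged.',
			E_USER_WARNING
		);
		return $text;
	}

	if ( '' === $characters ) {
		return $text;
	}

	$class   = preg_quote( $characters, '/' );
	$trimmed = preg_replace( '/^[' . $class . ']+|[' . $class . ']+$/uD', '', $text );

	// preg_replace() returns null when either string is not valid UTF-8.
	return null === $trimmed ? $text : $trimmed;
}

var_dump( sketch_mb_trim( "\u{202F}hi\u{202F}", "\u{202F}", 'UTF-8' ) === 'hi' );
```

Note that this sketch returns invalid UTF-8 input untouched, whereas the `mb_trim()` example earlier in the thread shows the native function removing a raw `\xA9`; matching that behaviour would require iterating spans of bytes rather than relying on the `u` flag.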


USERSATOSHI added 5 commits August 18, 2025 21:25

feat: add js_trim() and mb_trim() compat

17cd617

docs: change version to 6.9.0

7a54f09

refactor: fix phpcs errors

b33b6ce

Merge branch 'WordPress:trunk' into try/add-js-trim

b178472

refactor: fix phpcs errors in formatting

a58c949

USERSATOSHI added 2 commits August 22, 2025 16:35

tests: update tests comments

63a5ddc

Merge branch 'trunk' into try/add-js-trim

33d9a9c

dmsnell requested review from dmsnell and removed request for dmsnell August 31, 2025 05:29

dmsnell self-assigned this Aug 31, 2025

Merge branch 'WordPress:trunk' into try/add-js-trim

98dfbf3

dmsnell reviewed Apr 2, 2026


USERSATOSHI added 2 commits April 7, 2026 12:59

Fix mb_trim() polyfill to only support UTF-8 encoding and trigger warning error to inform the dev

1db297d

tests: skip tests if mbstring extension is present

b062abc

USERSATOSHI wants to merge 10 commits into WordPress:trunk from USERSATOSHI:try/add-js-trim
