Fifteen years ago the original MySQL db was setup with an unknown version. To achieve compatibility for the era I chose v 5.7 with the idea of updating to make it production ready. The old database dump was restored successfully albeit with some encoding issues. Notably, the post_content field was littered with artifacts likely from copy|paste operations when posts were originally created. Example:
"The last of 4 bowls turned from the Cherry Tree from Spalding Dr. Â Just a very nice bud vase."
The content was rife with this pesky  character and there were too many to one by one start editing posts.
The “” character is a common encoding issue that occurs when UTF-8 encoded content is incorrectly interpreted as Latin-1 (ISO-8859-1) encoding. This typically happens when there’s a non-breaking space (
or Unicode U+00A0) in the original content.
I tried running a query to remove it:
UPDATE wp_posts
SET post_content = REPLACE(post_content, 'Â', ' ')
WHERE post_type = 'post';
The  should be replaced by a regular space. However, windows terminal would not play nice and stripped out the  replacing it with %%.
Trying some other methods targeting certain HEX values:
UPDATE wp_posts
SET post_content = REPLACE(post_content, UNHEX('C2A0'), ' ')
WHERE post_content LIKE CONCAT('%', UNHEX('C2A0'), '%');
or specific byte sequences:
UPDATE wp_posts
SET post_content = REPLACE(post_content, CHAR(194, 160), ' ')
WHERE post_content REGEXP CHAR(194, 160);
returned an empty result. I was only guessing at the character origin that was manifesting as Â. But this worked:
SELECT post_title,
HEX(post_content) as hex_content,
LENGTH(post_content) as content_length
FROM wp_posts
WHERE post_title = 'Bowl 712 Cherry';
It returned:
-------------------------------------------------------------------------------------------------------+----------------+
| Bowl 712 Cherry | 3C6120687265663D2268747470733A2F2F6A7477672E64776462636E632E6E65742F77702D636F6E74656E742F75706C6F6164732F323031332F30332F426F776C2D3731322D4368657272792D572E6A7067223E3C696D6720636C6173733D22616C69676E6E6F6E652073697A652D6D656469756D2077702D696D6167652D39303422207469746C653D22426F776C203731312043686572727922207372633D2268747470733A2F2F6A7477672E64776462636E632E6E65742F77702D636F6E74656E742F75706C6F6164732F323031332F30332F426F776C2D3731322D4368657272792D572D313937783330302E6A70672220616C743D22222077696474683D2231393722206865696768743D2233303022202F3E3C2F613E0D0A0D0A546865206C617374206F66203420626F776C73207475726E65642066726F6D207468652043686572727920547265652066726F6D205370616C64696E672044722E20C382204A75737420612076657279206E6963652062756420766173652E | 381 |
| Bowl 712 Cherry | | 0 |
+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------
In case you’re wondering it was Claude AI stepping me through these operations and of course I couldn’t see the errant HEX code but Claude could…
Around “Spalding Dr.” and “Just”, I can see: ...44722E20C382204A757374...
Let me decode this:
44722E20
= “Dr. “C382
= This is the problematic character sequence!204A757374
= ” Just”
So the issue is the C382
bytes between “Dr. ” and ” Just”.
and so, the fix!
UPDATE wp_posts
SET post_content = REPLACE(post_content, UNHEX('C382'), '')
WHERE post_title = 'Bowl 712 Cherry';
Some forensic work after the exercise verifies:
C382
is considered as two separate bytes in UTF-8 encoding:
C3
is the first byte of a two-byte sequence.82
is the second byte of that sequence.- Together,
C3 82
decodes to the Unicode characterU+00C2
, which is the Latin Capital Letter A With Circumflex (Â
)
Could have just asked Google, What is the hex value of  ? Goggle Search Labs | AI Overview answer: The hexadecimal value for the character  (Latin capital letter A with circumflex) is 00C2 or C2. In UTF-8 encoding, it is represented as C3 82. Either would have got the job done. Good exercise though. The database has been cleaned up.