Restoring an Old Database

Fifteen years ago the original MySQL db was setup with an unknown version. To achieve compatibility for the era I chose v 5.7 with the idea of updating to make it production ready. The old database dump was restored successfully albeit with some encoding issues. Notably, the post_content field was littered with artifacts likely from copy|paste operations when posts were originally created. Example:

"The last of 4 bowls turned from the Cherry Tree from Spalding Dr. Â Just a very nice bud vase."

The content was rife with this pesky Â character and there were too many to one by one start editing posts.

The “Â” character is a common encoding issue that occurs when UTF-8 encoded content is incorrectly interpreted as Latin-1 (ISO-8859-1) encoding. This typically happens when there’s a non-breaking space (  or Unicode U+00A0) in the original content.

I tried running a query to remove it:

UPDATE wp_posts 
SET post_content = REPLACE(post_content, 'Â', ' ')
WHERE post_type = 'post';

The Â should be replaced by a regular space. However, windows terminal would not play nice and stripped out the Â replacing it with %%.

Trying some other methods targeting certain HEX values:

UPDATE wp_posts 
SET post_content = REPLACE(post_content, UNHEX('C2A0'), ' ')
WHERE post_content LIKE CONCAT('%', UNHEX('C2A0'), '%');

or specific byte sequences:

UPDATE wp_posts 
SET post_content = REPLACE(post_content, CHAR(194, 160), ' ')
WHERE post_content REGEXP CHAR(194, 160);

returned an empty result. I was only guessing at the character origin that was manifesting as Â. But this worked:

SELECT post_title, 
       HEX(post_content) as hex_content,
       LENGTH(post_content) as content_length
FROM wp_posts 
WHERE post_title = 'Bowl 712 Cherry';

It returned:

-------------------------------------------------------------------------------------------------------+----------------+
| Bowl 712 Cherry | 3C6120687265663D2268747470733A2F2F6A7477672E64776462636E632E6E65742F77702D636F6E74656E742F75706C6F6164732F323031332F30332F426F776C2D3731322D4368657272792D572E6A7067223E3C696D6720636C6173733D22616C69676E6E6F6E652073697A652D6D656469756D2077702D696D6167652D39303422207469746C653D22426F776C203731312043686572727922207372633D2268747470733A2F2F6A7477672E64776462636E632E6E65742F77702D636F6E74656E742F75706C6F6164732F323031332F30332F426F776C2D3731322D4368657272792D572D313937783330302E6A70672220616C743D22222077696474683D2231393722206865696768743D2233303022202F3E3C2F613E0D0A0D0A546865206C617374206F66203420626F776C73207475726E65642066726F6D207468652043686572727920547265652066726F6D205370616C64696E672044722E20C382204A75737420612076657279206E6963652062756420766173652E |            381 |
| Bowl 712 Cherry |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |              0 |
+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------

In case you’re wondering it was Claude AI stepping me through these operations and of course I couldn’t see the errant HEX code but Claude could…

Around “Spalding Dr.” and “Just”, I can see: ...44722E20C382204A757374...

Let me decode this:

44722E20 = “Dr. “
C382 = This is the problematic character sequence!
204A757374 = ” Just”

So the issue is the C382 bytes between “Dr. ” and ” Just”.

and so, the fix!

UPDATE wp_posts 
SET post_content = REPLACE(post_content, UNHEX('C382'), '')
WHERE post_title = 'Bowl 712 Cherry';

Some forensic work after the exercise verifies:

C382 is considered as two separate bytes in UTF-8 encoding:

C3 is the first byte of a two-byte sequence.
82 is the second byte of that sequence.
Together, C3 82 decodes to the Unicode character U+00C2, which is the Latin Capital Letter A With Circumflex (Â)

Could have just asked Google, What is the hex value of Â ? Goggle Search Labs | AI Overview answer: The hexadecimal value for the character Â (Latin capital letter A with circumflex) is 00C2 or C2. In UTF-8 encoding, it is represented as C3 82. Either would have got the job done. Good exercise though. The database has been cleaned up.

Strombotne Blog (since 2002)

Strombotne is an Historic Norwegian Farm

Leave a Reply