Restoring an Old Database

Fifteen years ago the original MySQL db was setup with an unknown version. To achieve compatibility for the era I chose v 5.7 with the idea of updating to make it production ready. The old database dump was restored successfully albeit with some encoding issues. Notably, the post_content field was littered with artifacts likely from copy|paste operations when posts were originally created. Example:

The content was rife with this pesky  character and there were too many to one by one start editing posts.

The “” character is a common encoding issue that occurs when UTF-8 encoded content is incorrectly interpreted as Latin-1 (ISO-8859-1) encoding. This typically happens when there’s a non-breaking space (  or Unicode U+00A0) in the original content.

I tried running a query to remove it:

UPDATE wp_posts 
SET post_content = REPLACE(post_content, 'Â', ' ')
WHERE post_type = 'post';

The  should be replaced by a regular space. However, windows terminal would not play nice and stripped out the  replacing it with %%.

Trying some other methods targeting certain HEX values:

UPDATE wp_posts 
SET post_content = REPLACE(post_content, UNHEX('C2A0'), ' ')
WHERE post_content LIKE CONCAT('%', UNHEX('C2A0'), '%');

or specific byte sequences:

UPDATE wp_posts 
SET post_content = REPLACE(post_content, CHAR(194, 160), ' ')
WHERE post_content REGEXP CHAR(194, 160);

returned an empty result. I was only guessing at the character origin that was manifesting as Â. But this worked:

SELECT post_title, 
       HEX(post_content) as hex_content,
       LENGTH(post_content) as content_length
FROM wp_posts 
WHERE post_title = 'Bowl 712 Cherry';

It returned:

-------------------------------------------------------------------------------------------------------+----------------+
| Bowl 712 Cherry | 3C6120687265663D2268747470733A2F2F6A7477672E64776462636E632E6E65742F77702D636F6E74656E742F75706C6F6164732F323031332F30332F426F776C2D3731322D4368657272792D572E6A7067223E3C696D6720636C6173733D22616C69676E6E6F6E652073697A652D6D656469756D2077702D696D6167652D39303422207469746C653D22426F776C203731312043686572727922207372633D2268747470733A2F2F6A7477672E64776462636E632E6E65742F77702D636F6E74656E742F75706C6F6164732F323031332F30332F426F776C2D3731322D4368657272792D572D313937783330302E6A70672220616C743D22222077696474683D2231393722206865696768743D2233303022202F3E3C2F613E0D0A0D0A546865206C617374206F66203420626F776C73207475726E65642066726F6D207468652043686572727920547265652066726F6D205370616C64696E672044722E20C382204A75737420612076657279206E6963652062756420766173652E |            381 |
| Bowl 712 Cherry |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |              0 |
+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------

In case you’re wondering it was Claude AI stepping me through these operations and of course I couldn’t see the errant HEX code but Claude could…

and so, the fix!

UPDATE wp_posts 
SET post_content = REPLACE(post_content, UNHEX('C382'), '')
WHERE post_title = 'Bowl 712 Cherry';

Some forensic work after the exercise verifies:

C382 is considered as two separate bytes in UTF-8 encoding:

  • C3 is the first byte of a two-byte sequence.
  • 82 is the second byte of that sequence.
  • Together, C3 82 decodes to the Unicode character U+00C2, which is the Latin Capital Letter A With Circumflex (Â)

Could have just asked Google, What is the hex value of  ? Goggle Search Labs | AI Overview answer: The hexadecimal value for the character  (Latin capital letter A with circumflex) is 00C2 or C2. In UTF-8 encoding, it is represented as C3 82. Either would have got the job done. Good exercise though. The database has been cleaned up.

Leave a Reply

Your email address will not be published. Required fields are marked *