Fun with UTF-8, PHP and MySQL
Posted by Neil on 14th June 2007. Last updated on 14th May 2008.
So you've read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and you're STILL getting characters like £ instead of £ or ’ instead of one of those "smart" quotes? Character set problems are seen quite commonly on the web, either when data has been incorrectly stored or the application is sending an incorrect (or missing) character set to the browser.
Why is it happening?
First, it's handy to know a bit about UTF-8. Skip this if you're already familiar.
UTF-8 uses one or more 8-bit bytes to store a single character, unlike ASCII and friends which use only one byte per character. It is more space-efficient than its cousins (UTF-16, UTF-32) when the majority of the characters can be encoded as a single byte, as is the case with most English text, but with the added benefit that you can still store any character under the sun should you need to. It uses the most significant bits of each byte as continuation bits (to signify that the following byte(s) form part of the same character). It is for this reason that improperly-displayed UTF-8 results in weird characters.
UTF-8 is backwards-compatible with ASCII — all characters up to 127 are identical in both encodings. This at least makes English text legible if the UTF-8 is interpreted incorrectly as ASCII or ISO 8859 character sets. However, it's these incorrect interpretations that cause the odd characters to appear.
As an example, let's take a pound sign (a real pound sign for you non-British types who call a hash a pound). In ISO 8859-1, the £ character has an ordinal value of 163 (0xA3 in hex) and by coincidence (or not), its Unicode code point is U+00A3. However, UTF-8 cannot store values above 127 in a single byte -- the encoding demands we use two. Omitting the grizzly details of the actual encoding process, you end up with the 2-byte sequence 0xC2A3, which just happens to correspond to the string "£" when expressed in ISO 8859-1.
Unfortunately, PHP doesn't yet support UTF-8 natively in its numerous string handling functions (version 6 will when released), but that doesn't mean you can't work with it -- you just have to be a bit careful. Let's take strlen() for example: with plain ASCII text, strlen() returns the number of characters in a string. It does this by counting the number of bytes used to hold the data. It doesn't know about (and cannot detect) UTF-8 and will blindly count the number of bytes, not the actual number of characters. Hence, the presence of any multibyte characters in your string will give you an incorrect length.
A problem you will inevitably face is when a user takes advantage of another application to create some text which gets pasted into your HTML form and submitted. Microsoft Word, for example, uses Unicode internally and converts characters like quotes and dashes into "smart quotes" and em- and en-dashes automatically. These are typographically correct, but the symbols lie outside the ASCII character set so when copied and pasted, the text is sent as UTF-8 and you end up with multibyte characters all over the place. If you store this text and later send it back to a browser without informing it that you are sending UTF-8, extra characters will appear.
Fixing the output
The solution is to make sure you set the character set correctly. This can be accomplished with a Content-type header (the preferred way) or with a meta tag in the head of your document:
Content-type: text/html; charset=utf-8 <meta http-equiv="content-type" content="text/html; charset=utf-8">
Beware that the HTTP headers supersede meta tags!
Storing text in MySQL tables
Now the browser knows to expect UTF-8, it can render the output properly. There is, however, another gotcha when using MySQL to store UTF-8 encoded text. MySQL 5 supports multiple character sets and you can set the default character set of a table when you create it, like so:
CREATE DATABASE `test` DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
If we create a table, UTF-8 will now be used as the default for all textual fields (CHAR, VARCHAR, TEXT, etc):
CREATE TABLE `utf8` (`info` TEXT NOT NULL);
So you're all geared up to store UTF-8 data, and you go ahead and insert some rows. Output to the browser looks fine. But is it?
MySQL needs to know the character set of the data you are sending to it in your queries. The default connection character set is ISO 8859-1 (latin1) -- treating all your supplied data as if it was ISO 8859-1, which is then automatically converted to the underlying character set of the column (UTF-8 in our case). Taking our earlier example, this means that the two-byte pound sign is perceived as two ISO 8859-1 characters: Â and £. MySQL will then encode these characters separately as UTF-8, requiring 2 bytes each -- a sort of double-encoding. We end up using 4 bytes to store a single pound sign! When selecting data from the table, the reverse occurs -- MySQL converts the UTF-8 back into ISO 8859-1, and the user's browser (correctly) interprets the two bytes as a pound sign. The problem here is that while everything looks correct, it's needlessly using extra storage space and CPU cycles in the conversion.
It's very simple to change the connection character set and avoid these overheads. The query "SET NAMES 'utf8'" instructs MySQL to treat incoming data as UTF-8, which can be directly inserted into columns with a matching charset. In practice, you may issue this query after initialising your connection to the MySQL server so all future communication will be in UTF-8.
One major problem occurs when you have a table full of double-encoded UTF-8 text because it was inserted before you knew about changing the connection character set. If you then add in the extra "SET NAMES" query and output the retrieved text to a browser, you will notice all the extra characters have crept in.
You might be tempted to be selective about where you use SET NAMES 'utf8', but it's quite easy to fix all of the data in place and end up with nice clean tables. Here's some pseudocode that pulls valid UTF-8 from the table and reinserts it after switching the connection charset:
SET NAMES 'latin1'; SELECT id, col1, col2, col3 FROM table; SET NAMES 'utf8'; for each row INSERT INTO table (col1, col2, col3) VALUES (, , ) WHERE id = ;
