How to configure your web application to correctly deal with character set/encoding issues
My interest in character set/encoding issues started way back in college, when I maintained the school newspaper website. I noticed that text I cut and pasted into the WYSIWYG editor I was using would sometimes show up as garbage or question marks.
Let me tell you...if you've never had the pleasure of troubleshooting such issues, consider yourself lucky. There's nothing like trying to fix, much less define, a problem that you can't even see. That's why it's taken me years of scouring the internet to finally get to the point where I feel like I have my mind wrapped around the whole issue.
So below I'll attempt to lay out everything I've learned as concisely as possible. I'll talk specifically about PHP/MySQL, but much of the information applies to any platform.
Background
I'll just link to the best places I've found to read up on the background. Read these first if you want to understand what's going on:
The Details
In order to get your website working harmoniously with one character set, you first have to pick one. I picked ISO-8859-1 (sometimes referred to as latin1). It's the most common single-byte character set for English and other Western European languages, though it cannot represent characters from most other scripts.
Unfortunately, despite the fact that Unicode is the be-all and end-all, PHP is really horrible at dealing with Unicode (specifically multi-byte strings). Since I've got enough to worry about without also having to check every function I'm using for multi-byte compatibility, ISO-8859-1 is right for me, for now. I hear PHP6 will fully support multi-byte strings.
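To make the multi-byte problem concrete, here is a minimal sketch comparing PHP's byte-oriented strlen() with mb_strlen(). It assumes the mbstring extension is loaded; the sample word and variable names are just for illustration.

```php
<?php
// "é" is two bytes in UTF-8 (0xC3 0xA9) but a single byte in ISO-8859-1 (0xE9).
$utf8   = "caf\xC3\xA9";
$latin1 = "caf\xE9";

// strlen() counts bytes, so the UTF-8 string looks one character too long.
$byteLength = strlen($utf8);

// mb_strlen() counts characters once it is told the encoding.
$charLength = mb_strlen($utf8, 'UTF-8');

// In a single-byte character set, bytes and characters coincide.
$latin1Length = strlen($latin1);

printf("%d %d %d\n", $byteLength, $charLength, $latin1Length); // prints "5 4 4"
```

Every byte-oriented string function (substr(), strtoupper(), and so on) has the same blind spot, which is why sticking to a single-byte set avoids the audit.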
In order to get your web application working correctly with one character set, there are basically three parts you must take care of:
- The database
- The web server
- The web page itself
Database config
In a perfect world, you would compile your database to default to your preferred character set. I won't cover that here.
Assuming you cannot do that, you need to create your tables using the correct set. Keep in mind that, at least in MySQL, the character set can be configured all the way down to the column level. Here we will just set it at the database level, using latin1, the MySQL character set that corresponds to the ISO-8859-1 choice above. (See here for much more detail):
CREATE DATABASE db_name CHARACTER SET latin1 COLLATE latin1_swedish_ci;
Next, we need to configure how the database delivers the results of queries to the calling application. Assuming you don't (or can't) ensure this is done at compile time, you can use the following command before sending a query:
SET NAMES charset
According to the linked article above, this is shorthand for setting the character_set_client, character_set_connection, and character_set_results variables in MySQL (collation_connection is then set implicitly to the default collation of that character set). A good place to put this query is in the constructor of your database access class.
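As a rough sketch of what that constructor might look like, here is a small class built on PHP's mysqli extension. The class name, the setNamesStatement() helper, and the allow-list of character sets are all hypothetical, made up for this illustration; adapt them to your own access layer.

```php
<?php
// Hypothetical database access class: names and the allow-list are illustrative.
class Database
{
    // Character sets this application is willing to request.
    private const ALLOWED_CHARSETS = ['latin1', 'utf8', 'utf8mb4'];

    private $connection;

    public function __construct(
        string $host,
        string $user,
        string $password,
        string $schema,
        string $charset = 'latin1'
    ) {
        // Validate the charset first so a bad value fails before connecting.
        $setNames = self::setNamesStatement($charset);

        $this->connection = new mysqli($host, $user, $password, $schema);
        if ($this->connection->connect_errno) {
            throw new RuntimeException('Connect failed: ' . $this->connection->connect_error);
        }

        // Issue SET NAMES once, before any other query runs on this connection.
        if (!$this->connection->query($setNames)) {
            throw new RuntimeException('SET NAMES failed: ' . $this->connection->error);
        }
    }

    // Builds the SET NAMES statement, refusing anything not on the allow-list
    // because the value is concatenated straight into SQL.
    public static function setNamesStatement(string $charset): string
    {
        if (!in_array($charset, self::ALLOWED_CHARSETS, true)) {
            throw new InvalidArgumentException("Unsupported character set: $charset");
        }
        return 'SET NAMES ' . $charset;
    }
}
```

If you are on mysqli anyway, calling $connection->set_charset('latin1') instead of sending the query yourself is worth considering, since the PHP manual describes it as the preferred way to change the connection character set.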
Web Server
Assuming you are using Apache, you need to edit a setting in your httpd.conf file (the primary Apache config file).
AddDefaultCharset charset
Set this to the character set you want Apache to declare to the client when a response does not specify one. (This is important, because the client will generally trust the character set in the HTTP header even if a different one is specified in the document.)
Also, whatever server-side language you're using will have a way to set the character set in outgoing headers. In PHP you use the header() function. See the PHP documentation for details.
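For PHP, a minimal sketch of sending that header might look like the following. The contentTypeHeader() helper is a hypothetical name used here only to keep the header string in one place; header() and headers_sent() are the standard PHP functions.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for an HTML page.
function contentTypeHeader(string $charset): string
{
    // Charset labels are simple tokens; reject anything that could smuggle
    // extra header content (spaces, line breaks, semicolons).
    if (!preg_match('/^[A-Za-z0-9._:-]+$/', $charset)) {
        throw new InvalidArgumentException("Invalid charset label: $charset");
    }
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before any output has been sent to the browser.
if (!headers_sent()) {
    header(contentTypeHeader('ISO-8859-1'));
}
```

Put the call at the very top of the script (or in a shared include), before any HTML or whitespace is echoed.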
The web page
Finally, you should specify the character set of the page you're serving with a meta tag near the top of the page, just below the opening <head> tag. It looks like this:
<meta http-equiv="Content-Type" content="text/html; charset=charset" />
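If your pages are generated by PHP, one way to keep the meta tag from drifting out of sync with the rest of your configuration is to render it from a single value. This is only a sketch; the metaCharsetTag() function name is an assumption made for the example.

```php
<?php
// Hypothetical helper: renders the charset meta tag from one shared value.
function metaCharsetTag(string $charset): string
{
    // Escape the value before placing it inside an HTML attribute.
    $safe = htmlspecialchars($charset, ENT_QUOTES, 'ISO-8859-1');
    return '<meta http-equiv="Content-Type" content="text/html; charset=' . $safe . '" />';
}

$tag = metaCharsetTag('ISO-8859-1');
echo $tag, "\n";
```

Echo the result inside the head section of your template, and pass the same charset string you use for the HTTP header.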
Getting obsessive
Most browsers support specifying the character set that a form should accept (and what the browser will convert the text to if it is not correct). To manually set this, include the following attribute in the