When we started building DropSend, we decided to support all languages worldwide from the start. The interface is currently in English only, but the application can send, store, sort and process your data in whatever language you want. As a result, we have a good number of customers out east.
To support worldwide languages, you need to use UTF-8 encoding for your web pages, emails and application, rather than ISO 8859-1 or another common western encoding, since these don't support characters used in languages such as Japanese and Chinese.
Happily, UTF-8 encodes the core Latin (ASCII) character set identically, so you won't need to convert plain ASCII data to start using UTF-8. (Accented characters stored as ISO 8859-1 do need converting.) But there are a number of other issues to deal with. In particular, UTF-8 is a multibyte encoding, meaning one character can be represented by one or more bytes. This causes trouble for PHP, because the language parses and processes strings based on bytes, not characters, and makes mincemeat of multibyte strings - for example, by splitting characters 'in half', bodging up regular expressions, and rendering email unreadable.
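To make the byte-versus-character problem concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and the script file itself is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// Assumption: mbstring is installed and this file is saved as UTF-8.
$word = "日本語"; // three characters, three bytes each in UTF-8

$byteCount = strlen($word);             // counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // counts characters

// Cutting at byte 4 lands inside the second character.
$brokenCut = substr($word, 0, 4);
// Cutting at character 2 keeps both characters whole.
$cleanCut = mb_substr($word, 0, 2, 'UTF-8');

// mb_check_encoding() reports whether a string is valid UTF-8.
$brokenIsValid = mb_check_encoding($brokenCut, 'UTF-8');
$cleanIsValid = mb_check_encoding($cleanCut, 'UTF-8');

printf("bytes: %d, characters: %d\n", $byteCount, $charCount);
printf("substr() result valid UTF-8? %s\n", $brokenIsValid ? 'yes' : 'no');
printf("mb_substr() result valid UTF-8? %s\n", $cleanIsValid ? 'yes' : 'no');
```

The byte-based cut leaves a string that is no longer valid UTF-8, which is the 'split in half' effect described above.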
There are a number of great articles online about UTF-8 and how it works - Joel Spolsky's comes to mind - but very few about how to actually get it working with PHP and iron out all the bugs. So here, to save you the time we put in, is a quick cheatsheet and info about a few common issues.
1. Update your database tables to use UTF-8
CREATE DATABASE db_name
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci
;
ALTER DATABASE db_name
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci
;
ALTER TABLE tbl_name
DEFAULT CHARACTER SET utf8
COLLATE utf8_general_ci
;
2. Install the mbstring extension for PHP
Windows: download the DLL if it's not in your PHP extensions folder, and uncomment the relevant line in your php.ini file:
extension=php_mbstring.dll
Linux: yum install php-mbstring
3. Configure mbstring
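Do this in php.ini, httpd.conf or .htaccess. (In httpd.conf or .htaccess, write each setting as 'php_value name value' without the '=', and use 'php_flag' instead for On/Off settings such as mbstring.encoding_translation.)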
mbstring.language = Neutral ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On ; HTTP input encoding translation is enabled
mbstring.http_input = auto ; Set HTTP input character set detection to auto
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order = auto ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset = UTF-8 ; Default character set for auto content type header
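If you can't edit the server configuration, some of these settings can also be applied at runtime. The following is only a sketch of that approach, assuming it runs at the top of a bootstrap file included by every script; it does not cover the HTTP input and output translation settings.

```php
<?php
// Assumption: this runs before any string handling or output.
if (!extension_loaded('mbstring')) {
    throw new RuntimeException('The mbstring extension is not loaded.');
}

mb_language('uni');               // language-neutral, UTF-8
mb_internal_encoding('UTF-8');    // default encoding for mb_* functions
mb_substitute_character('none');  // drop invalid characters instead of printing them
ini_set('default_charset', 'UTF-8'); // charset for the automatic Content-Type header

// Read the settings back to confirm they took effect.
$internalEncoding = mb_internal_encoding();
$defaultCharset = ini_get('default_charset');
$substitute = mb_substitute_character();

printf("internal: %s, default_charset: %s, substitute: %s\n",
    $internalEncoding, $defaultCharset, $substitute);
```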
4. Deal with non-multibyte-safe functions in PHP
The fast-and-loose way to do this is with the following php configuration:
mbstring.func_overload = 7 ; All non-multibyte-safe functions are overloaded with the mbstring alternatives
But there are problems with this. php.net has a warning about this potentially affecting the whole server. And even if this isn't an issue for you, mbstring can make a mess of binary strings. The setting was also deprecated in PHP 7.2 and removed in PHP 8.0. So, a better route is to search your application code for the following functions, and replace them with mbstring's 'slot-in' alternatives:
mail() -> mb_send_mail()
strlen() -> mb_strlen()
strpos() -> mb_strpos()
strrpos() -> mb_strrpos()
substr() -> mb_substr()
strtolower() -> mb_strtolower()
strtoupper() -> mb_strtoupper()
substr_count() -> mb_substr_count()
ereg() -> mb_ereg()
eregi() -> mb_eregi()
ereg_replace() -> mb_ereg_replace()
eregi_replace() -> mb_eregi_replace()
split() -> mb_split()
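As a rough illustration of what the swap changes, the sketch below runs a few of the pairs above on the same string. The sample text is made up, and the encoding is passed explicitly so the result doesn't depend on the mbstring.internal_encoding setting.

```php
<?php
// Assumption: mbstring is installed and this file is saved as UTF-8.
$text = "naïve café";

// Byte offset versus character offset of the same substring.
$bytePos = strpos($text, "café");
$charPos = mb_strpos($text, "café", 0, 'UTF-8');

// Case conversion that understands accented letters.
$upper = mb_strtoupper($text, 'UTF-8');

// Take the first five characters without cutting "ï" in two.
$firstWord = mb_substr($text, 0, 5, 'UTF-8');

// Count occurrences of a multibyte character.
$accentCount = mb_substr_count($text, "é", 'UTF-8');

printf("strpos: %d, mb_strpos: %d\n", $bytePos, $charPos);
printf("upper: %s, first word: %s, accents: %d\n", $upper, $firstWord, $accentCount);
```

The two offsets differ because "ï" occupies two bytes but counts as one character, so mixing byte-based and character-based functions on the same string gives inconsistent positions.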
5. Sort out HTML entities
The htmlentities() function doesn't work automatically with multibyte strings. To save time, you'll want to create a wrapper function and use this instead:
/**
 * Encodes HTML safely for UTF-8. Use instead of htmlentities.
 *
 * @param string $var
 * @return string
 */
function html_encode($var)
{
    return htmlentities($var, ENT_QUOTES, 'UTF-8');
}
6. Check content-type headers
Check through your code for any text-based content-type headers, and append the UTF-8 charset, so the browser knows what it's working with:
header('Content-type: text/html; charset=UTF-8') ;
You should also repeat this in the <head> of HTML pages:
<meta http-equiv="Content-type" content="text/html; charset=UTF-8" />
7. Update email scripts
Email can be tricky. You'll need to update the content-type for any emails and text-based MIME parts to use UTF-8 encoding. You'll also need to alter the way in which headers are encoded to use UTF-8. mbstring provides a function, mb_encode_mimeheader(), to handle this for you, but it does make a mess of address lists: you'll need to encode the name and address parts separately, then compile them into an address list.
Be sure to encode the subject and other headers too - Korean speakers will tend to put Korean text in the subject.
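One way to build such an address list is sketched below. The helper names encode_address() and encode_address_list() are hypothetical, as are the example recipients; the sketch assumes email addresses are plain ASCII and so only the display name is MIME-encoded. It does not handle display names containing commas or quotes.

```php
<?php
// mb_encode_mimeheader() reads its input using the internal encoding.
mb_internal_encoding('UTF-8');

/** Hypothetical helper: encodes one "Name <address>" entry. */
function encode_address($name, $email)
{
    // Encode only the display name; the address stays as plain ASCII.
    $encodedName = mb_encode_mimeheader($name, 'UTF-8', 'B');
    return $encodedName . ' <' . $email . '>';
}

/** Hypothetical helper: joins entries from an email => name map. */
function encode_address_list(array $recipients)
{
    $parts = array();
    foreach ($recipients as $email => $name) {
        $parts[] = encode_address($name, $email);
    }
    return implode(', ', $parts);
}

$to = encode_address_list(array(
    'kim@example.com' => '김철수',
    'zoe@example.com' => 'Zoë Müller',
));
$subject = mb_encode_mimeheader('회의 일정', 'UTF-8', 'B');
$decodedName = mb_decode_mimeheader(mb_encode_mimeheader('김철수', 'UTF-8', 'B'));

echo $to, "\n", $subject, "\n";
```

The resulting $to and $subject values can then be passed to whichever mail function you use, along with a 'Content-Type: text/plain; charset=UTF-8' header for the body.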
8. Check binary files and strings
Finally, double check any binary files and strings handled by PHP, particularly uploads, downloads and encryption. In some cases it may be necessary to revert to ASCII just before a download or processing a binary string.
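A related technique, shown here as a sketch rather than the approach described above, is to pass mbstring's '8bit' encoding explicitly whenever a string holds raw bytes, so that lengths and offsets are always counted in bytes. The sample payload is made up for illustration.

```php
<?php
// Four raw bytes; the first two happen to form "é" when read as UTF-8.
$payload = "\xC3\xA9\x00\x01";

// Read as UTF-8 text, the first two bytes collapse into one character.
$lengthAsText = mb_strlen($payload, 'UTF-8');

// Read as '8bit', every byte counts, which is what binary data needs.
$lengthAsBytes = mb_strlen($payload, '8bit');

// Byte-exact slicing, for example to read a fixed-size file header.
$header = mb_substr($payload, 0, 2, '8bit');

printf("as text: %d, as bytes: %d, header bytes: %d\n",
    $lengthAsText, $lengthAsBytes, mb_strlen($header, '8bit'));
```

This matters most for values such as Content-Length headers and encryption block sizes, where a character count in place of a byte count silently corrupts the data.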