Converting Documents to UTF-8
Earlier this summer I moved my company’s website to a new server and upgraded all of the software to the latest versions. The upgrade went remarkably smoothly except for one issue with the character encodings on the website. My company’s website contains a great deal of bilingual content in English and Français, and all of the accented French characters (ç á í ó ú) were displaying as garbled ISO-8859-1 renderings of multibyte characters: Ã© instead of é, for instance.
Upgrading from MySQL 4.0 to 4.1 had put my database into UTF-8 mode, so I either had to figure out how to convert the database back to latin1 or bite the bullet and go UTF-8 across the entire website.
I realized that if I jumped over this small hurdle I would be better off in the long run, so I began looking into converting all of the static content to UTF-8.
iconv
iconv is a tool that performs character set conversions on files. It is quite simple to use: specify the source and target encodings of the file, and the re-encoded content is sent to stdout.
iconv -f iso8859-1 -t utf-8 latin1_file.txt > utf-8_file.txt
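To see what the conversion actually does to the bytes, here is a small sketch. The file names latin1_sample.txt and utf8_sample.txt are made up for the illustration, and it assumes an iconv that accepts the encoding names ISO-8859-1 and UTF-8.

```shell
# Write "é" as a single ISO-8859-1 byte (octal 351 = hex e9) plus a newline.
printf '\351\n' > latin1_sample.txt

# Convert to UTF-8. iconv writes to stdout, so redirect into a new file
# rather than back onto the input file.
iconv -f ISO-8859-1 -t UTF-8 latin1_sample.txt > utf8_sample.txt

# Dump the bytes of the result. In UTF-8, "é" is the two bytes c3 a9,
# so the converted file is one byte longer than the original.
od -An -tx1 utf8_sample.txt
```

If a browser or editor then reads those two bytes as ISO-8859-1, it shows them as two separate characters, which is exactly the kind of garbling described above.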
Changing fileencoding within Vim
If you compiled Vim with multibyte support, you can use it for editing and even converting Unicode documents. Set the fileencoding option and then write the file, and Vim saves it in the new encoding.
set fileencoding=utf-8
To convert from the command line, you can pass Vim commands to execute with the -c option.
vim -c 'set fileencoding=utf-8' -c 'wq' latin1_file.txt
Batch File Conversion with Vim
Now in my case I needed to convert the entire source tree to UTF-8. By using find’s -exec option, I was able to use Vim to convert each PHP document it found in my search path.
I used Vim because if it does encounter a problem while converting, it will pause and prompt me for action, and I trust it to be non-destructive.
find /path/to/search -name '*.php' -exec vim -c 'set fileencoding=utf-8' -c 'wq' {} \;
This command flew over approximately 3500 PHP documents and converted them all to UTF-8 in about 3 minutes.
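For comparison, the same batch job can be sketched with iconv instead of Vim. This is not what I ran. It is a rough alternative that assumes every file that is not already valid UTF-8 is ISO-8859-1. The function name convert_tree and the .tmp suffix are made up for the example. Unlike the Vim approach, iconv will not stop and ask, so the sketch checks each file first and skips anything that already decodes as UTF-8, which keeps a second run from double-encoding the files.

```shell
# Hypothetical helper: convert every *.php file under a directory from
# ISO-8859-1 to UTF-8, skipping files that already decode as UTF-8.
convert_tree() {
    find "$1" -type f -name '*.php' | while IFS= read -r f; do
        # If iconv can read the file as UTF-8 it is already valid UTF-8
        # (pure ASCII files fall into this group too), so leave it alone.
        if iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
            continue
        fi
        # Convert into a temporary file first so that a failed conversion
        # never truncates the original.
        if iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.tmp"; then
            # Copy the contents back so the original file keeps its
            # permissions, then remove the temporary file.
            cat "$f.tmp" > "$f"
            rm -f "$f.tmp"
        else
            rm -f "$f.tmp"
            echo "conversion failed: $f" >&2
        fi
    done
}
```

It would be called as convert_tree /path/to/search. One caveat: a latin1 file whose accented bytes happen to form valid UTF-8 sequences would be skipped, which is rare in real text but possible, so it is worth spot-checking the results. File names containing newlines are also not handled by this loop.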
