Tuesday, January 22, 2013

Tips for Easy UTF-8 Ruby Adventuring

Getting that search box working in Esperanto? Cherokee? Pull on your wading boots because you're walking into deep waters. I can't make you an expert in UTF-8, but I can recommend that you know the following stuff before you venture forth:
  • Make sure your DB is configured to support UTF-8. Configuration is DB-specific, so please see the documentation for your respective DB.
  • Make sure your Ruby source code supports UTF-8. You might be surprised to find out that Ruby 1.9 treats your source code as US-ASCII by default. Take some time to learn about the magic encoding comment.
  • Make sure your regexes support UTF-8. The standard shorthand classes \w, \s and \d only match ASCII characters, so use POSIX bracket expressions like [[:alpha:]] or Unicode character properties like \p{L} instead.
  • In Ruby 1.9, upcase and downcase only change ASCII letters, so they won't work for the non-ASCII characters in your UTF-8 strings, but there is a gem for that! Check out unicode_utils.
  • If you want to compare Unicode strings in MySQL, have a look at collation in their documentation and know the difference between utf8_general_ci (case-insensitive) and utf8_bin (compares raw byte values). You might be surprised how loose the default matching is.
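
To make the source encoding and regex tips a bit more concrete, here is a minimal sketch, assuming Ruby 1.9 or later. The sample word and the variable names are mine, purely for illustration. The magic comment only counts when it is the first line of the file (or the second, after a shebang).

```ruby
# encoding: utf-8
# The magic comment above tells Ruby 1.9 to read this file as UTF-8.
# Ruby 2.0 and later already default to UTF-8 source, so there it is harmless.

# Hypothetical sample input: the Esperanto word for "horse".
word = "ĉevalo"

# \w only knows ASCII word characters, so the leading "ĉ" is skipped.
ascii_only = word.scan(/\w+/)

# POSIX bracket expressions are Unicode-aware in Ruby.
posix_bracket = word.scan(/[[:alpha:]]+/)

# Unicode character properties work too; \p{L} means "any letter".
unicode_property = word.scan(/\p{L}+/)

puts ascii_only.inspect        # ["evalo"]
puts posix_bracket.inspect     # ["ĉevalo"]
puts unicode_property.inspect  # ["ĉevalo"]
```

If your search box tokenizes input with \w, this is roughly how a query can silently lose its non-ASCII letters.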
