UTF-8 for the Impatient | skybert.net

UTF-8 for the Impatient


This is a no-nonsense guide to getting your full stack Unicode compliant. No more squares, no more question marks.


Ready, set, go!

To get you off to a good start, I'll mention that Unicode is a character set and UTF-8 is one of several encodings for it. For all practical use cases, UTF-8 is the best Unicode encoding and the one you should use throughout your stack.
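To make the character set versus encoding distinction concrete, here is a minimal Java sketch. The class and method names are mine, purely for illustration. It takes a Unicode code point (the character set part) and counts how many bytes UTF-8 (the encoding part) needs for it.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {

    /** Number of bytes UTF-8 needs to encode the given Unicode code point. */
    static int utf8Length(int codePoint) {
        // Turn the code point into a String (one or two UTF-16 chars) ...
        String s = new String(Character.toChars(codePoint));
        // ... and let the JDK encode it as UTF-8.
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // LATIN CAPITAL A, o with diaeresis, BLACK HEART SUIT, GHOST
        int[] samples = {0x41, 0xF6, 0x2665, 0x1F47B};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s) in UTF-8%n", cp, utf8Length(cp));
        }
    }
}
```

The four samples need 1, 2, 3 and 4 bytes respectively. That is the same 3 byte versus 4 byte split that shows up again further down for terminals, fonts and MySQL.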

I highly recommend reading the UTF-8 and Unicode FAQ for Unix/Linux and browsing the beautifully presented Unicode Character Table, but for now, let's jump straight into the nitty-gritty details of how to turn your stack, your app, into a fully Unicode speaking and reading system.

UNIX, Linux & Cygwin

There are two important environment variables you need to set to a UTF-8 locale. This locale must exist on your machine.

Debian, Ubuntu & friends

See all UTF-8 locales on your system using the locale command from the libc-bin package:

$ locale -a | grep -i utf

If you cannot see any, run this command, select some and generate them:

# dpkg-reconfigure locales

System default locale

Running dpkg-reconfigure locales will also set the default encoding on your system.

These defaults (one for locale and one for language) are written to the file /etc/default/locale, so if you want to find out what the default, fallback encoding is on a Debian based system, you can look there.

BASH

Set your LANG and LC_ALL environment variables to one of the UTF-8 locales available on your system.

export LANG=en_GB.utf8
export LC_ALL=en_GB.utf8

If you put these lines in your $HOME/.bashrc, they'll take effect whenever you log in or open a new shell.
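If you want your application to complain early when it is started from a shell without these settings, a check along these lines might do. It's only a sketch: the class and method names are made up, and it assumes the usual language_TERRITORY.codeset@modifier shape of POSIX locale names.

```java
import java.util.Locale;

final class LocaleCheck {

    /** True if a locale name such as "en_GB.utf8" or "en_GB.UTF-8" names a UTF-8 codeset. */
    static boolean isUtf8Locale(String localeName) {
        if (localeName == null) {
            return false;
        }
        int dot = localeName.indexOf('.');
        if (dot < 0) {
            return false; // no codeset part at all, e.g. "C" or "en_GB"
        }
        String codeset = localeName.substring(dot + 1);
        int at = codeset.indexOf('@'); // strip modifiers such as "@euro"
        if (at >= 0) {
            codeset = codeset.substring(0, at);
        }
        // Accept both spellings: "utf8" and "UTF-8".
        return codeset.replace("-", "").toLowerCase(Locale.ROOT).equals("utf8");
    }

    public static void main(String[] args) {
        for (String name : new String[] {"LANG", "LC_ALL"}) {
            String value = System.getenv(name);
            System.out.println(name + "=" + value + " -> UTF-8? " + isUtf8Locale(value));
        }
    }
}
```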

Terminal


Your terminal emulator must support Unicode. I love urxvt, even though it only supports UTF-8 characters of up to 3 bytes. For 4 byte UTF-8 characters, you have to use a terminal like gnome-terminal or konsole.

Fonts

If you're seeing squares instead of characters, it means that the font you're using is missing a glyph to represent that character.

The trick is to pick a font which has support for the characters that you need. There are a good number of fonts which support everything up to and including the 3 byte UTF-8 characters, but not the 4 byte characters (like 👻). In some contexts, you can have a primary font and several fallback fonts which may provide the more exotic characters.

My favourite fonts at the moment are:

To use Adobe Source Code Pro, I start my urxvt like this:

urxvt -fn 'xft:Source Code Pro:pixelsize=14'

DB

MySQL, MariaDB & Percona

The MySQL utf8 table & column encoding is not real UTF-8: it only supports characters of 1 to 3 bytes. For full UTF-8 support, use the utf8mb4 character set instead.
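If you're stuck with a 3 byte utf8 column for a while, it can help to know which strings would not fit. Here's a small sketch (the names are hypothetical) which relies on the fact that the characters needing 4 bytes in UTF-8 are exactly the supplementary code points, those above U+FFFF.

```java
final class ThreeByteUtf8Check {

    /** True if every code point of the text needs at most 3 bytes in UTF-8. */
    static boolean fitsInThreeByteUtf8(String text) {
        // Supplementary code points (above U+FFFF) are the ones that need 4 bytes.
        return text.codePoints().noneMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        // "Ver\u00f6ffentlicht" contains only BMP characters.
        System.out.println(fitsInThreeByteUtf8("Ver\u00f6ffentlicht"));
        // "\uD83D\uDC7B" is the surrogate pair for U+1F47B GHOST.
        System.out.println(fitsInThreeByteUtf8("Boo \uD83D\uDC7B"));
    }
}
```

The first call prints true, the second false, since the ghost is one of those 4 byte characters.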

Create new DBs with 4 byte UTF-8 character support

mysql> create database mydb character set utf8mb4 collate utf8mb4_general_ci;

Check the default encoding and collation

mysql> select schema_name, default_character_set_name, default_collation_name
       from information_schema.schemata;

Java

Resource bundles

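Contrary to popular myth, it is possible to use UTF-8 in your .properties files (aka resource bundles). Up to and including Java 8, ResourceBundle decodes these files as ISO-8859-1, so you just need to recode the values like this (from Java 9 on, property resource bundles are read as UTF-8 by default and the recoding is no longer needed):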

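import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.ResourceBundle;

public String getPropertyFromUTF8File(final String pKey) {
  ResourceBundle bundle = ResourceBundle.getBundle(
    "ghost-text-utf8", Locale.ENGLISH);
  String value = bundle.getString(pKey);
  // The value was decoded as ISO-8859-1: get the raw bytes back
  // and decode them as the UTF-8 they really are.
  return new String(
    value.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
}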
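If you load the .properties file yourself instead of going through ResourceBundle, there is an alternative: java.util.Properties can read from a Reader, and then you decide the encoding. This is just a sketch with names I made up. The main method feeds it an in-memory "file" so that it runs without anything on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class Utf8Properties {

    /** Loads a .properties stream, decoding its bytes as UTF-8. */
    static Properties load(InputStream in) throws IOException {
        Properties props = new Properties();
        // Properties.load(Reader) uses the characters as given, so the
        // decoding is entirely up to the reader we hand it.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a UTF-8 encoded file; written with escapes so that
        // this source file itself stays pure ASCII.
        byte[] file = "greeting=Ver\u00f6ffentlicht \u2665\n"
            .getBytes(StandardCharsets.UTF_8);
        Properties props = load(new ByteArrayInputStream(file));
        System.out.println(props.getProperty("greeting"));
    }
}
```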

File encoding

The encoding of the actual Java source files only affects the characters which you write in the .java files themselves. The file encoding does not affect in any way the data that flows through the Java program you write.

Maven

Have you ever seen this one?

[WARNING] File encoding has not been set, using platform encoding
          UTF-8, i.e. build is platform dependent!

Just add this to your POM to get rid of it:

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

You can now use UTF-8 in your source files. A ♥ looks so much better than \u2665.

Another thing to check is that your Maven project hasn't overridden the compiler encoding:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  [..]
  <configuration>
    <compilerArguments>
      <encoding>ISO-8859-1</encoding>
    </compilerArguments>
  </configuration>
</plugin>

The above changes how javac decodes your source files, which breaks code like this:

String germanWord = "Veröffentlicht";
String germanWordInDB = readWordFromDB();
assertEquals(germanWord, germanWordInDB);

This will fail because the code points of germanWord will be completely off, since the String is defined in the .java file itself. Even though the source file is encoded as UTF-8, javac will decode it as ISO-8859-1 when running with the above setting in Maven, turning the two UTF-8 bytes of the ö into two separate characters.

To sort this nonsense out, just remove the <compilerArguments> block from your POM. Maven will then pick up project.build.sourceEncoding mentioned above (or fall back to your system's default encoding).
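You can reproduce what happens to germanWord without involving Maven at all. The sketch below (class and method names are mine) encodes a string as UTF-8 and then decodes those bytes as ISO-8859-1, which is in effect what javac does to the literal when it is told the wrong source encoding.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    /** Decodes the UTF-8 bytes of the text as ISO-8859-1, like a misconfigured reader would. */
    static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Written with an escape so this file compiles under any source encoding.
        String germanWord = "Ver\u00f6ffentlicht";
        String misread = misreadAsLatin1(germanWord);

        System.out.println(germanWord.length()); // 14 characters
        System.out.println(misread.length());    // 15: the o-umlaut became two characters
        System.out.println(germanWord.equals(misread));
    }
}
```

The misread string is VerÃ¶ffentlicht, one character longer than the original, so the comparison with the value from the database can never succeed.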

JDBC

Add the following to your MySQL JDBC connection string (if the URL lives in an XML file, write each & as &amp;):

useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8
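In Java code, the parameters simply go after the ? of the URL, with plain ampersands. A minimal sketch follows. The host, port and database name are placeholders, and actually opening a connection would require the MySQL driver on the classpath, which this snippet does not assume.

```java
final class JdbcUrl {

    /** Builds a MySQL JDBC URL carrying the three encoding parameters. */
    static String mysqlUtf8Url(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
            + "?useUnicode=true"
            + "&characterEncoding=UTF-8"
            + "&characterSetResults=UTF-8";
    }

    public static void main(String[] args) {
        // Hypothetical connection details, for illustration only.
        System.out.println(mysqlUtf8Url("localhost", 3306, "mydb"));
    }
}
```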

JVM parameters

This sets the encoding the JVM uses when parsing the command line arguments, reading environment variables and looking up the name of the main Java class:

-Dsun.jnu.encoding=utf-8

This sets the default encoding of the files written by Java as well as the encoding of JTextField and JTextArea and their subclasses:

-Dfile.encoding=utf-8
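To see what a given JVM actually ended up with, you can print the two properties next to the default charset. A small sketch (the class name is mine) is below. Note that sun.jnu.encoding is an internal property, so it may be absent on some JVMs, and that from JDK 18 on the default charset is UTF-8 regardless of the locale unless you override it.

```java
import java.nio.charset.Charset;

final class EncodingReport {

    /** One-line summary of the encodings this JVM is running with. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + ", sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding")
            + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Try running it with and without the two -D flags above to see the difference on your machine.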

Default language/locale in servlets

If the client hasn't requested any specific language in an Accept-Language HTTP request header, ServletRequest#getLocales() will return the server's default locale.

HTTP

HTTP calls a character encoding a character set (!):

Content-Type: text/html; charset=iso-8859-1

The one to blame is MIME, the specification which gives us bold text and pictures in our emails. MIME (RFC 1521, September 1993, later revised as RFC 2045, November 1996) used the term "charset", so HTTP/1.0 (RFC 1945, May 1996), wanting to keep the terminology consistent, also went with this term. It did add a note, however:

Note: This use of the term "character set" is more commonly referred to as a "character encoding." However, since HTTP and MIME share the same registry, it is important that the terminology also be shared.

No wonder people get confused!
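Whatever you call it, the charset parameter names the encoding to decode the body with. Here's a rough sketch of pulling it out of a Content-Type header value. The names are hypothetical and the parsing is deliberately simplistic (it splits on semicolons and doesn't handle every corner of the header grammar), so treat it as an illustration rather than a parser to ship.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or the fallback if there is none. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("text/html; charset=iso-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs.name());
    }
}
```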

HTML

Since HTTP uses charset to mean character encoding, HTML does too. This meta element ensures that UTF-8 encoded text is rendered correctly on your page:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

If you're using forms, you can specify the encoding(s) the server accepts like this:

<form
  method="post"
  accept-charset="UTF-8"
  enctype="multipart/form-data"
>
...
</form>

XML

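The XML specification makes UTF-8 the default encoding: a document without an encoding declaration or byte order mark is read as UTF-8. All XML parsers must support at least UTF-8 and UTF-16. You can still declare the encoding explicitly: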

<?xml version="1.0" encoding="utf-8"?>
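A practical consequence on the Java side: hand the parser bytes, not a String or a Reader, and it will work out the encoding from the declaration itself. Below is a minimal sketch using the JDK's built-in DOM parser. The class name and the sample document are made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class XmlUtf8Demo {

    /** Parses the XML bytes and returns the text content of the root element. */
    static String rootText(byte[] xmlBytes) throws Exception {
        // The parser reads the encoding from the XML declaration in the bytes.
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Escapes keep this source file pure ASCII; the bytes are real UTF-8.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
            + "<word>Ver\u00f6ffentlicht</word>";
        System.out.println(rootText(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```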

Server Sent Events (SSE)

Server-Sent Events are always UTF-8 encoded.

Converting a file to UTF-8 on the command line

To convert one (or a thousand) text files to UTF-8 on the command line, use the standard iconv utility, which is part of the GNU C library. On Debian, it's provided by the libc-bin package:

$ iconv -f ISO-8859-1 -t UTF-8 my-file.xml -o my-file.xml.utf8
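If you need the same conversion from inside a Java program, it boils down to decoding with the old charset and encoding with the new one. Here's a minimal sketch with names I made up. It reads the whole file into memory, so it suits ordinary text files rather than huge ones.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class Recode {

    /** Reads source using sourceCharset and writes the same text to target as UTF-8. */
    static void toUtf8(Path source, Charset sourceCharset, Path target) throws IOException {
        // Decode: bytes in the old encoding -> characters.
        String text = new String(Files.readAllBytes(source), sourceCharset);
        // Encode: characters -> UTF-8 bytes.
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Recode my-file.xml my-file.xml.utf8
        toUtf8(Paths.get(args[0]), StandardCharsets.ISO_8859_1, Paths.get(args[1]));
    }
}
```

Like iconv, this trusts you on the source encoding: every byte sequence is valid ISO-8859-1, so a wrong guess produces garbage rather than an error.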

Checking encoding of a text file from the command line

$ file -i /tmp/test.txt.latin1
/tmp/test.txt.latin1: text/plain; charset=iso-8859-1
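The file command makes an educated guess. For the narrower question "are these bytes valid UTF-8?", a strict decoder can give a definite answer. Here's a sketch of that check in Java (the class and method names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validator {

    /** True if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting replacement characters for bad input.
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {'V', (byte) 0xC3, (byte) 0xB6};  // o-umlaut as UTF-8
        byte[] latin1 = {'V', (byte) 0xF6};             // o-umlaut as ISO-8859-1
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

The first call prints true and the second false. Keep in mind that pure ASCII passes too, and that passing the check doesn't prove the author meant UTF-8, only that the bytes are well formed.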

Editors

VIM

On my machine, using VIM 7.4 and UTF-8 compatible locale settings (see the notes on LC_ALL above), files that contain non-ASCII characters are automatically saved using UTF-8 encoding.

If you prefer to be explicit about it, to always use UTF-8, add this to your .vimrc:

set encoding=utf-8
set fileencoding=utf-8

Emacs

Emacs will respect whatever encoding an existing file uses, but you can ask it to prefer UTF-8 when it has a choice (like creating a new file):

(prefer-coding-system 'utf-8-unix)

You may also ask Emacs to use a specific Unicode friendly font:

(set-face-attribute 'default nil
                    :family "Source Code Pro"
                    :height 100
                    :weight 'normal
                    :width 'normal)

EditorConfig

To make all EditorConfig compliant editors use UTF-8 encoding, set this in the project's .editorconfig:

[*]
charset = utf-8

JSPs

Put the following line at the top of your page; it will ensure that the Content-Type HTTP header is set appropriately. Use whatever MIME type you want; most if not all of them allow the charset parameter.

<%@page
  contentType="text/html; charset=UTF-8"
  pageEncoding="UTF-8"
%>

If you omit this, the default value is text/html; charset=ISO-8859-1!


Licensed under CC BY Creative Commons License ~ gmail torstein.k.johansen @ gmail ~ twitter @torsteinkrause ~