Squares & � marks

An adventure into the world of character sets and encodings

by torstein@escenic.com

👻

I've been a programmer for 16 years and there are two problems that always haunt me.

👻

Date conversion

Character encoding

Even Jira

jira encoding tweet

whois

Keyboard maniac at Escenic making a GREAT Content Management System for the media industry

escenic

Customers all over the world: From the Daily Mirror in the UK, Thai PBS in ประเทศไทย to Dinamani in இந்தியா

Let's talk about

Best regards

Viele Grüße

Beste ønsker

Or just

���

Goals for this talk

Know

  • What a *character set* is and what a *character encoding* is
  • How to tell encoding problems from display problems
  • How to use UTF-8 everywhere

But first, a wee

Quiz

to get your brain cells going

Unicode is ....

  1. An encoding
  2. A character set

UTF-8 is ....

  1. An encoding
  2. A character set

Can you store

"Alpha and Ω" in a database with ISO 8859-1?

  1. Yes
  2. No

If you see big squares

instead of letters, it's because ...

  1. Encoding problem
  2. Missing font
  3. Decoding problem

My.java has Windows 1252 encoding

will æ, ø and å be written correctly?

  1. It depends
  2. Yes
  3. No

What has happened?

Kjører

  1. Wrong font
  2. Encoding/decoding mismatch

Part I

Crash course in

Character sets & encodings

ASCII

American Standard Code for Information Interchange

usa & uk

ASCII

  • An absolute genius of a standard
  • ...as long as you speak English

ASCII

Character   Decimal   Binary
A           65        1 0 0 0 0 0 1
B           66        1 0 0 0 0 1 0

  • One character corresponds to one numeric value
  • 7 bits

ASCII

Character   Decimal   Binary
a           97        1 1 0 0 0 0 1
b           98        1 1 0 0 0 1 0

Value for upper case letter + 32 = value for the lower case letter. Brilliant!
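
The +32 trick is easy to check in a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and the trick only holds for the ASCII letters A to Z.

```java
class AsciiCase {
    // Lower-cases an ASCII upper case letter by adding 32 (0x20).
    // Only valid for 'A'..'Z'; any other character is returned unchanged.
    static char toLower(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + 32) : c;
    }

    public static void main(String[] args) {
        System.out.println(toLower('A')); // prints a
        System.out.println((int) 'A' + " + 32 = " + (int) toLower('A')); // prints 65 + 32 = 97
    }
}
```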

Then came the Europeans

Columbus

Need for new characters that didn't exist

No room in the inn

  • All the 128 rooms (0 to 127) were taken
  • ...so they added another bit in front

0 1 0 0 0 0 0 1

256 characters, hurrah!

Europeans could now enter the computer age

Endless

possibilities

with 8 bits:

  • Code pages
  • ISO-8859-*

Then came the Asians

Dragon

你好嗎?

A whole lot of nonsense ☠

  • Many made their own character sets
  • Incompatibility all around

Finally peace

Unicode

Unicode

  • Caters for the characters of all written languages, living and dead
  • Today has more than 110 000 characters
  • And plenty of room to spare

Unicode is a

Character set

Character set

A defined list of characters recognized by the computer hardware and software. Each character is represented by a number.

webopedia.com

Each entry has

  • A numeric value (code point)
  • A name

Unicode examples

Character   Code point   Name
Ω           937          GREEK CAPITAL LETTER OMEGA
Å           197          LATIN CAPITAL LETTER A WITH RING ABOVE

How to find the Unicode value

Also known as the code point of a character.

For instance "~" (tilde)?

Java

String c = "~";
int unicodeCodepoint = (int) c.charAt(0);

Java 🐘

String s = "🐘";
int unicodeCodepoint = Character.codePointAt(s, 0);

JavaScript

var c = "~";
var unicodeCodepoint = c.codePointAt(0);

Wikipedia

http://en.wikipedia.org/wiki/~

Emacs

M-x describe-char

Emacs

emacs describe char

RTFM

Only a few key presses away on any Unix:

$ man ascii
$ man charset
$ man unicode
$ man utf-8
$ man iconv

So what's UTF-8 then?

UTF-8 is a character

encoding

Character encoding

... a character encoding is used to represent a repertoire of characters by some kind of an encoding system

wikipedia.org

UTF-8

  • ASCII compatible
  • A standard so simple you can explain it on a napkin

Who is this?

ken and ritchie

linuxbeard.com

Created over a meal

  • Ken Thompson & Rob Pike went for dinner, September 1992.
  • UTF-8 was invented by Ken Thompson
  • Wrote it down on a placemat

Killer features

  1. All ASCII strings are valid UTF-8
  2. Every byte of an ASCII string encoded in UTF-8 has 0 as its first bit.
  3. Easy to navigate and find current, previous and next character.
  4. Never a zero byte (eight 0s in a row), unless the character is NUL itself

UTF-8 - 1 byte

                                   0xxxxxxx

UTF-8 - 2 bytes

                          110xxxxx 10xxxxxx
                                   0xxxxxxx

UTF-8 - 3 bytes

                 1110xxxx 10xxxxxx 10xxxxxx
                          110xxxxx 10xxxxxx
                                   0xxxxxxx

UTF-8 - 4 bytes

       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                1110xxxx 10xxxxxx 10xxxxxx
                         110xxxxx 10xxxxxx
                                  0xxxxxxx
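
To see that the napkin really is enough, here is a minimal sketch that fills in the x-es of the four patterns by hand and compares the result with what the JDK produces. The class and method names are made up for this example, and it skips validation (surrogates, values above U+10FFFF), so treat it as an illustration rather than a production encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one Unicode code point into UTF-8 bytes.
    // No validation of the input is done in this sketch.
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {    // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp < 0x10000) {  // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        int ghost = 0x1F47B; // the ghost emoji
        byte[] byHand = encode(ghost);
        byte[] byJdk = new String(Character.toChars(ghost))
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(byHand, byJdk)); // prints true
    }
}
```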

The difference between Unicode and UTF-8

  • Unicode is a table with numeric values and names for all characters in the whole wide world.

  • UTF-8 is one of several ways to encode a Unicode numeric value to bytes.

Why is this important?

  • UTF-8 : ASCII compatible
  • UTF-16 : not ASCII compatible (Windows & Java)
  • UTF-32 : not ASCII compatible
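
A small sketch of what "ASCII compatible" means in practice: the letter A comes out as the very same byte in ASCII and UTF-8, but not in UTF-16. The class name is made up for the example; the charsets are the ones that ship with every JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiCompat {
    public static void main(String[] args) {
        byte[] ascii = "A".getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = "A".getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = "A".getBytes(StandardCharsets.UTF_16BE);

        System.out.println(Arrays.equals(ascii, utf8));  // prints true
        System.out.println(Arrays.equals(ascii, utf16)); // prints false
        System.out.println(utf16.length);                // prints 2
    }
}
```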

Part II

Character encoding in Java

HTML, HTTP & friends

Java

Java's internal representation of strings is UTF-16.

final String name = getNameFromFacebook(id);

Specify the encoding whenever you can
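
A hedged sketch of what goes wrong when you don't: the same bytes decoded with two different charsets. The class and method names are invented for the example, and the word is written with an escape so the sketch does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeSketch {
    // Encodes the text as UTF-8 and decodes the bytes with the given charset.
    static String roundTrip(String text, Charset decodeWith) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "Kj\u00F8rer";
        String right = roundTrip(original, StandardCharsets.UTF_8);
        String wrong = roundTrip(original, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // prints true
        System.out.println(wrong.equals(original)); // prints false
        System.out.println(wrong.length());         // prints 7
    }
}
```

The wrongly decoded string is the garbled word from the quiz: the two UTF-8 bytes of the Norwegian letter are shown as two separate ISO-8859-1 characters.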

Resource bundles

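  • Up to and including Java 8, resource bundles (.properties files) must be encoded in ISO-8859-1. From Java 9, ResourceBundle reads .properties files as UTF-8 by default.
  • Characters that don't fit into ISO-8859-1 must therefore be represented using Unicode escape notation:

$ native2ascii -encoding utf8 ghost-text-utf8 ghost-text.properties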

A ghostly example 👻

$ cat ghost-text-utf8
ghost_title=This is a 👻

$ native2ascii -encoding utf8 ghost-text-utf8
ghost_title=This is a \ud83d\udc7b

Before you

fall in love💘

with the Unicode escapes

cons

  • The escapes are translated before the code is compiled
  • Harmless comments become hidden backdoors
  • Or just break the build 💣
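
Here is a minimal sketch of the "comment becomes code" problem. The class and method names are made up. The escape for a line feed inside the comment is translated to a real newline before the compiler looks at the comment, so the rest of that line is compiled as a statement.

```java
class EscapeSurprise {
    static String message() {
        String message = "untouched";
        // Looks like a comment, but hides a statement: \u000a message = "changed by a comment";
        return message;
    }

    public static void main(String[] args) {
        System.out.println(message()); // prints changed by a comment
    }
}
```

The same mechanism breaks the build when the text after the escape is not valid Java.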

Resource bundles

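  • Possible to use UTF-8 in your .properties files (on Java 8 and older) if you do:

// Needs java.nio.charset.StandardCharsets, java.util.Locale and
// java.util.ResourceBundle.
public String getPropertyFromUTF8File(final String pKey) {
  ResourceBundle bundle = ResourceBundle.getBundle(
    "ghost-text-utf8", Locale.ENGLISH);
  // The bundle was read as ISO-8859-1, so every UTF-8 byte became one char.
  String value = bundle.getString(pKey);
  // Turn the chars back into the original bytes and decode them as UTF-8.
  return new String(
    value.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
}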

Can you trust String#length()?

@Test
public void ghostIsOneCodeUnit() {
  final String ghost = "👻";
  assertEquals("ghost is just one character", 1, ghost.length());
}

Failed tests: ghostIsOneCodeUnit(GhostLengthFailingTest):
  ghost is just one character expected:<1> but was:<2>

A safer bet

Use String#codePointCount(from, to):

@Test
public void ghostIsOneCodePoint() {
  final String ghost = "👻";
  assertEquals(
    "ghost is just one character",
    1,
    ghost.codePointCount(0, ghost.length())
  );
}

Text rendering control

Number of visible glyphs on the screen:

BreakIterator.getCharacterInstance();
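
A minimal sketch of how BreakIterator can be used for this, assuming the default locale is fine for your text. The class and method names are made up. The example string is an "e" followed by a combining acute accent (U+0301): two code points, but one visible character. How newer things such as emoji sequences are grouped can vary between JDK versions.

```java
import java.text.BreakIterator;

class GlyphCount {
    // Counts user-perceived characters by walking the character boundaries.
    static int count(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String eAcute = "e\u0301";
        System.out.println(eAcute.length());                            // prints 2
        System.out.println(eAcute.codePointCount(0, eAcute.length()));  // prints 2
        System.out.println(count(eAcute));                              // prints 1
    }
}
```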

Maven

Ever seen this one?

[WARNING] File encoding has not been set, using platform encoding
          UTF-8, i.e. build is platform dependent!

Make your builds safe

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

You can now use UTF-8 in your source files

A ♥ looks so much better than \u2665

XML & JSON

<?xml version="1.0" encoding="utf-8"?>

Encoding in ANY file?

Your editor can specify the encoding when it writes the file to disk, burning a mark in it using a BOM.

Did you say BOM?

BOM

  • Is something that we can use if we cannot write the encoding into the file's contents.
  • For instance when we write a plain text file.

BOM

without bom

An example of a UTF-8 encoded file without BOM

BOM

with bom

An example of a UTF-8 encoded file with BOM
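
In bytes, the UTF-8 BOM is EF BB BF at the very start of the file. A small sketch of how a program could sniff for it; the class and method names are made up, and a real reader would also want to handle the UTF-16 BOMs.

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // True if the data starts with the three byte UTF-8 BOM (EF BB BF).
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        byte[] plain = "hello".getBytes(StandardCharsets.UTF_8);
        // U+FEFF is the BOM character, it becomes EF BB BF in UTF-8.
        byte[] marked = "\uFEFFhello".getBytes(StandardCharsets.UTF_8);

        System.out.println(startsWithUtf8Bom(plain));  // prints false
        System.out.println(startsWithUtf8Bom(marked)); // prints true
    }
}
```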

BOM - beware

if you're counting bytes

  • Some encodings automatically add a BOM
  • UTF-16 adds a two byte BOM
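
In Java you can watch the extra bytes appear. This sketch (class name made up) encodes a single letter with the JDK's UTF-16 charset, which writes a BOM, and with UTF-16BE, which does not.

```java
import java.nio.charset.StandardCharsets;

class Utf16BomCount {
    public static void main(String[] args) {
        byte[] withBom = "x".getBytes(StandardCharsets.UTF_16);
        byte[] withoutBom = "x".getBytes(StandardCharsets.UTF_16BE);

        System.out.println(withBom.length);    // prints 4
        System.out.println(withoutBom.length); // prints 2
        // The first two bytes are the BOM: FE FF
        System.out.printf("%02X %02X%n", withBom[0], withBom[1]); // prints FE FF
    }
}
```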

HTTP

When we surf on facebook.com or write Java code that consumes REST, RPC over HTTP and SOAP services, the server says which encoding the contents are serialised with:

$ GET http://vg.no
..
Content-Type: text/html; charset=iso-8859-1
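
If you read the response yourself, you have to pick that charset out of the header before decoding the body. A rough sketch, not a full header parser: the class name, the method name and the fallback parameter are assumptions made for this example. Note that Charset.forName throws an exception for names the JVM does not know.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    // Returns the charset named in a Content-Type value, or the fallback
    // when the header has no charset parameter.
    static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length())
                    .replace("\"", "").trim();
                return Charset.forName(name);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetOf(
            "text/html; charset=iso-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs); // prints ISO-8859-1
    }
}
```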

Wait!

Why does it say charset?

MIME is to blame

HTTP

  • HTTP wanted to keep the terminology consistent, but acknowledges:

Note: This use of the term "character set" is more commonly referred to as a "character encoding." However, since HTTP and MIME share the same registry, it is important that the terminology also be shared.

HTML

Inside the HTML file itself, it's also important that the encoding is correct so that the content is displayed properly in the web browser:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

Runtime environment

In addition to how the data are stored and serialised, it's also important that we take care of things on our side:

→ Fonts

JVM parameters

-Dsun.jnu.encoding=utf-8
-Dfile.encoding=utf-8

JDBC connection string

jdbc:jtds:sybase://db01:4100/mydb?characterEncoding=utf8
jdbc:mysql://db01:3306/mydb?autoReconnect=true&amp;\
                            useUnicode=true&amp;\
                            characterEncoding=UTF-8&amp;\
                            characterSetResults=UTF-8

UNIX locale

$ locale -a | grep en_GB.utf8
$ export LC_ALL=en_GB.utf8
$ export LANG=en_GB.utf8

Myth #1

The encoding of the Java file decides how the data that are written by this Java component are written to the database.

Myth #1 busted

The file encoding only decides how the characters in the Java file itself are stored and displayed:

/**
 * @author Søren Westergård
 */
static final String PRODUKT = "UFØ";

Myth #1 busted

Data encoding, on the other hand, decides how the data (which the Java program writes or reads) are read and written:

new OutputStreamWriter(out, "UTF-8");
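
A small sketch that shows the point: the same Java string ends up as different bytes depending on the charset handed to the writer, regardless of how the .java file was saved. The class and method names are made up, and the string is written with an escape so the example does not depend on the source file encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DataEncoding {
    // Writes the text through an OutputStreamWriter and returns the bytes.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String produkt = "UF\u00D8";
        System.out.println(write(produkt, StandardCharsets.UTF_8).length);      // prints 4
        System.out.println(write(produkt, StandardCharsets.ISO_8859_1).length); // prints 3
    }
}
```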

Myth #2

The encoding used inside system X affects the data it sends out and how our system stores these data.

Myth #2 busted

It's irrelevant that system X stores its data internally as Windows 1252 if the web services through which it exposes these data return XML encoded as UTF-8.

Part III

The database

The database

  • Many databases in Europe used (and several still use!) ISO-8859-1 encoding (http://no.wikipedia.org/wiki/ISO_8859-1).
  • If someone attempts to write a character into these databases that's not covered by ISO-8859-1, for instance "–", the Unicode EN DASH (http://no.wikipedia.org/wiki/Unicode)...
  • ...the database will throw an error up to the web application.
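
You can catch this in Java before the database does. A minimal sketch, with made-up class and method names, that asks the ISO-8859-1 encoder whether it can represent a string at all:

```java
import java.nio.charset.StandardCharsets;

class Latin1Check {
    // True if every character in the text exists in ISO-8859-1.
    static boolean fitsInLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsInLatin1("Beste \u00F8nsker"));   // prints true
        System.out.println(fitsInLatin1("Alpha and \u03A9"));    // prints false
        System.out.println(fitsInLatin1("2012\u20132014"));      // prints false, EN DASH
    }
}
```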

User → .. → DB 💣

User (ok!) → Flex (ok!) → BlazeDS (ok!) → Java (ok!) → Database (BANG!)

MySQL users beware

What's wrong with this statement?

mysql> create database mydb
       character set utf8
       collate utf8_general_ci;

MySQL utf8

The MySQL utf8 table & column encoding is not real UTF-8

MySQL utf8

Only 1-3 byte characters supported.

MySQL utf8mb4

For full UTF-8 support, use the utf8mb4 character set instead.
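
The characters that need four bytes in UTF-8 are exactly the ones above U+FFFF, which Java calls supplementary code points. A hedged sketch (class and method names invented here) for spotting strings that will not fit in a MySQL utf8 column:

```java
class Utf8mb4Check {
    // True if the text has at least one code point above U+FFFF,
    // which takes four bytes in UTF-8.
    static boolean needsFourBytes(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String ghost = new String(Character.toChars(0x1F47B));
        System.out.println(needsFourBytes("Alpha and \u03A9"));    // prints false
        System.out.println(needsFourBytes("This is a " + ghost));  // prints true
    }
}
```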

MySQL utf8mb4

This will fix obscure errors like:

jira encoding

When you thought you were done

Once your full stack is Unicode friendly

Collation may still haunt you 👻

Hæ?

Collation

  • How the database handles strings
  • Sorting
  • Comparing
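
The same idea exists outside the database. As an illustration only (this is Java's Collator, not MySQL's collations, and the class and method names are made up), here is how the sort order changes once a collation is involved, assuming English rules:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollationSketch {
    // Sorts by raw char values, so all upper case letters come first.
    static List<String> binarySort(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Sorts with the collation rules of the given locale.
    static List<String> collatedSort(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("zebra", "apple", "Mango");
        System.out.println(binarySort(words));                   // prints [Mango, apple, zebra]
        System.out.println(collatedSort(words, Locale.ENGLISH)); // prints [apple, Mango, zebra]
    }
}
```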

Fun fact

Before MySQL 8.0, the default collation in MySQL was "Swedish Latin 1" (latin1_swedish_ci).

So to sum up, let's take that

Quiz

again

Unicode is ....

  1. An encoding
  2. A character set

UTF-8 is ....

  1. An encoding
  2. A character set

Can you store

"Alpha and Ω" in a database with ISO 8859-1?

  1. Yes
  2. No

If you see big squares

instead of letters, it's because ...

  1. Encoding problem
  2. Missing font
  3. Decoding problem

My.java has Windows 1252 encoding

will æ, ø and å be written correctly?

  1. It depends
  2. Yes
  3. No

What has happened?

Kjører

  1. Wrong font
  2. Encoding/decoding mismatch

Summary

for the impatient

  • Use UTF-8 everywhere

Further exploration into the world of Unicode

Q?

U+0004

END OF TRANSMISSION

🌐 http://skybert.net/talks

torstein@escenic.com

🐦 @torsteinkrause