An adventure into the world of character sets and encodings
I’ve been a programmer for 16 years and there are two problems that always haunt me.
Date conversion
Character encoding
Keyboard maniac at Escenic making a GREAT Content Management System for the media industry
Customers all over the world: From the Daily Mirror in the UK, Thai PBS in ประเทศไทย to Dinamani in இந்தியா
Best regards
Viele Grüße
Beste ønsker
Or just
Know
to get your brain cells going
“Alpha and Ω” in a database with ISO 8859-1?
instead of letters, it’s because …
will æ, ø and å be written correctly?
Kjører
Crash course in
American Standard Code for Information Interchange
Character | Decimal | Binary |
---|---|---|
A | 65 | 1 0 0 0 0 0 1 |
B | 66 | 1 0 0 0 0 1 0 |
Character | Decimal | Binary |
---|---|---|
a | 97 | 1 1 0 0 0 0 1 |
b | 98 | 1 1 0 0 0 1 0 |
Value for upper case letter + 32 = value for the lower case letter. Brilliant!
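The +32 trick can be tried out directly. Below is a minimal sketch; the class and method names (`AsciiCase`, `toLower`, `toUpper`) are made up for illustration and it only handles the plain ASCII letters A-Z and a-z.

```java
class AsciiCase {
    // Distance between an upper case ASCII letter and its lower case twin.
    // 32 is binary 0100000, so the two letters differ in exactly one bit.
    static final int CASE_OFFSET = 32;

    static char toLower(char upper) {
        // Only 'A'..'Z' are shifted; anything else is returned unchanged.
        return (upper >= 'A' && upper <= 'Z') ? (char) (upper + CASE_OFFSET) : upper;
    }

    static char toUpper(char lower) {
        // Clearing the 32-bit is the same as subtracting 32 for 'a'..'z'.
        return (lower >= 'a' && lower <= 'z') ? (char) (lower & ~CASE_OFFSET) : lower;
    }

    public static void main(String[] args) {
        System.out.println(toLower('A'));                // a
        System.out.println(Integer.toBinaryString('A')); // 1000001
        System.out.println(Integer.toBinaryString('a')); // 1100001
    }
}
```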
Need for new characters that didn’t exist
→ 0 1 0 0 0 0 0 1
with 8 bits:
你好嗎?
Unicode is a character set:
A defined list of characters recognized by the computer hardware and software. Each character is represented by a number.
Character | Code point | Name |
---|---|---|
Ω | 937 | GREEK CAPITAL LETTER OMEGA |
Å | 197 | LATIN CAPITAL LETTER A WITH RING ABOVE |
Also known as the code point of a character.
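The rows in the table above can be looked up from Java itself. This is a small sketch using `Character.getName` (available since Java 7); the class name `CodePointNames` and the `describe` helper are assumptions for the example.

```java
class CodePointNames {
    // Formats a code point as "U+XXXX NAME".
    // Character.getName returns null for unassigned code points.
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(937)); // U+03A9 GREEK CAPITAL LETTER OMEGA
        System.out.println(describe(197)); // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
    }
}
```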
For instance “~” (tilde)?
String c = "~"; int unicodeCodepoint = (int) c.charAt(0);
String s = "🐘"; int unicodeCodepoint = Character.codePointAt(s, 0);
var c = "~"; var unicodeCodepoint = c.codePointAt(0);
http://en.wikipedia.org/wiki/~
M-x describe-char
Only a few key presses away on any Unix:
$ man ascii
$ man charset
$ man unicode
$ man utf-8
$ man iconv
… a character encoding is used to represent a repertoire of characters by some kind of an encoding system
Bytes | Bit pattern |
---|---|
1 | 0xxxxxxx |
2 | 110xxxxx 10xxxxxx |
3 | 1110xxxx 10xxxxxx 10xxxxxx |
4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Unicode is a table with numeric values and names for all characters in the whole wide world.
UTF-8 is one of several ways to encode a Unicode numeric value to bytes.
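To make the step from numeric value to bytes concrete, here is a rough sketch that fills the UTF-8 bit patterns by hand and compares the result with what the JDK produces. The class name `Utf8ByHand` is made up, and the sketch skips validation (surrogates and values above U+10FFFF are not rejected), so treat it as an illustration rather than a real encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    // Encodes one Unicode code point into 1 to 4 UTF-8 bytes.
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {    // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {  // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        int omega = 0x3A9; // 937, GREEK CAPITAL LETTER OMEGA
        byte[] byHand = encode(omega);
        byte[] byJdk = new String(Character.toChars(omega)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(byHand, byJdk)); // true
        for (byte b : byHand) {
            System.out.printf("%02x ", b); // ce a9
        }
        System.out.println();
    }
}
```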
HTML, HTTP & friends
Java’s internal representation of strings is UTF-16.
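One consequence of UTF-16 is that code points above U+FFFF take two `char` values, a so-called surrogate pair. The sketch below does the arithmetic by hand for the ghost (U+1F47B) and checks it against `Character.highSurrogate` and `Character.lowSurrogate`. The class name `SurrogatePairs` and its helpers are hypothetical, and `split` assumes its argument really is above U+FFFF.

```java
class SurrogatePairs {
    // Splits a supplementary code point (above U+FFFF) into two UTF-16 code units.
    static char[] split(int codePoint) {
        int offset = codePoint - 0x10000;              // 20 bits remain
        char high = (char) (0xD800 + (offset >> 10));  // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    // Renders every UTF-16 code unit of a code point as a backslash-u escape.
    static String escape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04x", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int ghost = 0x1F47B; // GHOST
        char[] units = split(ghost);
        System.out.println(units[0] == Character.highSurrogate(ghost)); // true
        System.out.println(units[1] == Character.lowSurrogate(ghost));  // true
        System.out.println(escape(ghost));
    }
}
```

The last line prints the two escapes `\ud83d\udc7b`, one per UTF-16 code unit.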
final String name = getNameFromFacebook(id);
$ native2ascii -encoding utf8 ghost-text-utf8 ghost-text.properties
$ cat ghost-text-utf8
ghost_title=This is a 👻
$ native2ascii -encoding utf8 ghost-text-utf8
ghost_title=This is a \ud83d\udc7b
with the Unicode escapes
.properties files if you do:
public String getPropertyFromUTF8File(final String pKey)
    throws IOException {
  ResourceBundle bundle = ResourceBundle.getBundle(
      "ghost-text-utf8", Locale.ENGLISH);
  String value = bundle.getString(pKey);
  // The bundle was read as ISO-8859-1; turn it back into bytes and decode as UTF-8
  return new String(value.getBytes("ISO-8859-1"), "UTF-8");
}
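The recoding step above only works as long as the bundle really was read as ISO-8859-1. As far as I know, Java 9 and later read property resource bundles as UTF-8 by default, so it is worth re-checking there. A possible alternative, sketched here under the assumption that you can get at the raw bytes of the file yourself, is to hand `Properties.load` a `Reader` with an explicit charset. The class name `Utf8Properties` and the `parse` helper are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class Utf8Properties {
    // Parses .properties content that is stored as UTF-8 bytes.
    static Properties parse(byte[] utf8Bytes) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            // Properties.load(Reader) uses the reader's charset, not ISO-8859-1.
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // U+1F47B GHOST, built from its code point to stay independent of source encoding
        String ghost = new String(Character.toChars(0x1F47B));
        byte[] file = ("ghost_title=This is a " + ghost + "\n").getBytes(StandardCharsets.UTF_8);
        System.out.println(parse(file).getProperty("ghost_title"));
    }
}
```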
@Test
public void ghostIsOneCodeUnit() {
  final String ghost = "👻";
  assertEquals("ghost is just one character", 1, ghost.length());
}
Failed tests: ghostIsOneCodeUnit(GhostLengthFailingTest):
ghost is just one character expected:<1> but was:<4>
Use String#codePointCount(from, to):
@Test
public void ghostIsOneCodePoint() {
  final String ghost = "👻";
  assertEquals(
    "ghost is just one character",
    1,
    ghost.codePointCount(0, ghost.length())
  );
}
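It can help to see the different ways of counting side by side. A minimal sketch, assuming the string has been decoded correctly; the class name `GhostCounts` and its helpers are made up for the example. `String#length()` counts UTF-16 code units, `codePointCount` counts Unicode code points, and the byte count depends on the encoding you pick.

```java
import java.nio.charset.StandardCharsets;

class GhostCounts {
    // U+1F47B GHOST, built from its code point to stay independent of source encoding
    static final String GHOST = new String(Character.toChars(0x1F47B));

    static int codeUnits(String s) {
        return s.length(); // UTF-16 code units
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length()); // Unicode code points
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length; // bytes on the wire or on disk
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(GHOST));  // 2
        System.out.println(codePoints(GHOST)); // 1
        System.out.println(utf8Bytes(GHOST));  // 4
    }
}
```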
Number of visible glyphs on the screen:
BreakIterator.getCharacterInstance();
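A sketch of how that iterator might be used to count what the user perceives as characters. The class name `GraphemeCount` is an assumption, and the exact boundaries for more exotic sequences (emoji with modifiers, for instance) may vary with the JDK version, so the example sticks to a plain letter followed by a combining accent.

```java
import java.text.BreakIterator;

class GraphemeCount {
    // Counts user-perceived characters by walking the character boundaries.
    static int count(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT: two chars, one visible glyph
        String eAcute = "e" + (char) 0x0301;
        System.out.println(eAcute.length()); // 2
        System.out.println(count(eAcute));   // 1
    }
}
```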
Ever seen this one?
[WARNING] File encoding has not been set, using platform encoding
UTF-8, i.e. build is platform dependent!
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
A ♥ looks so much better than \u2665
Your editor can specify the encoding when it writes the file to disk, burning a mark in it using a BOM (byte order mark).
An example of a UTF-8 encoded file without BOM
An example of a UTF-8 encoded file with BOM
if you’re counting bytes
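For UTF-8 the BOM is the three bytes EF BB BF at the very start of the file. To my knowledge Java's UTF-8 decoder does not drop them but hands you U+FEFF as the first character, so a reader that cares has to skip them itself. A minimal sketch, with the class name `Utf8Bom` and its helpers invented for the example:

```java
import java.nio.charset.StandardCharsets;

class Utf8Bom {
    // The UTF-8 encoding of U+FEFF, used as a byte order mark.
    static final byte[] BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    static boolean hasBom(byte[] data) {
        return data.length >= BOM.length
            && data[0] == BOM[0] && data[1] == BOM[1] && data[2] == BOM[2];
    }

    // Decodes UTF-8 bytes, skipping a leading BOM if there is one.
    static String decode(byte[] data) {
        int offset = hasBom(data) ? BOM.length : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        System.out.println(withBom.length);           // 5 bytes on disk
        System.out.println(decode(withBom).length()); // 2 characters of content
    }
}
```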
When we surf on facebook.com or write Java code that consumes REST, RPC over HTTP and SOAP services, the server says which encoding the contents are serialised with:
$ GET http://vg.no
..
Content-Type: text/html; charset=iso-8859-1
Why does it say charset?
Note: This use of the term “character set” is more commonly referred to as a “character encoding.” However, since HTTP and MIME share the same registry, it is important that the terminology also be shared.
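On the client side, that `charset` parameter is what you would feed to your decoder. The following is a simplified sketch of pulling it out of a `Content-Type` value; the class name `ContentTypeCharset` and the fallback parameter are assumptions, and a real HTTP client library will typically do this parsing more carefully.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    // Returns the charset named in a Content-Type header value,
    // or the fallback if none is given or the name is unknown.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("text/html; charset=iso-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs); // ISO-8859-1
    }
}
```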
Inside the HTML file itself, it’s also important that the encoding is correct so that the content is displayed properly in the web browser:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
In addition to how the data are stored and serialised it’s also important that we take care of things on our side:
→ Fonts
-Dsun.jnu.encoding=utf-8
-Dfile.encoding=utf-8
jdbc:jtds:sybase://db01:4100/mydb?characterEncoding=utf8
jdbc:mysql://db01:3306/mydb?autoReconnect=true&\
useUnicode=true&\
characterEncoding=UTF-8&\
characterSetResults=UTF-8
$ locale -a | grep en_GB.utf8
$ export LC_ALL=en_GB.utf8
$ export LANG=en_GB.utf8
Myth: The encoding of the Java file decides how the data that are written by this Java component are written to the database.
The file encoding only decides how the characters in the Java file itself are stored and displayed:
/**
 * @author Søren Westergård
 */
final static String PRODUKT = "UFØ";
Data encoding, on the other hand, decides how the data (which the Java program writes or reads) are read and written:
new OutputStreamWriter(out, "UTF-8");
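To see that the data encoding is chosen at the point of writing, here is a small sketch that pushes the same string through an `OutputStreamWriter` with two different charsets. The class name `DataEncoding` and the `write` helper are made up; the string is the `"UFØ"` constant from the slide above, spelled with its code point so the example does not depend on the source file encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DataEncoding {
    // Writes text through an OutputStreamWriter and returns the resulting bytes.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // writer is closed (and flushed) at this point
    }

    public static void main(String[] args) {
        String produkt = "UF" + (char) 0x00D8; // U+00D8 LATIN CAPITAL LETTER O WITH STROKE
        System.out.println(write(produkt, StandardCharsets.UTF_8).length);      // 4
        System.out.println(write(produkt, StandardCharsets.ISO_8859_1).length); // 3
    }
}
```

Same string, same Java file, different bytes: the charset passed to the writer is what matters.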
Myth: The encoding inside system X affects the data it sends out and how our system saves these data in our system.
It’s irrelevant that system X stores its data internally as Windows 1252 if the web services through which it exposes these data return XML encoded as UTF-8.
<img src="different-encodings.svg" alt="different encodings"/>
User (ok!) → Flex (ok!) → BlazeDS (ok!) → Java (ok!) → Database (BANG!)
What’s wrong with this statement?
mysql> create database mydb
character set utf8
collate utf8_general_ci;
The MySQL utf8 table & column encoding is not real UTF-8.
Only 1-3 byte characters are supported.
For full UTF-8 support, use the utf8mb4 character set instead.
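Characters that need four bytes in UTF-8 are exactly the code points above U+FFFF. A rough way to check on the Java side whether a value would fit in a three-byte column is sketched below; the class name `MySqlUtf8Check` and the method name are hypothetical, and this only looks at the string, not at what your driver or server actually does with it.

```java
class MySqlUtf8Check {
    // True if every code point fits in at most 3 UTF-8 bytes, i.e. is at most U+FFFF.
    static boolean fitsInThreeByteUtf8(String s) {
        return s.codePoints().noneMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String omega = "Alpha and " + (char) 0x03A9;           // GREEK CAPITAL LETTER OMEGA
        String ghost = new String(Character.toChars(0x1F47B)); // GHOST
        System.out.println(fitsInThreeByteUtf8(omega)); // true
        System.out.println(fitsInThreeByteUtf8(ghost)); // false
    }
}
```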
This will fix obscure errors like:
Default collation in MySQL before 8.0 is “Swedish Latin 1” (latin1_swedish_ci).
again
“Alpha and Ω” in a database with ISO 8859-1?
instead of letters, it’s because …
will æ, ø and å be written correctly?
Kjører
for the impatient
END OF TRANSMISSION