Fun with Unicode

I’m on IRC with a coworker in Israel. While my Hebrew is pretty poor, I have attempted to send him short phrases like:

בוקר טוב

And of course


However, his responses are getting garbled. I blame his IRC client (mIRC), since it is running on Windows, but I’d like to confirm that this is the source of our problem. Eventually I’ll sniff the packets on the wire, but what should I look for?

I took the word שלם and dumped it into OpenOffice, then saved the file as plain text. A hex dump of it looks like this:

hexdump -C ~/Documents/shalom.txt
00000000 ef bb bf d7 a9 d7 9c d7 9d 0a |..........|

Just the letter ש looks like this:

hexdump -C ~/Documents/shin.txt
00000000 ef bb bf d7 a9 0a |......|

A file with just the letter ‘a’ in it looks like this:

hexdump -C ~/Documents/a.txt
00000000 ef bb bf 61 0a |...a.|

To check the file type of these:

file ~/Documents/a.txt
/home/ayoung/Documents/a.txt: UTF-8 Unicode (with BOM) text
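
The three dumps above can be reproduced without OpenOffice. What follows is a minimal Python sketch, assuming the editor wrote UTF-8 with a byte order mark followed by a single newline, which is what the dumps show. The helper name saved_bytes is made up for illustration; Python's "utf-8-sig" codec is what prepends the BOM.

```python
def saved_bytes(text: str) -> bytes:
    """Bytes of a one-line text file saved as UTF-8 with a BOM (hypothetical helper)."""
    # The "utf-8-sig" codec prepends the BOM (ef bb bf) when encoding.
    return (text + "\n").encode("utf-8-sig")


if __name__ == "__main__":
    for word in ("שלם", "ש", "a"):
        data = saved_bytes(word)
        # Print the byte count and the bytes in the same hex style as hexdump.
        print(len(data), data.hex(" "))
```

Run as a script, this prints lengths of 10, 6 and 5 bytes, matching the three hexdump outputs: every file starts with the same three BOM bytes, each Hebrew letter adds two bytes, and ‘a’ adds one.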

The value 61 is the ASCII code for ‘a’, so we know where to look. Note that this file is one byte shorter than the file containing just ש. This aligns with UTF-8: the letter ‘a’ is encoded as a single byte identical to its ASCII value, whereas each Hebrew character uses UTF-8’s multi-byte encoding, here two bytes per letter. I’ll translate this raw binary back to the Unicode value:

Shin dumps to d7 a9, which in binary is:

D7 A9 = 1101 0111 1010 1001

The leading 110 means this is a two-byte sequence, which covers the range U+0080 to U+07FF. The Unicode value is recovered from the pattern 110yyyxx 10xxxxxx.

So yyy = 101 and xx xxxxxx = 11 101001.

Note the yyy: it is only three bits wide, so its maximum value is 7. Left-pad it with zeros to get the high byte 0000 0101.

The rest becomes the low byte 1110 1001. Together these give the hex value 05E9, that is U+05E9, which is in the Hebrew range U+0590 to U+05FF as expected.
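
The bit manipulation above can be checked with a short Python sketch. It handles only a two-byte sequence and does not reject every invalid input (overlong encodings, for example), so treat it as an illustration and not a full decoder. The function name decode_two_byte is invented for this example; the result is compared against Python's built-in UTF-8 decoder.

```python
def decode_two_byte(b1: int, b2: int) -> int:
    """Decode one two-byte UTF-8 sequence (110yyyxx 10xxxxxx) to a code point."""
    if b1 & 0b1110_0000 != 0b1100_0000:
        raise ValueError("first byte does not start with 110")
    if b2 & 0b1100_0000 != 0b1000_0000:
        raise ValueError("second byte does not start with 10")
    high = b1 & 0b0001_1111  # the five payload bits yyyxx
    low = b2 & 0b0011_1111   # the six payload bits xxxxxx
    return (high << 6) | low


if __name__ == "__main__":
    code_point = decode_two_byte(0xD7, 0xA9)
    print(f"U+{code_point:04X}")  # U+05E9
    # Cross-check against the standard library decoder.
    assert chr(code_point) == bytes([0xD7, 0xA9]).decode("utf-8")
```

Masking off the 110 and 10 prefixes and shifting the first byte's payload left by six bits is the same operation as splitting into yyy and xx xxxxxx by hand; it just keeps the eleven payload bits together as one number.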
