I’m on IRC with a coworker in Israel. While my Hebrew is pretty poor, I have attempted to send him short phrases like:
בוקר טוב
And of course
שלם
However, his responses are getting garbled. I blame his IRC program (mIRC), since it is running on Windows, but I’d like to confirm that this is our problem. Eventually I’ll sniff the packets on the wire, but what should I look for?
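One thing to look for on the wire is which bytes a Hebrew word turns into. The sketch below is only a guess at the failure mode: it assumes my side sends UTF-8 and that the Windows client might be using the legacy Hebrew code page Windows-1255 (`cp1255` in Python), which I have not confirmed for mIRC. The helper name `is_valid_utf8` is made up for this example.

```python
# Compare the bytes a Hebrew word produces under two encodings.
# cp1255 is an ASSUMPTION about the other client, not a confirmed setting.
word = "\u05e9\u05dc\u05dd"  # shin, lamed, final mem

utf8_bytes = word.encode("utf-8")
cp1255_bytes = word.encode("cp1255")

print(utf8_bytes.hex(" "))    # d7 a9 d7 9c d7 9d
print(cp1255_bytes.hex(" "))  # f9 ec ed


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(cp1255_bytes))  # False
```

If the captured packets show two bytes per Hebrew letter, each pair starting with d7, the sender is using UTF-8. One byte per letter in the e0 to fa range would point to a single-byte Hebrew code page instead.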
I took the word שלם and dumped it into OpenOffice, and then saved the file as plain text. A hex dump of it looks like this:
hexdump -C ~/Documents/shalom.txt
00000000 ef bb bf d7 a9 d7 9c d7 9d 0a |..........|
Just the letter ש looks like this:
hexdump -C ~/Documents/shin.txt
00000000 ef bb bf d7 a9 0a |......|
A file with just the letter ‘a’ in it looks like this:
hexdump -C ~/Documents/a.txt
00000000 ef bb bf 61 0a |...a.|
To check the file type of these:
file ~/Documents/a.txt
/home/ayoung/Documents/a.txt: UTF-8 Unicode (with BOM) text
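The three dumps can be reproduced without OpenOffice. This is a minimal sketch, assuming Python 3.8 or later and its standard `utf-8-sig` codec, which prepends the byte order mark (ef bb bf) when encoding. The function name `dump` is my own.

```python
def dump(text: str) -> str:
    """Return text plus a newline, encoded as UTF-8 with a BOM, as spaced hex."""
    data = (text + "\n").encode("utf-8-sig")  # utf-8-sig adds ef bb bf up front
    return data.hex(" ")


print(dump("\u05e9\u05dc\u05dd"))  # ef bb bf d7 a9 d7 9c d7 9d 0a
print(dump("\u05e9"))              # ef bb bf d7 a9 0a
print(dump("a"))                   # ef bb bf 61 0a
```

The first three bytes are the BOM that `file` reports, and the trailing 0a is the newline.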
The value 61 is the ASCII value for ‘a’, so we know where to look. Note that this file is one byte shorter than the file containing just ש. This aligns with UTF-8: the letter ‘a’ maps directly to its one-byte ASCII value, whereas each Hebrew character uses UTF-8’s two-byte encoding. I’ll translate this raw binary back to the Unicode value:
Shin dumps to d7 a9:
D7 A9 = 1101 0111 1010 1001
The leading 110 means this is a two-byte sequence, which covers the range U+0080 to U+07FF. The Unicode value is packed into the pattern 110yyyxx 10xxxxxx.
So yyy = 101 and xx xxxxxx = 11 101001
Note the yyy. It is only three bits, so the most it can be is 7. Left-pad it with zeros to get 0000 0101.
The rest becomes 1110 1001. Together this is the hex value 05E9, that is, U+05E9, which is in the Hebrew range U+0590 to U+05FF, as expected.
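The same bit shuffling can be checked in code. This is a minimal sketch in Python 3 that handles only the two-byte pattern discussed here; the function name `decode_two_byte` is my own, and Python’s built-in decoder is used at the end as a cross-check.

```python
def decode_two_byte(b1: int, b2: int) -> int:
    """Decode a two-byte UTF-8 sequence (110yyyxx 10xxxxxx) to a code point."""
    if b1 & 0b11100000 != 0b11000000:
        raise ValueError("first byte does not start with 110")
    if b2 & 0b11000000 != 0b10000000:
        raise ValueError("second byte does not start with 10")
    high = b1 & 0b00011111  # the five payload bits yyyxx
    low = b2 & 0b00111111   # the six payload bits xxxxxx
    return (high << 6) | low


cp = decode_two_byte(0xD7, 0xA9)
print(f"U+{cp:04X}")                                   # U+05E9
print(0x0590 <= cp <= 0x05FF)                          # True
print(bytes([0xD7, 0xA9]).decode("utf-8") == chr(cp))  # True
```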