Fun with Unicode

I’m on IRC with a coworker in Israel. While my Hebrew is pretty poor, I have tried sending him short phrases like:

בוקר טוב

And of course

שלם

However, his responses are getting garbled. I blame his IRC client (mIRC), since it runs on Windows, but I’d like to confirm that this is the problem. Eventually I’ll sniff the packets on the wire, but what should I look for?

I took the word שלם and pasted it into OpenOffice, then saved the file as plain text. A hex dump of it looks like this:

hexdump -C ~/Documents/shalom.txt
00000000 ef bb bf d7 a9 d7 9c d7 9d 0a |..........|

Just the letter ש looks like this:

hexdump -C ~/Documents/shin.txt
00000000 ef bb bf d7 a9 0a |......|

A file with just the letter ‘a’ in it looks like this:

hexdump -C ~/Documents/a.txt
00000000 ef bb bf 61 0a |...a.|

To check the file type of these:

file ~/Documents/a.txt
/home/ayoung/Documents/a.txt: UTF-8 Unicode (with BOM) text

The value 61 is the ASCII value for ‘a’, so we know where to look. The leading ef bb bf in every file is the byte order mark (BOM) that file reports. Note that this file is one byte shorter than the one containing just ש. This is consistent with UTF-8: the letter ‘a’ maps directly to its single-byte ASCII value, whereas each Hebrew character is encoded as a two-byte sequence. I’ll translate the raw binary back to the Unicode value:

Shin dumps to d7 a9, which in binary is:

D7A9 = 1101 0111 1010 1001

The leading bits 110 of the first byte mean this is a two-byte sequence, which covers the range U+0080 to U+07FF. The bytes are built from the Unicode value by the pattern 110yyyxx 10xxxxxx.

So yyy= 101 and xx xxxxxx = 11 101001

Note the yyy. It is only three bits wide, so the most it can be is 7. Left-pad it with zeros to get the high byte 0000 0101.

The remaining bits become the low byte 1110 1001. Together that is the hex value 05E9, or U+05E9, which is in the Hebrew range U+0590 to U+05FF as expected.
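The bit shuffling above can be sketched in a few lines of C++. This is only an illustration: the function names (has_utf8_bom, utf8_sequence_length, decode_utf8) are made up for this post, and the decoder stops at the first malformed sequence rather than validating overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns true when the buffer begins with the UTF-8 byte order mark ef bb bf.
bool has_utf8_bom(const std::string& bytes) {
    return bytes.compare(0, 3, "\xEF\xBB\xBF") == 0;
}

// Number of bytes in the sequence introduced by this lead byte, or 0 if the
// byte cannot start a sequence (for example a 10xxxxxx continuation byte).
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110yyyxx: U+0080..U+07FF
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: U+0800..U+FFFF
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: U+10000 and above
    return 0;
}

// Decode UTF-8 bytes into code points, skipping a leading BOM if present.
// Sketch only: decoding stops at the first malformed or truncated sequence.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = has_utf8_bom(bytes) ? 3 : 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        const std::size_t len = utf8_sequence_length(lead);
        if (len == 0 || i + len > bytes.size()) break;
        assert(len >= 1 && len <= 4);
        // Payload bits of the lead byte: all 7 for ASCII, else the low 5, 4 or 3.
        std::uint32_t cp = (len == 1) ? lead : (lead & (0xFFu >> (len + 1)));
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char trail = static_cast<unsigned char>(bytes[i + k]);
            if ((trail & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (trail & 0x3F);  // append six payload bits
        }
        if (!ok) break;
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Feeding it the ten bytes from shalom.txt should give back four code points: U+05E9, U+05DC, U+05DD and the trailing newline U+000A. The same lead-byte test is what I would look for on the wire: Hebrew letters sent as UTF-8 show up as d7 followed by a byte in the 80 to bf range.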

C++ optimization for string16

Since wchar_t is 32 bits on Linux, I need to transform wstring to a different type in order to call the ODBC functions. The Windows code, on the other hand, can just use wstring’s c_str() function to access the internal representation of the string. My goal is to minimize the code differences between platforms. On Linux, I will create a new class for string16 (see earlier post). On Windows, I will just typedef wstring to string16. My hope then is to get the OS-specific code down to the Linux implementation of string16, and to have the typedef optimize away the differences.
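To make the Linux side concrete, here is a rough sketch of the kind of conversion involved. It is not the string16 class from the earlier post: it uses C++11’s std::u16string as a stand-in (named string16_sketch here), the helper name to_utf16 is made up, and it assumes each wchar_t holds one Unicode code point, as it does with g++ on Linux.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-in for a 16-bit string type; the real string16 class is not shown here.
typedef std::u16string string16_sketch;

// Convert a wide string holding one code point per wchar_t into UTF-16.
string16_sketch to_utf16(const std::wstring& wide) {
    string16_sketch out;
    out.reserve(wide.size());
    for (std::wstring::size_type i = 0; i < wide.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(wide[i]);
        if (cp < 0x10000) {
            // Fits in a single 16-bit unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Above U+FFFF: split into a high and a low surrogate.
            assert(cp <= 0x10FFFF);
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The data() pointer of the result could then be cast to whatever 16-bit character type the ODBC headers expect, which is an assumption about the driver manager in use. The extra copy is exactly the cost that the typedef approach avoids on Windows.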

Here is a simplified version of the code that will be built and run on Windows:

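#include <iostream>
#include <string>
using namespace std;

// On Windows the SQL string type is just a reference to the wstring.
typedef wstring& sqlstring;

// Goes through the typedef'd reference before printing.
void dothing(wstring s){
    sqlstring sql(s);
    wcout << sql << endl;
}

// Prints the wstring directly, for comparison.
void doanother(wstring s){
    wcout << s << endl;
}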

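And here is the result of building with g++ and disassembling. Note that the code for the two functions is the same; only the relative offsets in the call and jump instructions differ. Hopefully the Windows C++ compiler will behave the same.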

0000000000400ae0 <_Z9doanotherSbIwSt11char_traitsIwESaIwEE>:
400ae0: 48 83 ec 08 sub $0x8,%rsp
400ae4: 48 89 fe mov %rdi,%rsi
400ae7: bf c0 12 60 00 mov $0x6012c0,%edi
400aec: e8 07 fe ff ff callq 4008f8 <_ZStlsIwSt11char_traitsIwESaIwEERSt13basic_ostreamIT_T0_ES7_RKSbIS4_S5_T1_E@plt>
400af1: 48 83 c4 08 add $0x8,%rsp
400af5: 48 89 c7 mov %rax,%rdi
400af8: e9 0b fe ff ff jmpq 400908 <_ZSt4endlIwSt11char_traitsIwEERSt13basic_ostreamIT_T0_ES6_@plt>
400afd: 90 nop
400afe: 66 90 xchg %ax,%ax

0000000000400b00 <_Z7dothingSbIwSt11char_traitsIwESaIwEE>:
400b00: 48 83 ec 08 sub $0x8,%rsp
400b04: 48 89 fe mov %rdi,%rsi
400b07: bf c0 12 60 00 mov $0x6012c0,%edi
400b0c: e8 e7 fd ff ff callq 4008f8 <_ZStlsIwSt11char_traitsIwESaIwEERSt13basic_ostreamIT_T0_ES7_RKSbIS4_S5_T1_E@plt>
400b11: 48 83 c4 08 add $0x8,%rsp
400b15: 48 89 c7 mov %rax,%rdi
400b18: e9 eb fd ff ff jmpq 400908 <_ZSt4endlIwSt11char_traitsIwEERSt13basic_ostreamIT_T0_ES6_@plt>
400b1d: 90 nop
400b1e: 66 90 xchg %ax,%ax