Summary: | ORB CodeSet negotiate bug | ||
---|---|---|---|
Product: | TAO | Reporter: | Jiang Wei <jiangwei_1976> |
Component: | ORB | Assignee: | Phil Mesnier <mesnierp> |
Status: | ASSIGNED --- | ||
Severity: | enhancement | ||
Priority: | P4 | ||
Version: | 1.4.6 | ||
Hardware: | All | ||
OS: | All | ||
Attachments: |
Codeset lib for GB2312<==>UTF-8 using iconv
[ BUG Fixed ] Codeset lib for GB2312<==>UTF-8 using iconv Updated to work with TAO 1.5.7 |
Description
Jiang Wei
2005-08-26 07:15:26 CDT
X on CentOS 4.1, Y on Windows 2000

Hi,

First, the codeset negotiation feature of TAO was substantially changed after 1.4.6. In version 1.4.7, this code was pulled from the main TAO library and put in a new library, libTAO_Codeset, which also includes a reference translator for char data from UTF-8 to Latin_1 (ISO8859-1).

I'm not familiar with GB2312, but based on the hints given in your comment, I guess that GB2312 is compatible with UTF-8. I also guess that you have a codeset translator to facilitate that compatibility.

I'm surprised that you are seeing two different TCS values. Computation of the TCS is a function of the client process. The server advertises its NCS and any CCS in its profile, then the client computes the TCS by considering the server's codesets and its own. The client then informs the server about the chosen TCS with a service context attached to the first request. Therefore both sides should use the same TCS.

I would like to see your translator(s) and service config files. Also, if you would run both processes with -ORBDebugLevel 10, then send me the output of both up to just after sending/receiving the first GIOP message.
Thanks,
Phil

X's svc.conf

dynamic Char_UTF8_GB2312_Factory Service_Object * UTF8_GB2312:_make_Char_UTF8_GB2312_Factory ()
static Resource_Factory "-ORBNativeCharCodeSet UTF-8 -ORBNativeWcharCodeSet UTF-16 -ORBCharCodesetTranslator Char_UTF8_GB2312_Factory"

Y's svc.conf

dynamic Char_GB2312_UTF8_Factory Service_Object * GB2312_UTF8:_make_Char_GB2312_UTF8_Factory ()
static Resource_Factory "-ORBNativeCharCodeSet GB2312 -ORBNativeWcharCodeSet UTF-16 -ORBCharCodesetTranslator Char_GB2312_UTF8_Factory"

#include "Char_GB2312_UTF8_Translator.h"
#include "ace/OS_Memory.h"
#include "ace/OS_NS_string.h"

GB2312_UTF8::GB2312_UTF8 (void)
{
  gu = iconv_open("UTF-8","GB2312");
  ug = iconv_open("GB2312","UTF-8");
}

GB2312_UTF8::~GB2312_UTF8 (void)
{
  iconv_close(gu);
  iconv_close(ug);
}

ACE_CDR::Boolean
GB2312_UTF8::read_char (ACE_InputCDR &in, ACE_CDR::Char &x)
{
  if (this->read_1 (in, reinterpret_cast<ACE_CDR::Octet*> (&x)))
    return true;
  return 0;
}

ACE_CDR::Boolean
GB2312_UTF8::read_string (ACE_InputCDR& in, ACE_CDR::Char *& x)
{
  ACE_CDR::ULong str_len,gb_len;
  in.read_ulong (str_len);
  if (str_len == 0)
    return false;
  gb_len = str_len * 2 ;
  ACE_CDR::Char *ubuf ,*gbuf;
  ACE_NEW_RETURN (gbuf,ACE_CDR::Char[gb_len],0);
  ubuf = gbuf + str_len;
  if (!this->read_char_array (in, ubuf, str_len))
  {
    delete [] gbuf;
    return false;
  }
  ACE_CDR::Char * d = gbuf;
  size_t r = iconv(ug,&ubuf,&str_len,&d,&gb_len);
  if (r == (size_t)(-1))
  {
    delete [] gbuf ;
    return false;
  }
  if(gb_len)
  {
    d[0] = 0;
  }
  else
  {
    delete [] gbuf;
    return false;
  }
  x = gbuf;
  return true;
}

ACE_CDR::Boolean
GB2312_UTF8::read_char_array (ACE_InputCDR& in, ACE_CDR::Char* x, ACE_CDR::ULong len)
{
  if (this->read_array (in, x, ACE_CDR::OCTET_SIZE, ACE_CDR::OCTET_ALIGN, len))
  {
    return 1;
  }
  return 0;
}

ACE_CDR::Boolean
GB2312_UTF8::write_char (ACE_OutputCDR& out, ACE_CDR::Char x)
{
  return this->write_1 (out, reinterpret_cast<const ACE_CDR::Octet*> (&x));
}

ACE_CDR::Boolean
GB2312_UTF8::write_string (ACE_OutputCDR& out, ACE_CDR::ULong len, const
ACE_CDR::Char* x)
{
  ACE_CDR::Char * ubuf;
  ACE_CDR::ULong ub_len = len * 2;
  ACE_NEW_RETURN (ubuf,ACE_CDR::Char[ub_len],0);
  ACE_CDR::Char * src = ubuf + len;//const_cast<ACE_CDR::Char*>(x);
  ACE_OS::memcpy (src,x,len);
  ACE_CDR::Char * dst = ubuf;
  size_t r = iconv(gu,&src,&len,&dst,&ub_len);
  size_t real_len = dst - ubuf + 1;
  if (r != (size_t)(-1) && out.write_ulong (real_len))
  {
    dst[0] = 0;
    ACE_CDR::Char * buf;
    if (this->adjust (out, real_len, 1, buf))
    {
      this->good_bit(out, 0);
      delete [] ubuf;
      return false;
    }
    ACE_OS::memcpy (buf,ubuf,real_len);
    delete [] ubuf;
    return true;
  }
  delete [] ubuf;
  return false;
}

ACE_CDR::Boolean
GB2312_UTF8::write_char_array (ACE_OutputCDR& out, const ACE_CDR::Char* x, ACE_CDR::ULong len)
{
  char *buf;
  if (this->adjust (out, len, 1, buf) == 0)
  {
    ACE_OS::memcpy (buf, x, len);
    return 1;
  }
  this->good_bit(out, 0);
  return 0;
}

/***************************************************************************
***************************************************************************/

#include "Char_UTF8_GB2312_Translator.h"
#include "ace/OS_Memory.h"
#include "ace/OS_NS_string.h"

UTF8_GB2312::UTF8_GB2312 (void)
{
  gu = iconv_open("UTF-8","GB2312");
  ug = iconv_open("GB2312","UTF-8");
}

UTF8_GB2312::~UTF8_GB2312 (void)
{
  iconv_close(gu);
  iconv_close(ug);
}

ACE_CDR::Boolean
UTF8_GB2312::read_char (ACE_InputCDR &in, ACE_CDR::Char &x)
{
  if (this->read_1 (in, reinterpret_cast<ACE_CDR::Octet*> (&x)))
    return true;
  return 0;
}

ACE_CDR::Boolean
UTF8_GB2312::read_string (ACE_InputCDR& in, ACE_CDR::Char *& x)
{
  ACE_CDR::ULong from_len,to_len;
  in.read_ulong (from_len);
  if (from_len == 0)
    return false;
  to_len = from_len * 2 ;// large enough
  ACE_CDR::Char *to_buf ,*from_buf;
  ACE_NEW_RETURN (to_buf,ACE_CDR::Char[to_len],0);
  from_buf = to_buf + from_len;// share one block of memory to reduce allocations
  if (!this->read_char_array (in, from_buf, from_len))
  {
    delete [] to_buf;
    return false;
  }
  ACE_CDR::Char * dest = to_buf;
  size_t r = iconv(gu,&from_buf,&from_len,&dest,&to_len);//overwrite from_buf
  if (r == (size_t)(-1))
  {
    delete [] to_buf ;
    return false;
  }
  if(to_len)
  {// there should be plenty of space left
    dest[0] = 0;
  }
  else
  {
    delete [] to_buf;
    return false;
  }
  x = to_buf;// assign directly; trade wasted memory for performance
  return true;
}

ACE_CDR::Boolean
UTF8_GB2312::read_char_array (ACE_InputCDR& in, ACE_CDR::Char* x, ACE_CDR::ULong len)
{
  if (this->read_array (in, x, ACE_CDR::OCTET_SIZE, ACE_CDR::OCTET_ALIGN, len))
  {
    return 1;
  }
  return 0;
}

ACE_CDR::Boolean
UTF8_GB2312::write_char (ACE_OutputCDR& out, ACE_CDR::Char x)
{
  return this->write_1 (out, reinterpret_cast<const ACE_CDR::Octet*> (&x));
}

ACE_CDR::Boolean
UTF8_GB2312::write_string (ACE_OutputCDR& out, ACE_CDR::ULong len, const ACE_CDR::Char* x)
{
  ACE_CDR::Char * to_buf;
  ACE_CDR::ULong to_len = len * 2;// plenty
  ACE_NEW_RETURN (to_buf,ACE_CDR::Char[to_len],0);
  ACE_CDR::Char * src = to_buf + len;//const_cast<ACE_CDR::Char*>(x);
  ACE_OS::memcpy (src,x,len);
  ACE_CDR::Char * dst = to_buf;
  size_t r = iconv(ug,&src,&len,&dst,&to_len);
  size_t real_len = dst - to_buf + 1;// write one extra 0 (the NUL terminator)
  if (r != (size_t)(-1) && out.write_ulong (real_len))
  {
    dst[0] = 0;
    ACE_CDR::Char * buf;
    if (this->adjust (out, real_len, 1, buf))
    {
      this->good_bit(out, 0);
      delete [] to_buf;
      return false;
    }
    ACE_OS::memcpy (buf,to_buf,real_len);
    delete [] to_buf;
    return true;
  }
  delete [] to_buf;
  return false;
}

ACE_CDR::Boolean
UTF8_GB2312::write_char_array (ACE_OutputCDR& out, const ACE_CDR::Char* x, ACE_CDR::ULong len)
{
  char *buf;
  if (this->adjust (out, len, 1, buf) == 0)
  {
    ACE_OS::memcpy (buf, x, len);
    return 1;
  }
  this->good_bit(out, 0);
  return 0;
}

SUCCESS version:

  size_t real_len = dst - to_buf + 1;
  if (r != (size_t)(-1) && out.write_ulong (real_len))
  {
    dst[0] = 0;
--------------------------------------------------------
MARSHAL Exception version:

  size_t real_len = dst - to_buf;
  if (r != (size_t)(-1) && out.write_ulong (real_len))
  {
    //dst[0] = 0;

Ok. It looks like you found your solution.
What happens is that with your configuration, only one side is actually using a translator, since the other side gets to use its native codeset. The non-translated side assumes the marshaled string includes an embedded NUL. Your change to the write_string algorithm made it conform with this assumption.

I don't see anything that is necessarily wrong with TAO and the codeset negotiation feature. Let me know if you have other concerns, otherwise I'll close this bug.

Regards,
Phil

Jiang, can you please think of donating the code for GB2312 to TAO?

I'm resolving this bug as it appears the solution did not involve a change to TAO.

I think the translator code in this bug shows a nice way to use iconv as the basis for a more generic translator implementation.

I was writing a LwLog service. The server saves log strings in Berkeley DB using the UTF-8 codeset.

server svc.conf:

dynamic Char_UTF8_GB2312_Factory Service_Object * /usr/lib/acetao/libUTF8_GB2312.so:_make_Char_UTF8_GB2312_Factory ()
static Resource_Factory "-ORBNativeCharCodeSet UTF-8 -ORBCharCodesetTranslator Char_UTF8_GB2312_Factory -ORBNativeWcharCodeSet UTF-16"
-------------------------END SVC.CONF---------------------------------------

ClientAsvc.conf:

static Resource_Factory "-ORBNativeCharCodeSet GB2312 -ORBNativeWcharCodeSet UTF-16"
-------------------------END SVC.CONF---------------------------------------

ClientBsvc.conf:

dynamic Char_GB2312_UTF8_Factory Service_Object * GB2312_UTF8:_make_Char_GB2312_UTF8_Factory ()
static Resource_Factory "-ORBNativeCharCodeSet GB2312 -ORBCharCodesetTranslator Char_GB2312_UTF8_Factory -ORBNativeWcharCodeSet UTF-16"
-------------------------END SVC.CONF---------------------------------------

Running the client with ClientAsvc.conf: SUCCESS!
Running the client with ClientBsvc.conf: FAILURE! 'CORBA::MARSHAL'

I debugged Char_GB2312_UTF8_Translator.cpp.
In function "ACE_CDR::Boolean GB2312_UTF8::read_string (ACE_InputCDR& in, ACE_CDR::Char *& x)", I read the string from the ACE_InputCDR (line 1) and print it to the screen (line 4). It is a GB2312-encoded string! It seems the server has already converted the string from UTF-8 to GB2312, so the following call to iconv fails (line 9) and a MARSHAL exception is thrown.

//-----------------------
....
 1 if (!this->read_array (in,ubuf,ACE_CDR::OCTET_SIZE,ACE_CDR::OCTET_ALIGN,str_len))
 2   return false;
 3
 4 std::cout << ubuf << std::endl; /// FOR DEBUG.
 5
 6 #if defined (ACE_WIN32)
 7 size_t r = iconv(utf8_to_gb2312,(const char**)&ubuf,&str_len,&gbuf,&gb_len);
 8 #else
 9 size_t r = iconv(utf8_to_gb2312,&ubuf,&str_len,&gbuf,&gb_len);
10 #endif
11
12 if(r == (size_t)-1)
13   return false;
......

>Jiang, can you please think of donating the code for GB2312 to TAO?

It's a pleasure for me. The full source is at
http://us.f2.yahoofs.com/bc/40c29551_1852d/bc/codeset/codeset.tar.gz? bfDYYIDBKTF56DFC

Could you maybe add the code as an attachment to this bugzilla entry?

Created attachment 421 [details]
Codeset lib for GB2312<==>UTF-8 using iconv
Created attachment 428 [details]
[ BUG Fixed ] Codeset lib for GB2312<==>UTF-8 using iconv
Created attachment 716 [details]
Updated to work with TAO 1.5.7
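
For readers following the negotiation described earlier in this thread (the server advertises its NCS and CCS, and the client computes the TCS from both sides' codesets), here is a minimal sketch of that selection order as the CORBA specification describes it. This is not TAO's implementation: `CodesetInfo` and `choose_tcs` are hypothetical names, codesets are modeled as plain strings rather than registry IDs, and the spec's final fallback step (UTF-8 for char data when the native codesets are compatible) is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of one side's codeset support (not a TAO type).
struct CodesetInfo {
    std::string native;                   // NCS: native codeset
    std::vector<std::string> conversion;  // CCS: codesets this side can translate to/from
};

static bool contains(const std::vector<std::string>& v, const std::string& s) {
    return std::find(v.begin(), v.end(), s) != v.end();
}

// Client-side choice of the transmission codeset (TCS).
// Returns an empty string when no common codeset exists.
std::string choose_tcs(const CodesetInfo& client, const CodesetInfo& server) {
    assert(!client.native.empty() && !server.native.empty());
    if (client.native == server.native)
        return client.native;                   // no translation on either side
    if (contains(client.conversion, server.native))
        return server.native;                   // only the client translates
    if (contains(server.conversion, client.native))
        return client.native;                   // only the server translates
    for (const std::string& cs : client.conversion)
        if (contains(server.conversion, cs))
            return cs;                          // both sides translate
    return std::string();                       // incompatible (fallback step omitted)
}
```

Under this model, a UTF-8 server with GB2312 as a conversion codeset and a GB2312 client with no translator settle on GB2312, while the same server and a GB2312 client that can also convert UTF-8 settle on UTF-8. In both cases only one side translates.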
I'm not sure that this code is appropriate for inclusion into TAO. It relies on iconv() which is not available on most platforms.

What about adding an iconv MPC base feature, disabled by default, enabled by the user?

Ok, but where would this go? I'm guessing with the Codeset library?

Codeset seems the best place.

I would put a specialized translator like this into its own library rather than piling on to the base codeset library. Maybe tao/Codeset/GB2312_UTF8. This way it could be dynamically loaded by only those people who need this.

Since this translator is iconv based, it might make sense to make a more general translator that takes the NCS and CCS values as translator factory parameters rather than having them hardwired in. This way it could be reused for a lot more codeset conversions. This I would put in a directory tao/Codeset/iconv and use a svc.conf directive such as:

dynamic Char_Iconv_Factory Service_Object * TAO_IconvTranslator::_make_char_iconv_factory() "ncs=UTF-8, ccs={GB2312}"

The drawback I see to this is that the factory has to be declared on the resource factory arg list, and I don't know how to differentiate multiple instances of Char_Iconv_Factory, if necessary. It might be possible to get around that by allowing the translator factory to be responsible for creating a multitude of translators.

Probably a simpler way would be to add a lightweight factory "skin" to the iconv translator, so that in the tao/Codeset/iconv directory we could add specific factory instances, such as UTF-8 -> GB2312. This would be loaded by modifying the above directive to be:

dynamic Char_Iconv_Factory Service_Object * TAO_IconvTranslator::_make_utf8_gb2312_factory()

Ok. Since the original bug report was an issue in the user's code, I am changing this to an enhancement instead of a "major" bug. |
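
The user-code issue referred to above was the length written by write_string: a CDR string's length prefix counts the terminating NUL, and the NUL itself is written after the characters. The sketch below illustrates only that convention. It does not use ACE or TAO: `put_cdr_string` and `get_cdr_string` are hypothetical helpers, the length is written in host byte order, and alignment padding is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Append a CDR-style string: 4-byte length that includes the trailing NUL,
// then the characters, then the NUL itself.
void put_cdr_string(std::vector<unsigned char>& out, const std::string& s) {
    assert(s.find('\0') == std::string::npos);  // no embedded NULs expected
    const std::uint32_t len = static_cast<std::uint32_t>(s.size() + 1);
    unsigned char raw[sizeof len];
    std::memcpy(raw, &len, sizeof len);
    out.insert(out.end(), raw, raw + sizeof len);
    out.insert(out.end(), s.begin(), s.end());
    out.push_back('\0');
}

// A strict reader: rejects a string whose last counted octet is not NUL.
bool get_cdr_string(const std::vector<unsigned char>& in, std::string& s) {
    std::uint32_t len = 0;
    if (in.size() < sizeof len)
        return false;
    std::memcpy(&len, in.data(), sizeof len);
    if (len == 0 || in.size() - sizeof len < len)
        return false;
    const unsigned char* p = in.data() + sizeof len;
    if (p[len - 1] != '\0')
        return false;                           // terminator missing from the count
    s.assign(reinterpret_cast<const char*>(p), len - 1);
    return true;
}
```

A writer that sends the length without the NUL, as in the "MARSHAL Exception version" earlier in this report, produces a buffer that such a reader rejects. Adding 1 to the length and writing the terminator, as in the "SUCCESS version", matches what the non-translating peer expects.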