Bug 2223

Summary: ORB CodeSet negotiate bug
Product: TAO
Reporter: Jiang Wei <jiangwei_1976>
Component: ORB
Assignee: Phil Mesnier <mesnierp>
Status: ASSIGNED ---    
Severity: enhancement    
Priority: P4    
Version: 1.4.6   
Hardware: All   
OS: All   
Attachments: Codeset lib for GB2312<==>UTF-8 using iconv
[ BUG Fixed ] Codeset lib for GB2312<==>UTF-8 using iconv
Updated to work with TAO 1.5.7

Description Jiang Wei 2005-08-26 07:15:26 CDT
Two TAO ORBs (X and Y):
 X (ncs = UTF-8, ccs = {GB2312}); Y (ncs = GB2312, ccs = {UTF-8})

Transmitting a string from X to Y raises an exception: MARSHAL.

It seems that X is using GB2312 as the TCS, but Y is using UTF-8 as the TCS.
Comment 1 Jiang Wei 2005-08-26 07:48:26 CDT
X runs on CentOS 4.1, Y on Windows 2000.
Comment 2 Phil Mesnier 2005-08-26 08:20:55 CDT
Hi,
 
First, the codeset negotiation feature of TAO was substantially changed after
1.4.6. In version 1.4.7, this code was pulled from the main TAO library and put
in a new library, libTAO_Codeset, and also includes a reference translator for
char data from UTF-8 to Latin_1 (ISO8859-1). I'm not familiar with GB2312, but
based on the hints given in your comment, I guess that GB2312 is compatible with
UTF-8. I also guess that you have a codeset translator to facilitate that
compatibility.
 
I'm surprised that you are seeing two different TCS values. Computation of the
TCS is a function of the client process. The server advertises its NCS and any
CCS in its profile, then the client computes the TCS by considering the server's
codesets and its own. The client then informs the server about the chosen TCS
with a service context attached to the first request. Therefore both sides
should use the same TCS.
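
As a rough illustration of the client-side computation described above, the sketch below follows the selection order given in the CORBA specification's codeset negotiation rules. It is not TAO's code and TAO may differ in detail: the names CodesetInfo and negotiate_tcs are hypothetical, codesets are plain strings rather than OSF registry ids, and the specification's final fallback step (UTF-8 for char data when the native codesets are compatible) is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical description of one ORB's char codeset configuration.
struct CodesetInfo {
  std::string ncs;               // native codeset
  std::vector<std::string> ccs;  // conversion codesets
};

static bool contains(const std::vector<std::string>& v, const std::string& s) {
  return std::find(v.begin(), v.end(), s) != v.end();
}

// Client-side choice of the transmission codeset (TCS).
// Throws if no common codeset can be found (fallback step omitted).
std::string negotiate_tcs(const CodesetInfo& client, const CodesetInfo& server) {
  if (client.ncs == server.ncs) return client.ncs;          // nobody translates
  if (contains(client.ccs, server.ncs)) return server.ncs;  // client translates
  if (contains(server.ccs, client.ncs)) return client.ncs;  // server translates
  for (const std::string& cs : client.ccs)                  // both sides translate
    if (contains(server.ccs, cs)) return cs;
  throw std::runtime_error("CODESET_INCOMPATIBLE");
}
```

With the configuration from the description (X as client, Y as server), this order selects GB2312: the client's conversion set contains the server's native codeset, so only the client would need a translator.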
 
I would like to see your translator(s) and service config files. Also, if you
would run both processes with -ORBDebugLevel 10, then send me the output of
both up to just after sending/receiving the first GIOP message.
 
Thanks,
Phil
Comment 3 Jiang Wei 2005-08-26 09:19:34 CDT
X's svc.conf
dynamic Char_UTF8_GB2312_Factory Service_Object * UTF8_GB2312:_make_Char_UTF8_GB2312_Factory ()
static Resource_Factory "-ORBNativeCharCodeSet UTF-8 -ORBNativeWcharCodeSet UTF-16 -ORBCharCodesetTranslator Char_UTF8_GB2312_Factory"

Y's svc.conf
dynamic Char_GB2312_UTF8_Factory Service_Object * GB2312_UTF8:_make_Char_GB2312_UTF8_Factory ()
static Resource_Factory "-ORBNativeCharCodeSet GB2312 -ORBNativeWcharCodeSet UTF-16 -ORBCharCodesetTranslator Char_GB2312_UTF8_Factory"
Comment 4 Jiang Wei 2005-08-26 09:24:13 CDT
#include "Char_GB2312_UTF8_Translator.h"
#include "ace/OS_Memory.h"
#include "ace/OS_NS_string.h"

GB2312_UTF8::GB2312_UTF8 (void)
{
  gu = iconv_open("UTF-8","GB2312");
  ug = iconv_open("GB2312","UTF-8");
}

GB2312_UTF8::~GB2312_UTF8 (void)
{
  iconv_close(gu);
  iconv_close(ug);
}

ACE_CDR::Boolean
GB2312_UTF8::read_char (ACE_InputCDR &in,
                            ACE_CDR::Char &x)
{
  if (this->read_1 (in, reinterpret_cast<ACE_CDR::Octet*> (&x)))
        return true;
  return 0;
}

ACE_CDR::Boolean
GB2312_UTF8::read_string (ACE_InputCDR& in,
                              ACE_CDR::Char *& x)
{
  ACE_CDR::ULong str_len,gb_len;
  in.read_ulong (str_len);

  if (str_len == 0) return false;
  gb_len = str_len * 2 ;

  ACE_CDR::Char *ubuf ,*gbuf;
  ACE_NEW_RETURN (gbuf,ACE_CDR::Char[gb_len],0);
  ubuf = gbuf + str_len;

  if (!this->read_char_array (in, ubuf, str_len))
  {
    delete [] gbuf;
    return false;
  }

  ACE_CDR::Char * d = gbuf;
  size_t r = iconv(ug,&ubuf,&str_len,&d,&gb_len);

  if (r == (size_t)(-1)) {
    delete [] gbuf ;
    return false;
  }

  if(gb_len) {
        d[0] = 0;
  } else {
    delete [] gbuf;
    return false;
  }

  x = gbuf;
  return true;
}

ACE_CDR::Boolean
GB2312_UTF8::read_char_array (ACE_InputCDR& in,
                                  ACE_CDR::Char* x,
                                  ACE_CDR::ULong len)
{
  if (this->read_array (in,
                        x,
                        ACE_CDR::OCTET_SIZE,
                        ACE_CDR::OCTET_ALIGN,
                        len))
    {
      return 1;
    }

  return 0;
}

ACE_CDR::Boolean
GB2312_UTF8::write_char (ACE_OutputCDR& out,
                             ACE_CDR::Char x)
{
  return this->write_1 (out,
                        reinterpret_cast<const ACE_CDR::Octet*> (&x));
}

ACE_CDR::Boolean
GB2312_UTF8::write_string (ACE_OutputCDR& out,
                               ACE_CDR::ULong len,
                               const ACE_CDR::Char* x)
{
  ACE_CDR::Char * ubuf;
  ACE_CDR::ULong ub_len = len * 2;
  ACE_NEW_RETURN (ubuf,ACE_CDR::Char[ub_len],0);

  ACE_CDR::Char * src = ubuf + len;//const_cast<ACE_CDR::Char*>(x);
  ACE_OS::memcpy (src,x,len);
  ACE_CDR::Char * dst = ubuf;
  size_t r = iconv(gu,&src,&len,&dst,&ub_len);

  size_t real_len = dst - ubuf + 1;

  if (r != (size_t)(-1) && out.write_ulong (real_len))
  {
      dst[0] = 0;

      ACE_CDR::Char * buf;

      if (this->adjust (out, real_len, 1, buf))
      {
        this->good_bit(out, 0);
        delete [] ubuf;
        return false;
      }

      ACE_OS::memcpy (buf,ubuf,real_len);
      delete [] ubuf;
      return true;
  }

  delete [] ubuf;
  return false;
}

ACE_CDR::Boolean
GB2312_UTF8::write_char_array (ACE_OutputCDR& out,
                                   const ACE_CDR::Char* x,
                                   ACE_CDR::ULong len)
{
  char *buf;

  if (this->adjust (out, len, 1, buf) == 0)
    {
      ACE_OS::memcpy (buf, x, len);

      return 1;
    }

  this->good_bit(out, 0);
  return 0;
}


/***************************************************************************
 ***************************************************************************/

#include "Char_UTF8_GB2312_Translator.h"
#include "ace/OS_Memory.h"
#include "ace/OS_NS_string.h"

UTF8_GB2312::UTF8_GB2312 (void)
{
  gu = iconv_open("UTF-8","GB2312");
  ug = iconv_open("GB2312","UTF-8");
}

UTF8_GB2312::~UTF8_GB2312 (void)
{
  iconv_close(gu);
  iconv_close(ug);
}

ACE_CDR::Boolean
UTF8_GB2312::read_char (ACE_InputCDR &in,
                            ACE_CDR::Char &x)
{
  if (this->read_1 (in, reinterpret_cast<ACE_CDR::Octet*> (&x)))
        return true;
  return 0;
}

ACE_CDR::Boolean
UTF8_GB2312::read_string (ACE_InputCDR& in,
                              ACE_CDR::Char *& x)
{
  ACE_CDR::ULong from_len,to_len;
  in.read_ulong (from_len);

  if (from_len == 0) return false;
  to_len = from_len * 2 ;// large enough

  ACE_CDR::Char *to_buf ,*from_buf;
  ACE_NEW_RETURN (to_buf,ACE_CDR::Char[to_len],0);
  from_buf = to_buf + from_len;// share one block of memory to reduce allocations

  if (!this->read_char_array (in, from_buf, from_len))
  {
    delete [] to_buf;
    return false;
  }

  ACE_CDR::Char * dest = to_buf;
  size_t r = iconv(gu,&from_buf,&from_len,&dest,&to_len);//overwrite from_buf

  if (r == (size_t)(-1)) {
    delete [] to_buf ;
    return false;
  }

  if(to_len) {// there should be plenty of space left
    dest[0] = 0;
  } else {
    delete [] to_buf;
    return false;
  }

  x = to_buf;// assign directly, trading wasted memory for speed
  return true;
}

ACE_CDR::Boolean
UTF8_GB2312::read_char_array (ACE_InputCDR& in,
                                  ACE_CDR::Char* x,
                                  ACE_CDR::ULong len)
{
  if (this->read_array (in,
                        x,
                        ACE_CDR::OCTET_SIZE,
                        ACE_CDR::OCTET_ALIGN,
                        len))
    {
      return 1;
    }

  return 0;
}

ACE_CDR::Boolean
UTF8_GB2312::write_char (ACE_OutputCDR& out,
                             ACE_CDR::Char x)
{
  return this->write_1 (out,
                        reinterpret_cast<const ACE_CDR::Octet*> (&x));
}

ACE_CDR::Boolean
UTF8_GB2312::write_string (ACE_OutputCDR& out,
                               ACE_CDR::ULong len,
                               const ACE_CDR::Char* x)
{
  ACE_CDR::Char * to_buf;
  ACE_CDR::ULong to_len = len * 2;// more than enough
  ACE_NEW_RETURN (to_buf,ACE_CDR::Char[to_len],0);

  ACE_CDR::Char * src = to_buf + len;//const_cast<ACE_CDR::Char*>(x);
  ACE_OS::memcpy (src,x,len);
  ACE_CDR::Char * dst = to_buf;
  size_t r = iconv(ug,&src,&len,&dst,&to_len);

  size_t real_len = dst - to_buf + 1;// write one extra 0 (the terminating NUL)

  if (r != (size_t)(-1) && out.write_ulong (real_len))
  {
    dst[0] = 0;
    ACE_CDR::Char * buf;

    if (this->adjust (out, real_len, 1, buf))
    {
      this->good_bit(out, 0);
      delete [] to_buf;
      return false;
    }

    ACE_OS::memcpy (buf,to_buf,real_len);
    delete [] to_buf;
    return true;
  }

  delete [] to_buf;
  return false;
}

ACE_CDR::Boolean
UTF8_GB2312::write_char_array (ACE_OutputCDR& out,
                                   const ACE_CDR::Char* x,
                                   ACE_CDR::ULong len)
{
  char *buf;

  if (this->adjust (out, len, 1, buf) == 0)
    {
      ACE_OS::memcpy (buf, x, len);

      return 1;
    }

  this->good_bit(out, 0);
  return 0;
}
Comment 5 Jiang Wei 2005-08-26 09:29:39 CDT
SUCCESS version:

  size_t real_len = dst - to_buf + 1;

  if (r != (size_t)(-1) && out.write_ulong (real_len))
  {
    dst[0] = 0;


--------------------------------------------------------
MARSHAL Exception version:

  size_t real_len = dst - to_buf;
  if (r != (size_t)(-1) && out.write_ulong (real_len))
  {
    //dst[0] = 0;

Comment 6 Phil Mesnier 2005-08-26 11:11:00 CDT
Ok. It looks like you found your solution.

What happens is that with your configuration, only one side is actually using a
translator, since the other side gets to use its native codeset. The
non-translated side assumes the marshaled string includes an embedded NUL. Your
change to the write_string algorithm made it conform with this assumption.
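
To make the NUL assumption concrete, here is a minimal sketch of how a GIOP char string is laid out in CDR: an unsigned long length that counts the terminating NUL, followed by the bytes and the NUL itself. The function names are hypothetical, and the sketch assumes a little-endian, already aligned buffer; it does not use the ACE CDR classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encode a string the way CDR lays it out: length (including NUL), bytes, NUL.
std::vector<std::uint8_t> encode_cdr_string(const std::string& s) {
  const std::uint32_t len = static_cast<std::uint32_t>(s.size() + 1);
  std::vector<std::uint8_t> out;
  for (int shift = 0; shift < 32; shift += 8)  // little-endian length field
    out.push_back(static_cast<std::uint8_t>((len >> shift) & 0xFF));
  out.insert(out.end(), s.begin(), s.end());
  out.push_back(0);  // terminating NUL, counted in len
  return out;
}

// Decode, rejecting buffers whose last counted byte is not NUL.
bool decode_cdr_string(const std::vector<std::uint8_t>& buf, std::string& out) {
  if (buf.size() < 4) return false;
  std::uint32_t len = 0;
  for (int i = 0; i < 4; ++i) len |= static_cast<std::uint32_t>(buf[i]) << (8 * i);
  if (len == 0 || buf.size() - 4 < len) return false;
  if (buf[4 + len - 1] != 0) return false;  // missing terminator
  out.assign(buf.begin() + 4, buf.begin() + 4 + (len - 1));
  return true;
}
```

A writer that sends the length without the extra byte, as in the "MARSHAL Exception version" above, produces a buffer that a decoder like this one rejects.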

I don't see anything that is necessarily wrong with TAO and the codeset
negotiation feature.

Let me know if you have other concerns, otherwise I'll close this bug.

Regards,
Phil 
Comment 7 Nanbor Wang 2005-08-26 23:33:58 CDT
Jiang, can you please think of donating the code for GB2312 to TAO?
Comment 8 Phil Mesnier 2005-09-08 12:11:41 CDT
I'm resolving this bug as it appears the solution did not involve a change to TAO. 

I think the translator code in this bug shows a nice way to use iconv as basis
for a more generic translator implementation.
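
A minimal sketch of what the iconv core of such a generic translator might look like, with the codeset names passed in instead of hardwired. The class name IconvConverter is hypothetical and this is not TAO code; it assumes the POSIX iconv signature with a char** input argument (some platforms declare const char**) and stateless encodings such as UTF-8 and GB2312.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: one iconv descriptor for a (from, to) codeset pair.
class IconvConverter {
public:
  IconvConverter(const std::string& from, const std::string& to)
      : cd_(iconv_open(to.c_str(), from.c_str())) {
    if (cd_ == (iconv_t)(-1))
      throw std::runtime_error("iconv_open failed: " + from + " -> " + to);
  }
  ~IconvConverter() { iconv_close(cd_); }
  IconvConverter(const IconvConverter&) = delete;
  IconvConverter& operator=(const IconvConverter&) = delete;

  // Converts the whole input; throws on invalid or truncated sequences.
  std::string convert(const std::string& input) {
    iconv(cd_, nullptr, nullptr, nullptr, nullptr);  // reset conversion state
    std::vector<char> in(input.begin(), input.end());
    std::string out(input.size() * 2 + 16, '\0');
    char* src = in.data();
    std::size_t src_left = in.size();
    std::size_t used = 0;
    while (src_left > 0) {
      char* dst = &out[used];
      std::size_t dst_left = out.size() - used;
      const std::size_t r = iconv(cd_, &src, &src_left, &dst, &dst_left);
      used = out.size() - dst_left;
      if (r == (std::size_t)(-1)) {
        if (errno != E2BIG) throw std::runtime_error("iconv conversion failed");
        out.resize(out.size() * 2);  // output buffer too small: grow and retry
      }
    }
    out.resize(used);
    return out;
  }

private:
  iconv_t cd_;
};
```

For the case in this bug, one would construct IconvConverter("GB2312", "UTF-8") and IconvConverter("UTF-8", "GB2312"), provided the local iconv installation supports those names. Growing the output buffer on E2BIG avoids relying on a fixed expansion factor.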
Comment 9 Jiang Wei 2005-09-09 04:06:56 CDT
I was writing a LwLog service.
The server saves log strings in Berkeley DB using the UTF-8 codeset.

server svc.conf :
dynamic Char_UTF8_GB2312_Factory Service_Object * /usr/lib/acetao/libUTF8_GB2312.so:_make_Char_UTF8_GB2312_Factory ()

static Resource_Factory "-ORBNativeCharCodeSet UTF-8 -ORBCharCodesetTranslator Char_UTF8_GB2312_Factory -ORBNativeWcharCodeSet UTF-16"
-------------------------END SVC.CONF---------------------------------------


ClientAsvc.conf :
static Resource_Factory "-ORBNativeCharCodeSet GB2312 -ORBNativeWcharCodeSet UTF-16"
-------------------------END SVC.CONF---------------------------------------

ClientBsvc.conf :
dynamic Char_GB2312_UTF8_Factory Service_Object * GB2312_UTF8:_make_Char_GB2312_UTF8_Factory ()

static Resource_Factory "-ORBNativeCharCodeSet GB2312 -ORBCharCodesetTranslator Char_GB2312_UTF8_Factory -ORBNativeWcharCodeSet UTF-16"
-------------------------END SVC.CONF---------------------------------------




Running the client with ClientAsvc.conf: SUCCESS!
Running the client with ClientBsvc.conf: FAILURE! 'CORBA::MARSHAL'

I debugged Char_GB2312_UTF8_Translator.cpp.
In the function "ACE_CDR::Boolean GB2312_UTF8::read_string (ACE_InputCDR&
in,ACE_CDR::Char *& x)", I read the string from the ACE_InputCDR (line 1) and
printed it to the screen (line 4). It is a GB2312-encoded string! It seems the
server has already converted the string from UTF-8 to GB2312, so the following
call to "iconv" fails (line 9) and a MARSHAL exception is thrown.


//-----------------------
  ....
1   if (!this->read_array (in,ubuf,ACE_CDR::OCTET_SIZE,ACE_CDR::OCTET_ALIGN,str_len))
2     return false;
3
4   std::cout << ubuf  << std::endl; /// FOR DEBUG.
5 
6  #if defined (ACE_WIN32)
7   size_t r = iconv(utf8_to_gb2312,(const char**)&ubuf,&str_len,&gbuf,&gb_len);
8  #else
9   size_t r = iconv(utf8_to_gb2312,&ubuf,&str_len,&gbuf,&gb_len);
10 #endif
11
12  if(r == (size_t)-1)
13    return false;

  ......
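
When debugging a case like this, printing the buffer depends on the terminal's own encoding, so a byte-level check can be more reliable. The sketch below is a hypothetical helper (not part of the translator) that tests whether a buffer is well-formed UTF-8. Most GB2312 double-byte text fails this test, so a "false" result on the received bytes would support the observation that the string is not UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if s is well-formed UTF-8 (no overlong forms, no surrogates).
bool is_valid_utf8(const std::string& s) {
  std::size_t i = 0;
  const std::size_t n = s.size();
  while (i < n) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (c < 0x80) { ++i; continue; }  // plain ASCII byte
    std::size_t len;
    std::uint32_t cp, min;
    if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
    else return false;  // stray continuation byte or invalid lead byte
    if (i + len > n) return false;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char cc = static_cast<unsigned char>(s[i + k]);
      if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (cc & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    i += len;
  }
  return true;
}
```

For example, the character U+4E2D is E4 B8 AD in UTF-8 and D6 D0 in GB2312; the first passes this check and the second does not.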
Comment 10 Jiang Wei 2005-09-09 07:42:45 CDT
>Jiang, can you please think of donating the code for GB2312 to TAO?
It's a pleasure for me

full source at 
http://us.f2.yahoofs.com/bc/40c29551_1852d/bc/codeset/codeset.tar.gz?bfDYYIDBKTF56DFC
Comment 11 Johnny Willemsen 2005-10-13 09:48:38 CDT
Could you maybe add the code as attachment to this bugzilla entry?
Comment 12 Jiang Wei 2005-12-11 22:52:32 CST
Created attachment 421 [details]
Codeset lib for GB2312<==>UTF-8 using iconv
Comment 13 Jiang Wei 2005-12-15 21:36:23 CST
Created attachment 428 [details]
[ BUG Fixed ] Codeset lib for GB2312<==>UTF-8 using iconv
Comment 14 Chad Elliott 2007-04-03 12:34:32 CDT
Created attachment 716 [details]
Updated to work with TAO 1.5.7
Comment 15 Chad Elliott 2007-04-03 12:35:16 CDT
I'm not sure that this code is appropriate for inclusion into TAO.  It relies on
iconv() which is not available on most platforms.
Comment 16 Johnny Willemsen 2007-04-09 02:41:50 CDT
What about adding an iconv MPC base feature, disabled by default and enabled by
the user?
Comment 17 Chad Elliott 2007-04-13 06:58:37 CDT
Ok, but where would this go?  I'm guessing with the Codeset library?
Comment 18 Johnny Willemsen 2007-04-13 07:01:32 CDT
codeset seems the best place
Comment 19 Phil Mesnier 2007-04-13 08:00:31 CDT
I would put a specialized translator like this into its own library rather than piling on to the base codeset library. Maybe tao/Codeset/GB2312_UTF8. This way it could be dynamically loaded by only those people who need this.

Since this translator is iconv based, it might make sense to make a more general translator that takes the NCS and CCS values as translator factory parameters rather than having them hardwired in. This way it could be reused for a lot more codeset conversions. This I would put in a directory tao/Codeset/iconv and use a svc.conf directive such as:

dynamic Char_Iconv_Factory Service_Object * TAO_IconvTranslator::_make_char_iconv_factory() "ncs=UTF-8, ccs={GB2312}"

The drawback I see to this is that the factory has to be declared on the resource factory arg list, and I don't know how to differentiate multiple instances of Char_Iconv_Factory, if necessary.

It might be possible to get around that by allowing the translator factory to be responsible for creating a multitude of translators. Probably a simpler way would be to add a lightweight factory "skin" to the iconv translator, so that in the tao/Codeset/iconv directory we could add specific factory instances, such as UTF-8 -> GB2312. This would be loaded by modifying the above directive to be:

dynamic Char_Iconv_Factory Service_Object * TAO_IconvTranslator::_make_utf8_gb2312_factory()
Comment 20 Chad Elliott 2007-04-13 08:25:34 CDT
Ok.  Since the original bug report was an issue in the user's code, I am changing this to an enhancement instead of a "major" bug.