969 – Incorrect UTF-8 conversion for non-BMP characters

Read only archive ; use https://github.com/JacORB/JacORB/issues for new issues

Status:	RESOLVED FIXED

Alias:	None

Product:	JacORB
Classification:	Unclassified
Component:	ORB
Version:	3.3
Hardware:	PC Linux

Importance:	P5 enhancement
Assignee:	Mailinglist to track bugs

Reported:	2013-11-13 05:30 UTC by Peter Klotz
Modified:	2014-03-19 14:45 UTC
CC List:	2 users

Attachments
Patch for UTF-8 conversion problem (22.44 KB, application/zip) 2014-02-17 10:05 UTC, Gotthard Witsch
JUnit Test for CodeSet.write_string (1.19 KB, application/zip) 2014-03-14 11:45 UTC, Gotthard Witsch

Description Peter Klotz 2013-11-13 05:30:21 UTC

Methods read_char() and write_char() in class Utf8CodeSet convert data from/to UTF-8 one Java char at a time. This presents a problem when dealing with characters outside the Basic Multilingual Plane (all code points beyond U+FFFF). In UTF-16 these characters require surrogate pairs, which means that two Java chars form a single character.

The following currently happens in JacORB 3.3 when sending Unicode Character U+1044F (DESERET SMALL LETTER EW, see http://www.fileformat.info/info/unicode/char/1044f/index.htm):

Java UTF-16 string: "\uD801\uDC4F"
Converted into UTF-8 and received by omniORB: "\xed\xa0\x81" "\xed\xb1\x8f"

The correct UTF-8 encoding would be: "\xf0\x90\x91\x8f"

So JacORB simply treats each surrogate as a character of its own and encodes it into UTF-8 separately. This yields 6 bytes, whereas the correct encoding is 4 bytes long.

To fix this, JacORB would have to stop performing its conversion solely on a per-char basis. The conversion classes should be able to handle whole Java strings. This would allow a conversion class to detect the two halves of a surrogate pair and convert them in a single step into the correct destination encoding.
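
The difference can be reproduced outside JacORB. The following is a minimal standalone sketch, not JacORB code: the class SurrogateUtf8Demo and its methods are hypothetical names chosen for illustration. encodePerChar mimics a converter that handles each Java char in isolation, while encodeWholeString lets the JDK encoder see the full string.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not exist in JacORB.
class SurrogateUtf8Demo {

    // Encodes every 16-bit Java char on its own, the way a char-wise
    // converter would. Surrogates end up as separate 3-byte sequences.
    static byte[] encodePerChar(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c < 0x80) {
                out.write(c);
            } else if (c < 0x800) {
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // Encodes the string as a whole, so the JDK encoder can combine a
    // surrogate pair into one code point before emitting UTF-8.
    static byte[] encodeWholeString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes in the same escaped-hex notation used in this report.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String deseretEw = "\uD801\uDC4F"; // U+1044F as a surrogate pair
        System.out.println("per char:     " + hex(encodePerChar(deseretEw)));
        System.out.println("whole string: " + hex(encodeWholeString(deseretEw)));
    }
}
```

For the string from this report, the per-char variant produces the 6 bytes observed on the omniORB side and the whole-string variant produces the 4-byte encoding.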

Comment 1 Gotthard Witsch 2014-02-17 10:05:12 UTC

Created attachment 431
Patch for UTF-8 conversion problem

Comment 2 Gotthard Witsch 2014-02-17 10:16:53 UTC

I attached a patch to solve the conversion problem.
The following changes have been made:

In org.jacorb.orb.CDROutputStream the conversion of the string is now done by calling codeSet.write_string.
For this purpose the class org.jacorb.orb.giop.CodeSet received a new method write_string with the following signature:
 public void write_string( OutputBuffer buffer, String s, boolean write_bom, boolean write_length, int giop_minor ).
Its default implementation does the same as CDROutputStream did before: every character of the string is converted with the write_char method.
The inner class Utf8CodeSet overrides this method and uses the String's getBytes(Charset charset) method to obtain the bytes for transmission. The bytes are then added to the buffer with the buffer's write_byte method. getBytes(Charset charset) is preferred to getBytes(String charsetName), as getBytes(String charsetName) does not specify what happens if characters cannot be encoded.
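
The shape of the change described above might look roughly like the following simplified sketch. It is not the attached patch: SketchBuffer, CollectingBuffer, SketchCodeSet and SketchUtf8CodeSet are hypothetical stand-ins, and the write_bom, write_length and giop_minor parameters of the real signature are left out to keep the example small.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the output buffer; only a write_byte
// operation is assumed here.
interface SketchBuffer {
    void write_byte(byte b);
}

// Hypothetical buffer that collects the written bytes for inspection.
class CollectingBuffer implements SketchBuffer {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    @Override
    public void write_byte(byte b) {
        out.write(b);
    }

    byte[] toByteArray() {
        return out.toByteArray();
    }
}

// Hypothetical base class: the default write_string converts the
// string one char at a time via write_char.
abstract class SketchCodeSet {
    abstract void write_char(SketchBuffer buffer, char c);

    void write_string(SketchBuffer buffer, String s) {
        for (int i = 0; i < s.length(); i++) {
            write_char(buffer, s.charAt(i));
        }
    }
}

// Hypothetical UTF-8 variant: write_string is overridden so that the
// whole string is encoded at once and surrogate pairs stay together.
class SketchUtf8CodeSet extends SketchCodeSet {
    @Override
    void write_char(SketchBuffer buffer, char c) {
        // A single char cannot carry a full non-BMP character.
        for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
            buffer.write_byte(b);
        }
    }

    @Override
    void write_string(SketchBuffer buffer, String s) {
        // getBytes(Charset) never throws on unencodable input; it
        // substitutes the charset's default replacement bytes.
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            buffer.write_byte(b);
        }
    }
}
```

Keeping the char-wise loop as the default means other code sets behave as before, and only the UTF-8 code set takes the whole-string path.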

Comment 3 Nick Cross 2014-02-18 16:21:48 UTC

Thanks for the patch. Do you have the tests you are using?

Comment 4 Gotthard Witsch 2014-03-14 11:45:34 UTC

Created attachment 432
JUnit Test for CodeSet.write_string

Comment 5 Gotthard Witsch 2014-03-14 11:46:28 UTC

I uploaded the tests for the new CodeSet.write_string method.

Comment 6 Nick Cross 2014-03-19 14:45:55 UTC

Thanks for the patch and test!

Fixed by https://github.com/JacORB/JacORB/pull/103