From: Mori, Nobuyoshi [[email protected]]
Sent: Tuesday, August 13, 2002 8:27 AM
To: 'Asmus Freytag'; Winkler, Arnold F; 'Thomas Plum'; 'John Benito'; 'Ann Bennett'; 'Tom Plum (WG21)'; 'Frank Farance'; 'John Hill'; 'Rex Jaeschke'; 'Keld Jørn Simonsen'; 'Willem Wakker'; 'Herb Sutter'
Cc: 'Matthew Deane'; 'Don Schricker'
Subject: RE: Documents for Character Set Ad-hoc (agenda time)

Thanks for the summary.  I'm afraid I must have missed part of the earlier communication.  The ad-hoc is a great chance for us to tidy up the historical confusion.  I therefore want to be sure that most of the language-related issues I have experienced in recent years are covered.  Thank you for your patience in reading my tedious mails.

I assume that the central issue in SC/22 is the interoperability of programming languages, and that program specification is a matter for the WGs.  There are various levels of interoperability:

  1. File and I/O level:
    • I found XML support in almost all languages.
  2. Client-server communication level:
    • Programs in each programming language (Cobol, Java, C/C++, etc.) act as, e.g., DB clients connected to a single DB server.
    • Same as above, via RFC communication.
    • etc.
  3. Process level:
    • Direct data exchange via shared memory or parameters, e.g. Java performance optimization in C/C++.
In order to have loss free communications,
not, however:
The confusion arises when using characters whose Unicode scalar values lie in the range U+10000 ~ U+10FFFF.  Those code points are represented by "11110xxx 10xxxxxx 10xxxxxx 10xxxxxx" in UTF-8 or by a surrogate pair [U+D800-U+DBFF, U+DC00-U+DFFF] in UTF-16.  There exist a considerable number of languages, products, libraries, etc. which treat those code points improperly, even under the encoding scheme names "UTF-8" and "UTF-16".  I attach a consideration on Java that we did when supporting the Unicode code points U+10000 ~ U+10FFFF.  The key point is that not only fixing the transcoding, but also adding some new APIs and renouncing a few language features, is desirable.  I assume that a similar consideration is possible for other programming languages.
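
To make the two representations concrete, here is a minimal Java sketch of how one code point in the range U+10000 ~ U+10FFFF maps to a UTF-16 surrogate pair and to the four-byte UTF-8 form.  The class and method names are made up for illustration, and the cross-check relies on JDK calls (Character.toChars, StandardCharsets) from releases later than this mail.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class SupplementaryEncodingDemo {

    private static void requireSupplementary(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not in U+10000..U+10FFFF");
        }
    }

    // UTF-16: split the code point into a high and a low surrogate.
    static char[] toSurrogatePair(int codePoint) {
        requireSupplementary(codePoint);
        int v = codePoint - 0x10000;                 // 20 significant bits
        char high = (char) (0xD800 + (v >> 10));     // U+D800..U+DBFF
        char low  = (char) (0xDC00 + (v & 0x3FF));   // U+DC00..U+DFFF
        return new char[] { high, low };
    }

    // UTF-8: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    static byte[] toUtf8(int codePoint) {
        requireSupplementary(codePoint);
        return new byte[] {
            (byte) (0xF0 | (codePoint >> 18)),
            (byte) (0x80 | ((codePoint >> 12) & 0x3F)),
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)),
            (byte) (0x80 | (codePoint & 0x3F))
        };
    }

    // Cross-check both hand-written encodings against the JDK's own encoder.
    static boolean matchesJdk(int codePoint) {
        char[] jdkPair = Character.toChars(codePoint);
        byte[] jdkUtf8 = new String(jdkPair).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(jdkPair, toSurrogatePair(codePoint))
            && Arrays.equals(jdkUtf8, toUtf8(codePoint));
    }
}
```

A transcoder that instead encodes each surrogate separately produces six bytes rather than four for such a character, which is one way the two encoding names end up covering incompatible data.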
 
When we talk about full support of Unicode, I personally believe that better support of string literals (including character literals) in all programming languages is important.  The string literal has two aspects:
It is important that string literals can be used for Unicode data types.  The encoding of a string literal is normally limited to the intersection of the program source character set and the program execution character set, while support of the full execution character set is often required.  The C# specification supports Unicode in string literals but limits the usage of U+10000 ~ U+10FFFF, which I personally believe is a very pragmatic and good compromise.  I wonder whether we can do a general, language-independent consideration of the issue.  The problem is not easy to resolve even between the C and C++ languages.  We have documented the current discussion status in http://anubis.dkuug.dk/JTC1/SC22/WG14/www/docs/n977.htm
Item 1.2 in the document was discussed in WG21, more or less as an offline discussion, as Tom mentioned on our phone conference.  So far I have not received any response to item 3 in the paper from WG14 colleagues (which makes me feel a bit uneasy).  The syntax of string literals is quite similar in most languages.  I have not succeeded in listing the Unicode string literal support in the other WGs.
 
We wish to check the language specification, if
The core of interoperability is loss-free data exchange.  Loss-free means that all the languages (Cobol, APL, Fortran, C/C++, etc.) should support Unicode, including U+10000 ~ U+10FFFF.  When that is the case, we can assume loose interoperability.
 
Process-level interoperability requires data types of identical width.  One question is whether we need such interoperability.  In the applications we build, we do need such interoperability among C/C++, C#, Java, Cobol, Visual Basic, etc.  Some of these languages are not in the scope of SC/22.
 
One more issue I would like to add is that we have made the computer languages quite complicated by additional Unicode-based character sets and encodings (encoding schemes).  One of the charms of C#, Java, VB, etc. is the abandonment of 8-bit characters.  It frees application-oriented (rather than system-oriented) programmers from unnecessary complexity.  Many companies decide to build from a single source and not to use 8-bit characters at all in a Unicode-based product.  When a computer language has features to support both Unicode and non-Unicode characters, it is desirable to have a subset of language features for programming in a clean Unicode or non-Unicode environment.  A facility to warn programmers when the intersection of Unicode and non-Unicode features is in use could be very useful.
 
An example in C would be:
char          c;
wchar_t    w;
 
c = w;                          /* a bug ? */
One possibility is to treat "c = w;" as a syntax error and replace it with wctomb() in a Unicode-based application.  (The example is not very good, because various implementations already issue a compiler warning for it.)  Would it be possible for the WGs to define language subsets?
 
Are the subjects of language subsets, string literals and data type issues outside the scope of the ad-hoc?  I'm happy that Jim Melton has registered for the ad-hoc.  How could we position the client-server communication problem?  Or can we address the questions to the WGs?
 
Thanks and regards,
Nobu
 
Attachment:
    Surrogates and combining characters support in Java
 

 

Surrogates and Combining Characters in Java

Target

Build a new API for the String, StringBuffer and Character classes of the Java SDK that ensures that surrogates and combining characters are preserved (e.g. avoids cutting a string within such a character).

Design ideas

Example

old:

String componentId = id;
int i = id.indexOf('_');
if (i >= 0) {
    componentId = id.substring(0, i);
}

new:

String componentId = Utf16Str.SplitBefore( id, '_');
if( componentId == null ) {
    componentId = id;
}
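
The attachment does not show how Utf16Str.SplitBefore is implemented. A minimal sketch, assuming it returns the part of the string before the first occurrence of the delimiter and null when the delimiter is absent, could look like the following. The class body and the int (code point) parameter are guesses for illustration, and the sketch relies on the code point aware String.indexOf(int) of later JDK releases.

```java
// Hypothetical sketch of the helper named in the example above.
final class Utf16Str {

    // Returns the text before the first occurrence of the delimiter,
    // or null if the delimiter does not occur. The delimiter is a code
    // point, so it may lie in U+10000..U+10FFFF.
    static String SplitBefore(String s, int delimiter) {
        // indexOf(int) matches a supplementary delimiter as a whole
        // surrogate pair and returns the index of its high surrogate,
        // so the cut never falls between the two halves of the match.
        int i = s.indexOf(delimiter);
        return i < 0 ? null : s.substring(0, i);
    }
}
```

The caller no longer handles an index at all, which is the point of the design: index arithmetic is where a surrogate pair can be cut by accident.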

Critical classes and methods

class Character: methods dealing with character properties

Example: 
boolean Character.isLetter( char c )

Requirement: 
In order to handle surrogate pairs properly, an interface for 32-bit characters (encoding UTF-32) is required.

Solution Approach:
Class UCharacter of ICU4J offers such a 32-bit interface and should be used instead of the JDK class Character.
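
The difference between a 16-bit and a 32-bit property interface can be sketched as follows. To stay self-contained, this sketch does not use ICU4J; it assumes the int (code point) overloads of java.lang.Character and String.codePointAt that later JDK releases provide, so it illustrates the requirement rather than the approach proposed here. The class name is made up.

```java
// Illustration only: compares property lookup by 16-bit char and by code point.
final class CodePointProperties {

    // Counts letters by code point, so a surrogate pair is classified once,
    // as the single character it represents.
    static int countLetters(String s) {
        int letters = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp);   // 1 for BMP, 2 for a surrogate pair
        }
        return letters;
    }

    // Counts letters through the 16-bit interface. Each half of a surrogate
    // pair is looked up alone and is not a letter by itself.
    static int countLettersByChar(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                letters++;
            }
        }
        return letters;
    }
}
```

For the string "A" followed by U+10400 (DESERET CAPITAL LETTER LONG I), the code point version counts two letters, while the char version counts only one.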

class String/StringBuffer: extract single characters from string

Example: 
char String.charAt( int index )

Requirement:
A 16-bit return value is problematic if the 16-bit value is part of a surrogate pair or part of a combining character sequence.

Solution Approach (depending on the programming context):

These alternatives can be offered via static methods and/or via a character iterator class.
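
As one possible shape for such static methods (a sketch, not the list of alternatives from the original design), the helpers below return whole code points instead of 16-bit values. The class and method names are invented, and String.codePointAt, String.codePointCount and Character.isHighSurrogate / isLowSurrogate are assumed from later JDK releases. Combining character sequences are not handled here.

```java
// Hypothetical static helpers that never hand out half of a surrogate pair.
final class CodePointAccess {

    // Returns the full code point that contains the char at index, even when
    // index points at the second half of a surrogate pair.
    static int codePointContaining(String s, int index) {
        if (index > 0
                && Character.isLowSurrogate(s.charAt(index))
                && Character.isHighSurrogate(s.charAt(index - 1))) {
            index--;                        // step back to the high surrogate
        }
        return s.codePointAt(index);
    }

    // Converts the whole string to an array of code points (UTF-32 values).
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp);
        }
        return out;
    }
}
```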

class String: searching

Example: 
int String.indexOf( char c )

Requirement:
When a matching character or string is found, the character that immediately follows the matching character has to be checked as well. If the matching sequence is immediately followed by a combining character, then it is not a valid match, because the combining character modifies the last character of the matching sequence.

From a practical point of view, this point is not really critical in most cases. We can distinguish 4 variants of searching:

Based on these considerations, we think that we can ignore combining characters in low-level searching and handle them on a higher level where necessary.
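
For the cases where the check is wanted on the higher level, the requirement above can be sketched as a search that skips matches followed by a combining character. The class and method names are invented. "Combining character" is approximated here by the general categories Mn, Mc and Me, which is an assumption, and the code point overloads of Character.getType and String.indexOf are taken from later JDK releases.

```java
// Hypothetical higher-level search that rejects matches modified by a
// following combining mark.
final class CombiningAwareSearch {

    // Approximation: treat general categories Mn, Mc and Me as combining.
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Returns the index of the first occurrence of target that is not
    // immediately followed by a combining mark, or -1 if there is none.
    static int indexOfStandalone(String s, int target) {
        int from = 0;
        while (true) {
            int i = s.indexOf(target, from);
            if (i < 0) {
                return -1;
            }
            int next = i + Character.charCount(target);
            if (next >= s.length() || !isCombiningMark(s.codePointAt(next))) {
                return i;                   // nothing modifies the match
            }
            from = next;                    // skip the modified match
        }
    }
}
```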

 

class String/StringBuffer: extracting parts of a string

Example: 
String String.substring( int beginIndex, int endIndex)

Requirement:
When extracting parts from a string, avoid splitting surrogate pairs and graphemes.

Solution Approach (depending on the programming context):

class StringBuffer: modifying parts of a string

Example: 
StringBuffer StringBuffer.replace( int beginIndex, int endIndex, String s )

The indices that mark the borders of the operation must not cut surrogate pairs or graphemes. In principle, the same approaches can be applied as for extracting parts from a string.
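
One way to keep the indices off the inside of a surrogate pair or grapheme, for both extracting and modifying, is to move each index to a safe boundary before calling the JDK method. The sketch below is an assumption about how this could be done, not the approach from the original design. It uses java.text.BreakIterator's character instance to find boundaries, moves indices downwards (an arbitrary choice), and all class and method names are invented.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical wrappers that adjust indices before substring/replace.
final class SafeBoundaries {

    // Moves index down to the nearest boundary at or before it, so that it
    // never falls inside a surrogate pair or a combining character sequence.
    static int floorBoundary(String s, int index) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        return it.isBoundary(index) ? index : it.preceding(index);
    }

    // substring() with both indices moved to safe boundaries first.
    static String safeSubstring(String s, int beginIndex, int endIndex) {
        return s.substring(floorBoundary(s, beginIndex), floorBoundary(s, endIndex));
    }

    // StringBuffer.replace() with both indices moved to safe boundaries first.
    static StringBuffer safeReplace(StringBuffer buf, int beginIndex, int endIndex, String r) {
        String s = buf.toString();
        return buf.replace(floorBoundary(s, beginIndex), floorBoundary(s, endIndex), r);
    }
}
```

Silently moving an index changes what the caller asked for, so a stricter variant could throw an exception instead. Which behaviour is right depends on the programming context.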

Rules and restrictions on strings that can be processed

Possible support by check tools

A check tool can detect and warn if one of the critical operations is performed.

It may be possible to avoid warnings if the critical operation is used in a safe context. This can be:

 


 



 

-----Original Message-----
From: Asmus Freytag [mailto:[email protected]]
Sent: Monday, August 12, 2002 7:37 PM
To: Mori, Nobuyoshi; 'Winkler, Arnold F'; 'Thomas Plum'; 'John Benito';
'Ann Bennett'; 'Tom Plum (WG21)'; 'Frank Farance'; 'John Hill'; 'Rex
Jaeschke'; 'Keld Jørn Simonsen'; 'Willem Wakker'; 'Herb Sutter'; Mori,
Nobuyoshi
Cc: 'Matthew Deane'; 'Don Schricker'
Subject: RE: Documents for Character Set Ad-hoc (agenda time)


Mr Mori has asked what the boundaries of discussion should be.
At Unicode we see three areas of discussion:

1 Representation of characters
   This is a datatype issue, and the discussion of
   the proposed UTF-16 support would fall here

2 Identifiers
   This is the issue Tom referred to. There are
   several strategies worth discussing

3 Character Properties / Processing
   If we can make progress with the first two, it
   might make sense to broaden the discussion to
   consider this issue.

We are working on creating a paper that we can submit
from the Unicode side to help with the discussion of
these three issues - that's why I have refrained
from doing more than list them here.

By accident of schedule, our next meeting is in the
week directly before the ad-hoc meeting; we will try
to come up with something earlier than that, if we can.

A./