Thanks for the summary. I'm afraid I must have missed part of the earlier communication. The ad hoc is a great chance for us to tidy up the historical confusion. I therefore want to be sure that most of the language-related issues I have experienced in recent years are covered. Thank you for your patience in reading my tedious mails.

I assume that the central issue in SC/22 is the interoperability of programming languages, and that program specification is a matter for the WGs. There are various levels of interoperability:

Surrogates and Combining Characters in Java

Target

Build a new API for the String, StringBuffer and Character classes of the Java SDK that ensures that surrogate pairs and combining characters are preserved (e.g. avoids cutting a string in the middle of such a character).

Design ideas

For each affected JDK class build a corresponding wrapper class.
Each wrapper class contains static methods as alternatives to the deprecated original methods.
Each wrapper method takes an object of the corresponding original class as a parameter. The wrapper method works on this object (this object takes the role of the "this" pointer).
There is typically no 1:1 mapping between wrapper and original methods. Depending on the context and the programmer's intention, one of several methods has to be chosen. For example, String.charAt can mean: fetch a code unit (16-bit), a Unicode character (32-bit) or a Grapheme (base character + following combining characters).
Often it is necessary to change not only a function call, but also the surrounding code. Therefore we scan through the existing code, search for patterns that occur repeatedly, and describe how to rewrite them.
We offer higher-level methods that support a whole coding pattern with one wrapper method. This makes it easier to rewrite existing code, documents the intention of the programmer and typically increases performance.

Example

old:

String componentId = id; int i = id.indexOf('_'); if (i >= 0) { componentId = id.substring(0, i); }

new:
String componentId = Utf16Str.splitBefore( id, '_'); if( componentId == null ) { componentId = id; }
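
To make the example concrete, here is a minimal sketch of what the split-before wrapper could look like, written as splitBefore. The implementation is not part of this mail, so the body and the null-on-no-match convention are assumptions taken from the usage shown above. The cut is safe as long as the delimiter is not itself a surrogate, because a non-surrogate code unit can never lie inside a surrogate pair.

```java
// Hypothetical sketch of the wrapper class used in the example above.
// The actual Utf16Str implementation is not shown in this mail.
final class Utf16Str {
    private Utf16Str() {}

    /**
     * Returns the part of s before the first occurrence of delimiter,
     * or null if the delimiter does not occur in s.
     */
    static String splitBefore(String s, char delimiter) {
        int i = s.indexOf(delimiter);
        return (i >= 0) ? s.substring(0, i) : null;
    }

    public static void main(String[] args) {
        String id = "component_42";
        String componentId = Utf16Str.splitBefore(id, '_');
        if (componentId == null) {
            componentId = id;
        }
        System.out.println(componentId); // prints "component"
    }
}
```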

Critical classes and methods

class Character: methods dealing with character properties

Example:
boolean Character.isLetter( char c )

Requirement:
In order to handle surrogate pairs properly, an interface for 32-bit characters (encoding UTF-32) is required.

Solution Approach:
The class UCharacter of ICU4J offers such a 32-bit interface and should be used instead of the JDK class Character.
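
As an illustration of why a 32-bit interface is needed, the sketch below checks character properties per code point rather than per code unit. To stay free of external dependencies it uses the int overloads that the JDK class Character gained from J2SE 5.0 onwards as a stand-in; with ICU4J, UCharacter.isLetter(int) would take the place of Character.isLetter(int). The class and method names are made up for this example.

```java
// Illustrative only: CodePointProperties and allLetters are hypothetical names.
final class CodePointProperties {
    private CodePointProperties() {}

    /** True if every code point (not code unit) of s is a letter. */
    static boolean allLetters(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // joins a surrogate pair into one 32-bit value
            if (!Character.isLetter(cp)) {
                return false;
            }
            i += Character.charCount(cp);   // 1 code unit for BMP, 2 for supplementary
        }
        return true;
    }

    public static void main(String[] args) {
        String deseret = "\uD801\uDC00";    // U+10400 DESERET CAPITAL LETTER LONG I
        // The 16-bit interface sees only a lone surrogate, which is not a letter.
        System.out.println(Character.isLetter(deseret.charAt(0))); // prints "false"
        // The 32-bit interface sees the complete character.
        System.out.println(allLetters(deseret));                   // prints "true"
    }
}
```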

class String/StringBuffer: extract single characters from string

Example:
char String.charAt( int index )

Requirement:
A 16-bit return value is problematic if the 16-bit value is part of a surrogate pair or part of a combining character sequence.

Solution Approach (depending on the programming context):

continue working on 16-bit code units (if the subsequent operations do not destroy surrogate pairs or composite character sequences)
work on 32-bit characters (e.g. this is necessary to check character properties)
work on Graphemes represented as strings

These alternatives can be offered via static methods and/or via a character iterator class.
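
The three alternatives could be offered as static methods roughly as follows. This is only a sketch: the class and method names are invented for illustration, String.codePointAt requires J2SE 5.0 or later, and java.text.BreakIterator is used as an approximation of Grapheme boundaries.

```java
import java.text.BreakIterator;

// Illustrative only: CharAccess and its method names are hypothetical.
final class CharAccess {
    private CharAccess() {}

    /** 16-bit code unit at index (what String.charAt returns). */
    static char codeUnitAt(String s, int index) {
        return s.charAt(index);
    }

    /** 32-bit character starting at index; index should point at the start of a character. */
    static int codePointAt(String s, int index) {
        return s.codePointAt(index);
    }

    /** Grapheme starting at index: base character plus following combining characters. */
    static String graphemeAt(String s, int index) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int end = it.following(index);      // next boundary after index
        return s.substring(index, end == BreakIterator.DONE ? s.length() : end);
    }
}
```

For the string "e" + U+0301 + "x", graphemeAt(s, 0) returns the two code units "e" + U+0301, while codeUnitAt(s, 0) returns only 'e'.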

class String: searching

Example:
int String.indexOf( char c )

Requirement:
When a matching character or string is found, the character that immediately follows the match has to be checked as well. If the matching sequence is immediately followed by a combining character, then it is not a valid match, because the combining character modifies the last character of the matching sequence.

From a practical point of view, this point is not really critical in most cases. We can distinguish 4 variants of searching:

searching for a delimiting character: delimiting characters are normally special characters (like <,>,= ...) to which combining characters can't be applied
searching for a token: tokens are separated by delimiting characters, so no combining characters can follow
searching for a prefix: theoretically a prefix could occur with a following combining character that invalidates the match, but this is very unlikely to be a problem
searching with like: here combining characters are relevant, but like needs more than low-level string search routines anyway

Based on these considerations, we think that we can ignore combining characters in low-level searching and handle them at a higher level where necessary.
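
Where the higher level does need the check described in the requirement, it could look roughly like this. The sketch is an assumption on my side, not an agreed design: the names are invented, a combining character is taken to be any character of general category Mn, Mc or Me, and Character.getType(int) requires J2SE 5.0 or later.

```java
// Illustrative only: CombiningAwareSearch and its methods are hypothetical.
final class CombiningAwareSearch {
    private CombiningAwareSearch() {}

    /** True if cp is a combining mark (general category Mn, Mc or Me). */
    static boolean isCombining(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Like String.indexOf(String), but rejects matches that are followed by a combining mark. */
    static int indexOfWhole(String text, String pattern) {
        int from = 0;
        while (true) {
            int i = text.indexOf(pattern, from);
            if (i < 0) {
                return -1;
            }
            int end = i + pattern.length();
            if (end >= text.length() || !isCombining(text.codePointAt(end))) {
                return i;                   // nothing modifies the last matched character
            }
            from = i + 1;                   // invalid match, continue behind it
        }
    }
}
```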

class String/StringBuffer: extracting parts of a string

Example:
String String.substring( int beginIndex, int endIndex)

Requirement:
When extracting parts from a string, avoid splitting surrogate pairs and Graphemes.

Solution Approach (depending on the programming context):

combine extraction with a preceding search operation (e.g. splitBefore, splitAfter, cutPrefix ...)
Completely remove from the result surrogate pairs and Graphemes that would be split (e.g. when storing strings in a buffer with limited size)
keep index access if the string has a fixed format
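
The second approach (dropping whatever would be split when a string has to fit into a limited buffer) can be sketched as below. The names are hypothetical, and java.text.BreakIterator is used as an approximation of Grapheme boundaries; it also keeps surrogate pairs together.

```java
import java.text.BreakIterator;

// Illustrative only: SafeTruncate and truncate are hypothetical names.
final class SafeTruncate {
    private SafeTruncate() {}

    /**
     * Returns the longest prefix of s that has at most maxUnits code units
     * and ends on a Grapheme boundary. maxUnits must not be negative.
     */
    static String truncate(String s, int maxUnits) {
        if (s.length() <= maxUnits) {
            return s;
        }
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        // If the cut would fall inside a pair or Grapheme, move back to the previous boundary.
        int end = it.isBoundary(maxUnits) ? maxUnits : it.preceding(maxUnits);
        return s.substring(0, end);
    }
}
```

For example, truncating "ab" followed by a surrogate pair to 3 code units yields "ab": the pair is removed completely instead of being cut in half.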

class StringBuffer: modifying parts of a string

Example:
StringBuffer StringBuffer.replace( int beginIndex, int endIndex, String s )

Requirement:
The indices that mark the borders of the operation must not cut through surrogate pairs or Graphemes. In principle, the same approaches can be applied as for extracting parts of a string.

Rules and restrictions on strings that can be processed

Strings may not contain unpaired surrogates. Unpaired surrogates may be skipped or replaced with another character at any time.
In line with the W3C character model, strings must be normalized early. That means that we normalize strings immediately when they are entered by the user, and that we assume that strings are normalized when we receive them from other software. We use Normalization Form C (canonical decomposition followed by canonical composition). Searching, sorting and identity matching may not work with unnormalized strings.
Strings may not start with a combining character. Since string concatenation with such strings may result in unnormalized strings, searching, sorting and identity matching may not work.
Identifiers shall not contain format characters (e.g. to indicate writing direction from right to left). Format characters are not ignored when searching for identifiers.
In contexts where a character must be quoted, because it is used as delimiter, this character must also be quoted if it is used as a base character of a composite character sequence.
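
The first three rules could be enforced at the point where strings enter the system, roughly as sketched below. All names are invented for illustration. Replacing an unpaired surrogate with U+FFFD is one of the options the first rule allows, java.text.Normalizer is available from Java 6 onwards, and Character.isSurrogate from Java 7 onwards.

```java
import java.text.Normalizer;

// Illustrative only: StringRules and its methods are hypothetical.
final class StringRules {
    private StringRules() {}

    /** Replaces every unpaired surrogate with U+FFFD REPLACEMENT CHARACTER. */
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                out.append(c).append(s.charAt(++i));    // well-formed pair: copy both units
            } else if (Character.isSurrogate(c)) {
                out.append((char) 0xFFFD);              // lone high or low surrogate
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Early normalization: repair surrogates, then convert to Normalization Form C. */
    static String normalizeEarly(String input) {
        return Normalizer.normalize(replaceUnpairedSurrogates(input), Normalizer.Form.NFC);
    }

    /** True if s starts with a combining mark (general category Mn, Mc or Me). */
    static boolean startsWithCombining(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int type = Character.getType(s.codePointAt(0));
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }
}
```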

Possible support by check tools

A check tool can detect the critical operations and issue a warning when one of them is used.

It may be possible to suppress the warning if the critical operation is used in a safe context. This can be:

extracting a part of a string with indices that come from previous search or string-length operations
searching for a character which does not permit combining characters (e.g. \n)
extracting a character as a 16-bit value from a string and comparing this character to other characters that are neither surrogate pairs nor can be followed by combining characters

-----Original Message-----
From: Asmus Freytag [mailto:[email protected]]
Sent: Monday, August 12, 2002 7:37 PM
To: Mori, Nobuyoshi; 'Winkler, Arnold F'; 'Thomas Plum'; 'John Benito';
'Ann Bennett'; 'Tom Plum (WG21)'; 'Frank Farance'; 'John Hill'; 'Rex
Jaeschke'; 'Keld Jørn Simonsen'; 'Willem Wakker'; 'Herb Sutter'; Mori,
Nobuyoshi
Cc: 'Matthew Deane'; 'Don Schricker'
Subject: RE: Documents for Character Set Ad-hoc (agenda time)

Mr Mori has asked what the boundaries of discussion should be.
At Unicode we see three areas of discussion:

1 Representation of characters
   This is a datatype issue, and the discussion of
   the proposed UTF-16 support would fall here

2 Identifiers
   This is the issue Tom referred to. There are
   several strategies worth discussing

3 Character Properties / Processing
   If we can make progress with the first two, it
   might make sense to broaden the discussion to
   consider this issue.

We are working on creating a paper that we can submit
from the Unicode side to help with the discussion of
these three issues - that's why I have refrained
from doing more than list them here.

By accident of schedule, our next meeting is in the
week directly before the ad-hoc meeting; we will try
to come up with something earlier than that, if we can.

A./