- File and I/O level:
  - I found XML support in almost all languages.
- Client/server communication level:
  - Programs written in each programming language (Cobol, Java, C/C++, etc.)
    act, for example, as DB clients connected to a single DB server.
  - Same as above, via RFC communication
  - etc.
- Process level:
  - Direct data exchange via shared memory or parameters, e.g. Java
    performance optimization in C/C++.
In order to have loss-free communication,
- the communication media must be loss-free (XML, for example, fulfills this)
- all communication partners must support the latest Unicode Standard, the
  encoding scheme of which can be:
  - UTF-8
  - UTF-16
  - UTF-32
  - possibly also CESU-8
  not, however:
The confusion arises with characters whose Unicode scalar values lie in the
range U+10000 ~ U+10FFFF. Those code points are
represented by "11110xxx 10xxxxxx 10xxxxxx 10xxxxxx" in UTF-8
or by a surrogate pair [U+D800-U+DBFF, U+DC00-U+DFFF] in UTF-16. A
considerable number of languages, products, libraries,
etc. treat those code points improperly, even under the encoding
scheme names "UTF-8" and "UTF-16". I attach a consideration on
Java that we did when supporting the Unicode code points U+10000 ~
U+10FFFF. The key point is that it is desirable not only to fix the
transcoding, but also to add some new APIs and to renounce a few language
features. I assume that a similar consideration is possible for other
programming languages.
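
As a minimal sketch of the two representations mentioned above, the following Java fragment encodes one supplementary code point (U+10400, chosen only for illustration) and exposes its UTF-8 bytes and UTF-16 code units. The class name SupplementaryDemo and its method names are hypothetical, and Character.toChars was added to Java (J2SE 5.0) after this note was written, so it is not part of the attached consideration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class SupplementaryDemo {
    private SupplementaryDemo() {}

    // UTF-8 bytes of a single code point, as unsigned values 0..255.
    static int[] utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF;
        }
        return out;
    }

    // UTF-16 code units of a single code point
    // (two units, a surrogate pair, for U+10000..U+10FFFF).
    static char[] utf16Units(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int cp = 0x10400;
        for (int b : utf8Bytes(cp)) {
            System.out.printf("%02X ", b);      // F0 90 90 80
        }
        System.out.println();
        for (char u : utf16Units(cp)) {
            System.out.printf("%04X ", (int) u); // D801 DC00
        }
        System.out.println();
    }
}
```

A component that stores the two surrogates as two separate 3-byte UTF-8 sequences produces CESU-8 rather than UTF-8, which is one way the mislabeling described above can come about.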
When we talk about full Unicode support, I personally believe that better
support of string literals (including character literals) in all programming
languages is important. String literals have two aspects:
It is important that string literals can be used for Unicode data
types. The encoding of a string literal is normally limited to the intersection
of the program source character set and the program execution character set,
while support of the full execution character set is often required. The C#
specification supports Unicode in string literals, but limits the usage
of U+10000 ~ U+10FFFF, which I personally believe is a
very pragmatic and good compromise. I wonder whether we can do a
general, language-independent consideration of the issue. The
problem is not easy to resolve even between the C and C++ languages.
We have documented the current status of the discussion in
http://anubis.dkuug.dk/JTC1/SC22/WG14/www/docs/n977.htm
Item 1.2 of the document was discussed in WG21, more or less as an offline
discussion, as Tom mentioned on our phone conference. So far I
have not received any response to item 3 of the paper from WG14 colleagues
(which makes me feel a bit uneasy). The syntax of string literals is quite
similar in most languages. I have not succeeded in listing the Unicode string
literal support in the other WGs.
We wish to check the language specifications to see whether
- full-featured Unicode support is possible and
- interoperability of programming languages is given.
The core of interoperability is loss-free data
exchange. Loss-free means that all the languages (Cobol, APL,
Fortran, C/C++, etc.) should support Unicode, including U+10000 ~
U+10FFFF. When this is the case, we can assume
loose interoperability.
At the process level, interoperability requires data types whose width
is identical. One question is whether we need such
interoperability. In the applications we build, we do need such
interoperability among C/C++, C#, Java, Cobol, Visual Basic, etc. Some of
these languages are not in the scope of SC22.
One more issue I would like to add is that we have made computer
languages quite complicated through additional Unicode-based character sets
and encodings (encoding schemes). One of the charms of C#, Java, VB, etc.
is the abandonment of 8-bit characters. It frees programmers who are
application-oriented rather than system-oriented from unnecessary
complexity. Many companies decide to create
- a Unicode-based product
- a non-Unicode-based product
out of a single source, and not to use 8-bit characters at all in the
Unicode-based product. When a computer language has features to
support both Unicode and non-Unicode characters, it is desirable to have a
subset of language features for programming in a clean Unicode or a clean
non-Unicode environment. A way to warn programmers when the
intersection of Unicode and non-Unicode features is in use can be
very useful.
An example in C would be:
    char c;
    wchar_t w;
    c = w;    /* a bug ? */
One possibility is to treat "c=w;" as a syntax error
and to replace it with wctomb() in a Unicode-based application. (The
example is not very nice, because various implementations already issue a
compiler warning here.) Would it be possible for the WGs to define
language subsets?
Are the subjects of language subsets, string literals and data type
issues not in the scope of the ad-hoc? I'm happy that Jim Melton has
registered for the ad-hoc. How could we position the client/server
communication problem? Or can we address the questions to the WGs?
Thanks and regards,
Nobu
Attachment: Surrogates and combining characters support in Java
Surrogates and Combining Characters in Java
Target
Build a new API for the String, StringBuffer and Character classes of the Java
SDK that ensures that surrogates and combining characters are preserved (e.g.
avoid cutting a string within such a character).
Design ideas
- For each affected JDK class, build a corresponding wrapper class.
- The wrapper class contains static methods as alternatives to the deprecated
original methods.
- Each wrapper method takes an object of the corresponding original class as a
parameter. The wrapper method works on this object (this object takes the role
of the "this" pointer).
- There is typically no 1:1 mapping between wrapper and original methods.
Depending on the context and the programmer's intention, one of several
methods has to be chosen. For example, String.charAt can mean: fetch a code
unit (16-bit), a Unicode character (32-bit) or a Grapheme (base character +
following combining characters).
- Often it is necessary to change not only a function call, but also the
surrounding code. Therefore we scan through the existing code, search
for patterns that occur repeatedly and describe how to rewrite the code.
- We offer higher-level methods that support whole coding patterns with one
wrapper method. This makes it easier to rewrite existing code, documents
the intention of the programmer and typically increases performance.
Example
old:
    String componentId = id;
    int i = id.indexOf('_');
    if (i >= 0) {
        componentId = id.substring(0, i);
    }
new:
    String componentId = Utf16Str.SplitBefore(id, '_');
    if (componentId == null) {
        componentId = id;
    }
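
The note does not show the body of the wrapper method, so the following is only a sketch of how such a higher-level method might look. The class layout, the Java-style spelling splitBefore and the rejection of surrogate delimiters are assumptions made for illustration; only the contract visible in the example (return the part before the delimiter, or null if the delimiter is absent) is taken from the text.

```java
// Hypothetical sketch of the wrapper class used in the example above.
final class Utf16Str {
    private Utf16Str() {}

    // Returns the part of s before the first occurrence of delimiter,
    // or null if the delimiter does not occur.
    // Assumption: a delimiter is never a surrogate code unit, so the cut
    // position found by indexOf cannot fall inside a surrogate pair.
    static String splitBefore(String s, char delimiter) {
        if (Character.isSurrogate(delimiter)) {
            throw new IllegalArgumentException("delimiter must not be a surrogate");
        }
        int i = s.indexOf(delimiter);
        return i >= 0 ? s.substring(0, i) : null;
    }

    public static void main(String[] args) {
        String id = "component_42";
        String componentId = splitBefore(id, '_');
        if (componentId == null) {
            componentId = id;
        }
        System.out.println(componentId); // component
    }
}
```

Because search and extraction happen in one call, the index never leaves the method, which is what lets a check tool treat the operation as safe.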
Critical classes and methods
class Character: methods dealing with character properties
Example:
boolean Character.isLetter( char c )
Requirement:
In order to handle surrogate pairs properly, an
interface for 32-bit characters (encoding UTF-32) is required.
Solution Approach:
Class UCharacter of ICU4J offers such a 32-bit
interface and should be used instead of the JDK class Character.
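
To illustrate why a 16-bit property interface is not enough, here is a small sketch that checks the same string once per code unit and once per code point. It does not use ICU4J; to stay self-contained it relies on the int overloads of java.lang.Character, which were added in J2SE 5.0, after this note was written. The class name LetterCheck and its method names are hypothetical.

```java
// Hypothetical demo class, for illustration only.
final class LetterCheck {
    private LetterCheck() {}

    // 16-bit view: each surrogate is tested on its own and is never a letter.
    static boolean allLettersByCodeUnit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetter(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // 32-bit view: a surrogate pair is tested as one code point.
    static boolean allLettersByCodePoint(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isLetter(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // "A" followed by U+10400 (DESERET CAPITAL LETTER LONG I)
        String s = "A\uD801\uDC00";
        System.out.println(allLettersByCodeUnit(s));  // false
        System.out.println(allLettersByCodePoint(s)); // true
    }
}
```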
class String/StringBuffer: extract single characters from string
Example:
char String.charAt( int index )
Requirement:
A 16-bit return value is problematic if the 16-bit value is
part of a surrogate pair or part of a combining character sequence.
Solution Approach (depending on the programming context):
- continue working on 16-bit code units (if the following operations do not
destroy surrogate pairs or composite character sequences)
- work on 32-bit characters (e.g. this is necessary to check character
properties)
- work on Graphemes represented as strings
These alternatives can be offered via static methods and/or via a character
iterator class.
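
The three alternatives can be sketched as static methods as follows. The class name CharAccess and the method names are hypothetical. The grapheme variant uses java.text.BreakIterator as a stand-in for whatever grapheme logic the real wrapper would use, and String.codePointAt postdates this note (J2SE 5.0).

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch of the three access levels, for illustration only.
final class CharAccess {
    private CharAccess() {}

    // Level 1: a 16-bit code unit (may be half of a surrogate pair).
    static char codeUnitAt(String s, int index) {
        return s.charAt(index);
    }

    // Level 2: a 32-bit Unicode character (code point) starting at index.
    static int codePointAt(String s, int index) {
        return s.codePointAt(index);
    }

    // Level 3: the Grapheme starting at index (base character plus following
    // combining characters), returned as a string.
    // Assumption: index < s.length() and index is a character boundary.
    static String graphemeAt(String s, int index) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int end = it.following(index);
        if (end == BreakIterator.DONE) {
            end = s.length();
        }
        return s.substring(index, end);
    }

    public static void main(String[] args) {
        // "e" + U+0301 (combining acute accent), then U+10400
        String s = "e\u0301\uD801\uDC00";
        System.out.println((int) codeUnitAt(s, 0));                    // 101
        System.out.println(Integer.toHexString(codePointAt(s, 2)));    // 10400
        System.out.println(graphemeAt(s, 0).length());                 // 2
    }
}
```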
class String: searching
Example:
int String.indexOf( char c )
Requirement:
When a matching character or string is found, the character
that immediately follows the match has to be checked as well. If
the matching sequence is immediately followed by a combining character, then it
is not a valid match, because the combining character modifies the last
character of the matching sequence.
From a practical point of view, this is not really critical in most
cases. We can distinguish 4 variants of searching:
- searching for a delimiting character
  Delimiting characters are normally special characters (like <, >, = ...)
  to which combining characters can't be applied.
- searching for a token
  Tokens are separated by delimiting characters, so no combining characters
  can follow.
- searching for a prefix
  Theoretically, a prefix could occur with a following combining character
  that invalidates the match, but this is very unlikely to be a problem.
- searching with like
  Here combining characters are relevant, but like needs more than low-level
  string search routines anyway.
Based on these considerations, we think that we can ignore combining characters
in low-level searching and handle them on a higher level where necessary.
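
Where the check is needed on a higher level, it could look roughly like the sketch below, which skips any match that is immediately followed by a combining character. The class and method names are hypothetical, and treating the general categories Mn, Mc and Me as "combining characters" is an assumption of this sketch, not something the note specifies.

```java
// Hypothetical sketch of a combining-aware search, for illustration only.
final class CombiningAwareSearch {
    private CombiningAwareSearch() {}

    // Assumption: "combining character" means general category Mn, Mc or Me.
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Like String.indexOf(String), but rejects a match whose last character
    // is modified by a combining character that follows it.
    static int indexOfUnmodified(String text, String pattern) {
        int from = 0;
        while (true) {
            int i = text.indexOf(pattern, from);
            if (i < 0) {
                return -1;
            }
            int end = i + pattern.length();
            if (end == text.length() || !isCombiningMark(text.codePointAt(end))) {
                return i;
            }
            from = i + 1; // match is invalidated, continue searching
        }
    }

    public static void main(String[] args) {
        // "cafe" + U+0301, a space, then plain "cafe"
        String text = "cafe\u0301 cafe";
        System.out.println(text.indexOf("cafe"));            // 0
        System.out.println(indexOfUnmodified(text, "cafe")); // 6
    }
}
```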
class String/StringBuffer: extracting parts of a string
Example:
String String.substring( int beginIndex, int endIndex )
Requirement:
When extracting parts of a string, avoid splitting
surrogate pairs and Graphemes.
Solution Approach (depending on the programming context):
- combine extraction with a preceding search operation (e.g. splitBefore,
splitAfter, cutPrefix ...)
- completely remove from the result surrogate pairs and Graphemes that would
be split (e.g. when storing strings in a buffer with limited size)
- keep index access if the string has a fixed format
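
For the limited-buffer case, a minimal sketch might move the cut position backwards until it no longer falls inside a surrogate pair or directly before a combining character, so that the affected character is dropped as a whole. The names SafeTruncate, splitsCharacter and truncate are hypothetical, and the sketch approximates a Grapheme as a base character followed by characters of general category Mn, Mc or Me.

```java
// Hypothetical sketch of size-limited extraction, for illustration only.
final class SafeTruncate {
    private SafeTruncate() {}

    // Assumption: "combining character" means general category Mn, Mc or Me.
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // True if cutting s directly before index would separate a surrogate
    // pair or detach a combining character from the character before it.
    static boolean splitsCharacter(String s, int index) {
        if (index <= 0 || index >= s.length()) {
            return false;
        }
        if (Character.isLowSurrogate(s.charAt(index))
                && Character.isHighSurrogate(s.charAt(index - 1))) {
            return true;
        }
        return isCombiningMark(s.codePointAt(index));
    }

    // Returns at most maxCodeUnits code units of s; a surrogate pair or
    // combining sequence that would be split is removed completely.
    static String truncate(String s, int maxCodeUnits) {
        if (maxCodeUnits >= s.length()) {
            return s;
        }
        int end = Math.max(maxCodeUnits, 0);
        while (splitsCharacter(s, end)) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(truncate("ab\uD801\uDC00", 3).length()); // 2
        System.out.println(truncate("ae\u0301", 2));                // a
    }
}
```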
class StringBuffer: modifying parts of a string
Example: StringBuffer StringBuffer.replace( int beginIndex, int endIndex,
String s )
The indices that mark the borders of the operation must not cut surrogate
pairs or Graphemes. In principle, the same approaches can be applied as for
extracting parts of a string.
Rules and restrictions on strings that can be processed
- Strings may not contain unpaired surrogates. Unpaired surrogates may be
skipped or replaced with another character at any time.
- In accordance with the W3C character model, strings must be normalized
early. That means that we normalize strings immediately when they are entered
by the user, and that we assume that strings are normalized when we receive
them from other software. We use Normalization Form C (canonical decomposition
followed by canonical composition). Searching, sorting and identity matching
may not work with unnormalized strings.
- Strings may not start with a combining character. Since string
concatenation with such strings may result in unnormalized strings, searching,
sorting and identity matching may not work.
- Identifiers shall not contain format characters (e.g. to indicate writing
direction from right to left). Format characters are not ignored when
searching for identifiers.
- In contexts where a character must be quoted because it is used as a
delimiter, this character must also be quoted if it is used as the base
character of a composite character sequence.
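
The first three rules could be enforced at the input boundary roughly as sketched below. The class and method names are hypothetical. java.text.Normalizer was added in Java 6, after this note was written, and replacing unpaired surrogates with U+FFFD is one possible choice of "another character", not something the note prescribes.

```java
import java.text.Normalizer;

// Hypothetical sketch of input-side string rules, for illustration only.
final class StringRules {
    private StringRules() {}

    // Replaces every unpaired surrogate with U+FFFD (an assumed choice).
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                out.append(c).append(s.charAt(++i)); // valid pair, keep both
            } else if (Character.isSurrogate(c)) {
                out.append('\uFFFD');                // unpaired surrogate
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Early normalization: clean up surrogates, then apply Form C.
    static String normalizeEarly(String input) {
        return Normalizer.normalize(replaceUnpairedSurrogates(input),
                Normalizer.Form.NFC);
    }

    // Assumption: "combining character" means general category Mn, Mc or Me.
    static boolean startsWithCombiningMark(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int type = Character.getType(s.codePointAt(0));
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        // "e" + U+0301 composes to U+00E9 under Form C
        System.out.println(normalizeEarly("e\u0301").length()); // 1
        System.out.println(startsWithCombiningMark("\u0301a")); // true
    }
}
```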
Possible support by check tools
A check tool can detect and warn if one of the critical operations is
used.
It may be possible to avoid warnings if the critical operation is used in a
safe context. This can be:
- extracting a part of a string with indices that come from previous search
or stringlen operations
- searching for a character which does not permit combining characters (e.g.
\n)
- extracting a character as a 16-bit value from a string and comparing this
character to other characters that are neither parts of surrogate pairs nor
can be followed by combining characters
-----Original Message-----
From: Asmus Freytag [mailto:[email protected]]
Sent: Monday, August 12, 2002 7:37 PM
To: Mori, Nobuyoshi; 'Winkler, Arnold F'; 'Thomas Plum'; 'John Benito';
'Ann Bennett'; 'Tom Plum (WG21)'; 'Frank Farance'; 'John Hill';
'Rex Jaeschke'; 'Keld Jørn Simonsen'; 'Willem Wakker'; 'Herb Sutter';
Mori, Nobuyoshi
Cc: 'Matthew Deane'; 'Don Schricker'
Subject: RE: Documents for Character Set Ad-hoc (agenda time)
Mr Mori has asked what the boundaries of discussion should be.
At Unicode we see three areas of discussion:
1 Representation of characters
  This is a datatype issue, and the discussion of
  the proposed UTF-16 support would fall here
2 Identifiers
  This is the issue Tom referred to. There are
  several strategies worth discussing
3 Character Properties / Processing
  If we can make progress with the first two, it
  might make sense to broaden the discussion to
  consider this issue.
We are working on creating a paper that we can submit
from the Unicode side to help with the discussion of
these three issues - that's why I have refrained
from doing more than list them here.
By accident of schedule, our next meeting is in the
week directly before the ad-hoc meeting; we will try
to come up with something earlier than that, if we can.
A./