JTC1/SC22/WG14 N831

                    Document Number:  WG14 N831 / J11 98-030


                        C9X Revision Proposal
                        =====================

Title:  UCN Revision
Author:  Douglas A. Gwyn
Author Affiliation:  U.S. Army Research Laboratory
Postal Address:  6449 Tauler Ct., Columbia, MD 21045-4530 US
E-mail Address:  [email protected]
Telephone Number:  (301)394-2287
Fax Number:  (301)394-3591
Sponsor:  J11: Douglas A. Gwyn
Date:  29 May, 1998
Document History:  new proposal based on on-line discussion
Proposal Category:
   X  Correction
Area of Standard Affected:
   X  Language
   X  Preprocessor
Prior Art:  Plan 9 C compiler, C++ standard, C9x CD1
Target Audience:  programmers who need to use extended
characters in their C sources, especially string literals
Related Documents (if any):  C9x CD1
Proposal Attached:  X  Yes
Abstract:  The C9x draft requires replacement of extended
multibyte source characters by universal character names.
This is unwise in two situations: (a) string literals
contain multibyte characters in some encoding that doesn't
map onto ISO 10646 in a one-to-one manner, e.g. a shift
encoding; (b) extended characters are used that have no code
value of their own in ISO 10646, e.g. distinct Chinese and
Japanese ideographs to which ISO 10646 assigns the same
code.  It is better to leave explicit extended multibyte
characters as they were written.
Proposal:  The basic idea is to not require extended
multibyte source characters to be mapped into anything else,
until translation phase 5 (where execution-time codes are
created).  Apart from this improvement, an attempt is made
to preserve the useful properties and characteristics of the
previous UCN-based specification.  The implementation of
this proposal is given with respect to C9x WD 1997-11-21 as
modified by previous editorial changes, but it should be
easy enough to apply to any more recent working draft:

In 5.1.1.2 Translation phases, change phase 1 to read:

   1.	Physical source file multibyte characters are mapped
	to the source character set (introducing new-line
	characters for end-of-line indicators) if necessary.
	Trigraph sequences, then universal character names,
	are replaced by corresponding single members of the
	source character set.

(Delete the footnote about handling extended characters.)

Note:  Trigraphs have to be processed first, since UCNs use
\ which is not defined in the ISO 646 invariant codeset.
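
The ordering in the note above can be illustrated with a
minimal sketch in C.  It is not part of the proposed wording;
the function names (trigraph_value, replace_trigraphs,
replace_ucns, phase1) are hypothetical, and the sketch
assumes an ASCII-compatible host and a source already split
into single-byte characters.  It ignores details such as a
doubled backslash preceding "u".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map the third character of a trigraph to its
   replacement, or return 0 if the character does not end a trigraph. */
static char trigraph_value(char c)
{
    switch (c) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Pass A: replace trigraph sequences in place; returns the new length. */
static size_t replace_trigraphs(char *s, size_t n)
{
    size_t in = 0, out = 0;
    while (in < n) {
        char r = 0;
        if (in + 2 < n && s[in] == '?' && s[in + 1] == '?')
            r = trigraph_value(s[in + 2]);
        if (r) { s[out++] = r; in += 3; }
        else   { s[out++] = s[in++]; }
    }
    return out;
}

/* Value of a hexadecimal digit, or -1 (assumes contiguous a-f, A-F). */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Pass B: replace each \uXXXX or \UXXXXXXXX by one source character.
   Writes one element per source character to out (capacity >= n)
   and returns the number of elements written. */
static size_t replace_ucns(const char *s, size_t n, uint32_t *out)
{
    size_t in = 0, count = 0;
    while (in < n) {
        size_t digits = 0;
        if (s[in] == '\\' && in + 1 < n)
            digits = s[in + 1] == 'u' ? 4 : s[in + 1] == 'U' ? 8 : 0;
        if (digits != 0 && in + 2 + digits <= n) {
            uint32_t value = 0;
            size_t k;
            for (k = 0; k < digits; k++) {
                int h = hex_value(s[in + 2 + k]);
                if (h < 0) break;
                value = value * 16 + (uint32_t)h;
            }
            if (k == digits) {       /* a complete UCN was matched */
                out[count++] = value;
                in += 2 + digits;
                continue;
            }
        }
        out[count++] = (unsigned char)s[in++];  /* ordinary character */
    }
    return count;
}

/* Phase 1 as proposed: trigraphs first, then UCNs. */
static size_t phase1(char *s, size_t n, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    n = replace_trigraphs(s, n);
    return replace_ucns(s, n, out);
}
```

Given the nine characters ? ? / u 0 0 C 5 x, phase1 yields two
source characters, the first with value 0x00C5.  Running
replace_ucns alone on the same input finds no backslash and
therefore no UCN, which is why the trigraph pass must come
first.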

In 5.1.1.2 Translation phases, phase 4, delete the sentence
concerning token concatenation producing a character
sequence that matches the syntax of a UCN.

In 5.1.1.2 Translation phases, change phase 5 to read:

   5.	Each source character set member and escape sequence
	in character constants and string literals is
	converted to a member of the execution character
	set.
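
As an illustration only, the late conversion in phase 5 might
be sketched in C as follows.  The types src_kind and
src_char, the ucn_encoder callback, and the functions
phase5_convert and encode_utf8 are all hypothetical.  The
sketch assumes that explicitly written source bytes use the
same encoding as the execution character set, uses UTF-8
merely as a stand-in encoder for characters that came from
UCNs, and omits escape-sequence handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum src_kind {
    SRC_BYTE,  /* a byte written explicitly in the source file */
    SRC_UCN    /* a character produced by UCN replacement in phase 1 */
};

struct src_char {
    enum src_kind kind;
    uint32_t value;  /* byte value, or ISO 10646 code value */
};

/* Hypothetical encoder for UCN-derived characters: stores the
   execution encoding of code at dst and returns the byte count
   (0 if the character cannot be represented). */
typedef size_t (*ucn_encoder)(uint32_t code, unsigned char *dst);

/* Convert the contents of a string literal.  Explicit bytes,
   including any shift sequences, are copied exactly as written;
   only UCN-derived characters are passed to the encoder. */
static size_t phase5_convert(const struct src_char *src, size_t n,
                             ucn_encoder encode, unsigned char *dst)
{
    size_t out = 0;
    assert(encode != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i].kind == SRC_BYTE)
            dst[out++] = (unsigned char)src[i].value;
        else
            out += encode(src[i].value, dst + out);
    }
    return out;
}

/* Stand-in encoder (an assumption of this sketch): UTF-8. */
static size_t encode_utf8(uint32_t c, unsigned char *dst)
{
    if (c < 0x80) {
        dst[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        dst[0] = (unsigned char)(0xC0 | (c >> 6));
        dst[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        dst[0] = (unsigned char)(0xE0 | (c >> 12));
        dst[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        dst[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c < 0x110000) {
        dst[0] = (unsigned char)(0xF0 | (c >> 18));
        dst[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        dst[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        dst[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;  /* outside the range this sketch handles */
}
```

The point of the design is that a shift-in/shift-out sequence
such as ESC $ B ... ESC ( B in a string literal reaches the
output byte for byte, because nothing earlier in translation
has rewritten it as a UCN.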

In 5.2.1 Character sets, change paragraph 2 by attaching a
footnote * at the end of the first sentence (about \ escape
sequences):

     *	Backslash characters introducing universal character
	names have already been replaced in translation
	phase 1.

Note:  If the syntax spec for UCNs has been moved to around
6.1.2 Identifiers, it should be returned to around 5.2.1 (it
probably should have its own subsection after 5.2.1.1
Trigraphs).  This is required because UCN replacement now
occurs in translation phase 1.  See the following change:

In 6.1.2 Identifiers, Syntax, replace the expansion of 
nondigit as universal-character-name by:

		extended-source-character

and in 6.1.2 Identifiers, Description, replace the sentence
about universal character names with:

	Each extended source character in an identifier
	shall designate a character whose encoding in ISO
	10646-1 falls into one of the ranges specified in
	Annex I.*

(Footnote unchanged.)

Note:  It is essential that Annex I exclude the basic source
characters; otherwise, this spec needs further constraints.

In the latest draft, the sentence preceding the one just
mentioned has been changed to mention UCNs; that part must
be changed to read:

	... and certain extended source characters

which will immediately be explained by the sentence above.

Also in the latest draft, the sentence following the one
replaced above has been changed to mention UCNs; that
sentence must be changed to read:

	The initial nondigit character shall not be an
	extended source character designating a digit.
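
The two identifier rules above (every extended source
character must fall in an Annex I range, and the initial
character must not be an extended digit) could be checked
roughly as in the following C sketch.  The names range,
allowed, find_range and identifier_ok are hypothetical, the
table is a small placeholder and NOT the actual Annex I list,
and characters below 0x80 are assumed to be the basic source
characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

struct range {
    uint32_t lo, hi;  /* inclusive ISO 10646 code value range */
    int is_digit;     /* nonzero if the range designates digits */
};

/* Placeholder table for illustration; a real implementation would
   carry the full list of ranges from Annex I. */
static const struct range allowed[] = {
    { 0x00C0, 0x00D6, 0 },
    { 0x00D8, 0x00F6, 0 },
    { 0x0660, 0x0669, 1 }   /* Arabic-Indic digits */
};

/* Return the table entry containing c, or NULL if there is none. */
static const struct range *find_range(uint32_t c)
{
    for (size_t i = 0; i < sizeof allowed / sizeof allowed[0]; i++)
        if (c >= allowed[i].lo && c <= allowed[i].hi)
            return &allowed[i];
    return NULL;
}

/* Return 1 if the n source characters in id form a valid identifier
   under the rules sketched here, otherwise 0. */
static int identifier_ok(const uint32_t *id, size_t n)
{
    assert(id != NULL || n == 0);
    if (n == 0)
        return 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = id[i];
        if (c < 0x80) {
            /* basic source character: letter, underscore, or a
               digit that is not in the initial position */
            int nondigit = c == '_' || isalpha((int)c);
            if (!nondigit && !(i > 0 && isdigit((int)c)))
                return 0;
        } else {
            const struct range *r = find_range(c);
            if (r == NULL)
                return 0;   /* not in any permitted range */
            if (i == 0 && r->is_digit)
                return 0;   /* extended digit in initial position */
        }
    }
    return 1;
}
```

With this placeholder table, the sequence n, U+00E9, U+0661 is
accepted, while an identifier that begins with U+0661 or that
contains U+00D7 is rejected.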

In 6.1.3.4 Character constants, delete the expansion of
c-char as universal-character-name.

In 6.1.4 String literals, delete the expansion of s-char as
universal-character-name.

Note:  "Any member of the source character set" includes
extended source characters, including those resulting from
UCN replacement in translation phase 1.

(Annex I Universal character names for identifiers seems OK
as is.)

In K.2 Undefined behavior, delete the second item (about
token concatenation producing a character sequence matching
the syntax of a UCN).

And that's it.  The effect is to allow UCNs to be used to
denote characters not in the local source character set, to
allow extended source characters in identifiers, and to map
extended source characters in character constants and string
literals (whether written as multibyte characters or
produced by UCN replacement) only at the last possible
moment, so that shift codes etc. are left intact.