JTC1/SC22/WG14 N717

WG14/N717
J11/97-080
1997-06-23
Thomas Plum
Wording for "Extended Identifiers"     [Revision #4, after voting]


In the text below, lines that start with 6 spaces are quoted
intact from C9X draft 9.  The lines at the left margin are
the proposed words to incorporate extended identifiers, 
taken generally verbatim from the second C++ CD 14882.


        5.1.1.2  Translation phases

        [#1] The precedence among the syntax rules of translation is
        specified by the following phases.

         1.  Physical source file  characters  are  mapped  to  the
             source  character set (introducing new-line characters
             for end-of-line indicators)  if  necessary.   Trigraph
             sequences   are   replaced  by  corresponding  single-
             character internal representations.
Any source file character not in the basic source character set  
is replaced by the universal-character-name that designates that 
character.*)
---------------
*) The process of handling extended characters is specified in terms 
of mapping to an encoding that uses only the basic source character 
set, and, in the case of character literals and strings, further 
mapping to the execution character set. In practical terms, however, 
any internal encoding may be used, so long as an actual extended 
character encountered in the input, and the same extended character 
expressed in the input as a universal-character-name (i.e. using the 
\uXXXX notation), are handled equivalently.
---------------
             [...]
         4.  Preprocessing   directives   are    executed,    macro
             invocations  are  expanded,  and pragma unary operator
             expressions are executed.   
If a character sequence that matches the syntax of a 
universal-character-name is produced by token concatenation 
(16.3.3), the behavior is undefined.
             A  #include  preprocessing
             directive causes the named header or source file to be
             processed from phase 1 through phase  4,  recursively.
             All preprocessing directives are then deleted.

         5.  Each source character set member,
escape  sequence, and universal-character-name
             in   character   constants   and  string  literals  is
             converted to a member of the execution character set.
         [etc as-is]
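
[Editorial illustration, not proposed wording.  The phase 1
replacement above can be pictured as spelling each extended
character as a universal-character-name.  The sketch below is one
possible way to produce that spelling; the function name ucn_spell
and its buffer interface are assumptions made for this example.]

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes the universal-character-name spelling
   of the character short identifier cp into buf.  The four-digit
   form is used when cp fits in four hexadecimal digits, otherwise
   the eight-digit form is used.
   Returns the number of characters written (not counting the
   terminating null), or 0 if cp is out of range or buf is too small. */
size_t ucn_spell(unsigned long cp, char *buf, size_t bufsize)
{
    int n;

    if (cp > 0xFFFFFFFFUL)
        return 0;               /* more than eight hex digits */

    if (cp <= 0xFFFFUL)
        n = snprintf(buf, bufsize, "\\u%04lx", cp);
    else
        n = snprintf(buf, bufsize, "\\U%08lx", cp);

    if (n < 0 || (size_t)n >= bufsize)
        return 0;               /* encoding error or truncation */
    return (size_t)n;
}
```

As the footnote to phase 1 notes, an implementation need not
materialize this spelling at all; any internal encoding will do if
both forms of input are handled equivalently.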

Constraints

A universal-character-name shall not specify a character short identifier
in the ranges 0000 through 0020 or 007F through 009F, inclusive.  A 
universal-character-name shall not designate a character in the basic source character set.
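
[Editorial illustration, not proposed wording.  The two constraints
above amount to a simple test on the designated value.  The sketch
below assumes that the basic source character set occupies the
positions it has in ISO 646 (ASCII), so that the only graphic
characters in 0021 through 007E outside the basic set are $, @ and
the grave accent.  The function name is hypothetical.]

```c
/* Hypothetical check of the constraints on the value designated by
   a universal-character-name.  Returns 1 if cp is permitted, else 0. */
int ucn_value_allowed(unsigned long cp)
{
    /* Excluded range 0000 through 0020. */
    if (cp <= 0x0020UL)
        return 0;

    /* Excluded range 007F through 009F. */
    if (cp >= 0x007FUL && cp <= 0x009FUL)
        return 0;

    /* Assumption: within 0021 through 007E every character except
       $ (0024), @ (0040) and grave accent (0060) belongs to the
       basic source character set and is therefore excluded. */
    if (cp <= 0x007EUL)
        return cp == 0x0024UL || cp == 0x0040UL || cp == 0x0060UL;

    return 1;
}
```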


       5.2  Environmental considerations

       5.2.1  Character sets

       [#1] Two sets of characters and their  associated  collating
       sequences  shall  be defined:  the set in which source files
       are written,  and  the  set  interpreted  in  the  execution
       environment.   The  values  of  the members of the execution
       character set  are  implementation-defined;  any  additional
       members  beyond those required by this subclause are locale-
       specific.
[etc as-is, to the last paragraph of 5.2.1, then add...]

The universal-character-name construct provides a way to name other 
characters.

hex-quad: hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
universal-character-name: \u hex-quad 
                          \U hex-quad hex-quad

The character designated by the universal-character-name \UNNNNNNNN 
is that character whose character short identifier is
NNNNNNNN specified by ISO/IEC 10646 pDAM-9; 
the character designated by the 
universal-character-name \uNNNN is that character whose 
character short identifier is
0000NNNN specified by ISO/IEC 10646 pDAM-9.
[This wording reflects comments from Japan about C++ CD2.]
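
[Editorial illustration, not proposed wording.  The grammar above
is small enough to recognize directly: a backslash, then u and one
hex-quad, or U and two hex-quads.  The sketch below returns the
designated character short identifier; the names ucn_parse and
hex_digit_value are assumptions made for this example.]

```c
#include <stddef.h>

/* Returns the value of a hexadecimal digit, or -1 if c is not one. */
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    switch (c) {
    case 'a': case 'A': return 10;
    case 'b': case 'B': return 11;
    case 'c': case 'C': return 12;
    case 'd': case 'D': return 13;
    case 'e': case 'E': return 14;
    case 'f': case 'F': return 15;
    }
    return -1;
}

/* Hypothetical recognizer for a universal-character-name at the
   start of the null-terminated string s.  On success stores the
   designated value in *out and returns the number of characters
   consumed (6 or 10); returns 0 if s does not start with one. */
size_t ucn_parse(const char *s, unsigned long *out)
{
    size_t ndigits, i;
    unsigned long value = 0;

    if (s[0] != '\\')
        return 0;
    if (s[1] == 'u')
        ndigits = 4;            /* one hex-quad  */
    else if (s[1] == 'U')
        ndigits = 8;            /* two hex-quads */
    else
        return 0;

    for (i = 0; i < ndigits; i++) {
        int d = hex_digit_value(s[2 + i]);
        if (d < 0)
            return 0;           /* also stops at the terminating null */
        value = (value << 4) | (unsigned long)d;
    }
    *out = value;
    return 2 + ndigits;
}
```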

        Forward   references:    character   constants    (6.1.3.4),
        preprocessing  directives  (6.8),  string  literals (6.1.4),
        comments (6.1.9).
        
        [...]

        6.1.2  Identifiers

        Syntax

        [#1]

                identifier:
                        nondigit
                        identifier nondigit


                nondigit: one of
universal-character-name
                         _  a  b  c  d  e  f  g  h  i  j  k  l  m
                            n  o  p  q  r  s  t  u  v  w  x  y  z
                            A  B  C  D  E  F  G  H  I  J  K  L  M
                            N  O  P  Q  R  S  T  U  V  W  X  Y  Z

        [#2] An identifier is  a  sequence  of  nondigit  characters
        (including  the underscore _ and the lowercase and uppercase
        letters)  and  digits.   
Each universal-character-name in an identifier shall designate 
a character whose encoding in ISO 10646 
falls into one of the ranges specified in Annex xxx.*)
-----------------
*) On systems in which linkers cannot accept extended characters, 
an encoding of the universal-character-name may be used in forming 
valid external identifiers. For example, some otherwise unused 
character or sequence of characters may be used to encode the \u in 
a universal-character-name. Extended characters may produce a long 
external identifier. 
-----------------
        The  first  character  shall  be  a  nondigit character.
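
[Editorial illustration, not proposed wording.  The footnote above
leaves the external encoding to the implementation.  The sketch
below shows one conceivable scheme: the identifier is taken in its
universal-character-name spelling and each backslash is replaced by
$, a character outside the basic source character set.  Both the
choice of $ and the function name are assumptions; a real linker
may call for a different marker.]

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical encoder: copies the identifier id, given in its
   universal-character-name spelling, into out, replacing each
   backslash with '$' so that only characters a simple linker is
   assumed to accept remain.
   Returns the length of the result, or 0 if out is too small. */
size_t mangle_extended_identifier(const char *id, char *out,
                                  size_t outsize)
{
    size_t len = strlen(id);
    size_t i;

    if (len + 1 > outsize)
        return 0;
    for (i = 0; i < len; i++)
        out[i] = (id[i] == '\\') ? '$' : id[i];
    out[len] = '\0';
    return len;
}
```

Under this scheme each extended character costs six or ten
characters in the external name, which is the growth the footnote
warns about.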

             [...]

       6.1.3.4  Character constants

       Syntax

       [#1]

               c-char:
                       any member of the source character set except
                               the single-quote ', backslash \, or 
                               new-line character
                       escape-sequence
universal-character-name

        6.1.4  String literals

        Syntax

        [#1]

                s-char:
                        any member of the source character set except
                                the double-quote ", backslash \, or 
                                new-line character
                        escape-sequence
universal-character-name


  ___________________________________________________________________

  Annex xxx (normative)

  Universal-character-names for identifiers
  ___________________________________________________________________

1 This Clause lists the hexadecimal code values that are valid  in  uni-
  versal-character-names in identifiers.

2 This  table  is reproduced unchanged from ISO/IEC PDTR 10176, produced
  by ISO/IEC  JTC1/SC22/WG20,  except  that  the  ranges  0041-005a  and
  0061-007a  designate the upper and lower case English alphabets, which
  are part of the basic source character set, and are  not  repeated  in
  the table below.*)
--------------
*) If PDTR 10176 is changed during its balloting
  and adoption as a TR, then this table should be changed to  match  its
  changes.
--------------


  Latin:   00c0-00d6,   00d8-00f6,   00f8-01f5,   01fa-0217,  0250-02a8,
  1e00-1e9a, 1ea0-1ef9

  Greek:  0384, 0388-038a, 038c, 038e-03a1, 03a3-03ce, 03d0-03d6,  03da,
  03dc,   03de,   03e0,   03e2-03f3,  1f00-1f15,  1f18-1f1d,  1f20-1f45,
  1f48-1f4d,  1f50-1f57,  1f59,  1f5b,   1f5d,   1f5f-1f7d,   1f80-1fb4,
  1fb6-1fbc,  1fc2-1fc4,  1fc6-1fcc,  1fd0-1fd3,  1fd6-1fdb,  1fe0-1fec,
  1ff2-1ff4, 1ff6-1ffc

  Cyrillic:  0401-040d,  040f-044f,  0451-045c,  045e-0481,   0490-04c4,
  04c7-04c8, 04cb-04cc, 04d0-04eb, 04ee-04f5, 04f8-04f9

  Armenian:  0531-0556, 0561-0587

  Hebrew:  05d0-05ea, 05f0-05f4

  Arabic:    0621-063a,   0640-0652,  0670-06b7,  06ba-06be,  06c0-06ce,
  06e5-06e7

  Devanagari:  0905-0939, 0958-0962

  Bengali:  0985-098c, 098f-0990, 0993-09a8, 09aa-09b0, 09b2, 09b6-09b9,
  09dc-09dd, 09df-09e1, 09f0-09f1

  Gurmukhi:   0a05-0a0a,  0a0f-0a10,  0a13-0a28,  0a2a-0a30,  0a32-0a33,
  0a35-0a36, 0a38-0a39, 0a59-0a5c, 0a5e

  Gujarati:    0a85-0a8b,   0a8d,   0a8f-0a91,   0a93-0aa8,   0aaa-0ab0,
  0ab2-0ab3, 0ab5-0ab9, 0ae0

  Oriya:    0b05-0b0c,   0b0f-0b10,   0b13-0b28,  0b2a-0b30,  0b32-0b33,
  0b36-0b39, 0b5c-0b5d, 0b5f-0b61

  Tamil:  0b85-0b8a, 0b8e-0b90, 0b92-0b95, 0b99-0b9a,  0b9c,  0b9e-0b9f,
  0ba3-0ba4, 0ba8-0baa, 0bae-0bb5, 0bb7-0bb9

  Telugu:    0c05-0c0c,   0c0e-0c10,  0c12-0c28,  0c2a-0c33,  0c35-0c39,
  0c60-0c61

  Kannada:   0c85-0c8c,  0c8e-0c90,  0c92-0ca8,  0caa-0cb3,   0cb5-0cb9,
  0ce0-0ce1

  Malayalam:  0d05-0d0c, 0d0e-0d10, 0d12-0d28, 0d2a-0d39, 0d60-0d61

  Thai:  0e01-0e30, 0e32-0e33, 0e40-0e46, 0e4f-0e5b

  Lao:   0e81-0e82,  0e84, 0e87, 0e88, 0e8a, 0e8d, 0e94-0e97, 0e99-0e9f,
  0ea1-0ea3, 0ea5,  0ea7,  0eaa,  0eab,  0ead-0eb0,  0eb2,  0eb3,  0ebd,
  0ec0-0ec4, 0ec6

  Georgian:  10a0-10c5, 10d0-10f6

  Hiragana:  3041-3094, 309b-309e

  Katakana:  30a1-30fe

  Bopomofo:  3105-312c

  Hangul:  1100-1159, 1161-11a2, 11a8-11f9


  CJK   Unified   Ideographs:   f900-fa2d,  fb1f-fb36,  fb38-fb3c, fb3e,
  fb40-fb41,  fb42-fb44,  fb46-fbb1,  fbd3-fd3f,  fd50-fd8f,  fd92-fdc7,
  fdf0-fdfb,   fe70-fe72,   fe74,   fe76-fefc,   ff21-ff3a,   ff41-ff5a,
  ff66-ffbe, ffc2-ffc7, ffca-ffcf, ffd2-ffd7, ffda-ffdc, 4e00-9fa5
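
[Editorial illustration, not proposed wording.  A translator could
hold the ranges above in a sorted array and test a value by binary
search.  The sketch below carries only a small sample of the table
(Armenian, Hebrew, Devanagari, Georgian, Hiragana, Katakana); the
type and function names are assumptions made for this example.]

```c
#include <stddef.h>

struct ucn_range {
    unsigned long lo;   /* first value in the range */
    unsigned long hi;   /* last value in the range  */
};

/* Sample only: a few rows of the table, in ascending order. */
static const struct ucn_range sample_ranges[] = {
    { 0x0531UL, 0x0556UL }, { 0x0561UL, 0x0587UL },   /* Armenian   */
    { 0x05d0UL, 0x05eaUL }, { 0x05f0UL, 0x05f4UL },   /* Hebrew     */
    { 0x0905UL, 0x0939UL }, { 0x0958UL, 0x0962UL },   /* Devanagari */
    { 0x10a0UL, 0x10c5UL }, { 0x10d0UL, 0x10f6UL },   /* Georgian   */
    { 0x3041UL, 0x3094UL }, { 0x309bUL, 0x309eUL },   /* Hiragana   */
    { 0x30a1UL, 0x30feUL }                            /* Katakana   */
};

/* Returns 1 if cp lies in one of the sample ranges, else 0. */
int ucn_in_sample_ranges(unsigned long cp)
{
    size_t lo = 0;
    size_t hi = sizeof sample_ranges / sizeof sample_ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < sample_ranges[mid].lo)
            hi = mid;
        else if (cp > sample_ranges[mid].hi)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}
```

A full implementation would list every range in the Annex; the
search itself is unchanged as long as the array stays sorted and
the ranges do not overlap.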

[Denmark (Keld Simonsen) commented re C++ CD2: 
Due to the change in ISO/IEC 10646 of the encoding of Hangul characters,
we propose to change the allowable characters defined for extended
identifiers as follows:

Remove the range U3400..U4DFF
insert the range UAC00..UD7AF

This change has also been processed to DTR 10176.]