JTC1/SC22/WG14 N951


WG14/N951

String literals and concatenation

Clive Feather
Last changed 2001-08-14


Introduction
============

There is an inconsistency in the rules for string literal concatenation
and the relationship between source and execution character sets. This
paper discusses this inconsistency and suggests a new model and associated
changes to the Standard.

This paper was written following discussions on the WG14 reflector, with
particular input from Tanaka Keishiro and Antoine Leca.


Standard text
=============

The following text from the Standard is relevant.

Translation phase 1:

     Physical source file multibyte characters are  mapped,
     in  an  implementation-defined  manner,  to the source
     character set  (introducing  new-line  characters  for
     end-of-line   indicators)   if   necessary.

Translation phase 3:

     The  source  file  is  decomposed  into  preprocessing
     tokens   and  sequences  of  white-space  characters
     (including comments).

Translation phase 5:

     Each source character set member and  escape  sequence
     in   character   constants   and  string  literals  is
     converted to the corresponding member of the execution
     character set; if there is no corresponding member, it
     is converted to an implementation-defined member other
     than the null (wide) character.

Translation phase 6:

     Adjacent string literal tokens are concatenated.

6.4.5#1:

     [#1]

     string-literal:
         " s-char-sequence-opt "
         L" s-char-sequence-opt "

     s-char-sequence:
         s-char
         s-char-sequence s-char

     s-char:
         any member of the source character set except the double-quote ",
             backslash \, or new-line character
         escape-sequence

6.4.5#4:

     [#4]   In  translation  phase  6,  the  multibyte  character
     sequences specified by any sequence  of  adjacent  character
     and  wide  string  literal  tokens  are  concatenated into a
     single multibyte character sequence.  If any of  the  tokens
     are  wide  string  literal  tokens,  the resulting multibyte
     character sequence is treated  as  a  wide  string  literal;
     otherwise, it is treated as a character string literal.

6.4.5#5:

     [#5] In translation phase 7, a byte or code of value zero is
     appended to each multibyte character sequence  that  results
     from   a  string  literal  or  literals.  The  multibyte
     character sequence is then used to initialize  an  array  of
     static  storage  duration  and  length  just  sufficient  to
     contain the sequence.  For character  string  literals,  the
     array  elements have type char, and are initialized with the
     individual bytes of the multibyte  character  sequence;  for
     wide  string literals, the array elements have type wchar_t,
     and are initialized with the  sequence  of  wide  characters
     corresponding   to  the  multibyte  character  sequence,  as
     defined by the mbstowcs  function  with  an  implementation-
     defined  current  locale.   The  value  of  a string literal
     containing a multibyte  character  or  escape  sequence  not
     represented    in    the    execution   character   set   is
     implementation-defined.


Problems
========

Consider code like:

     L"abc" "def"

The 6.4.5#4 text says that the multibyte sequences in the literals are
concatenated into a single sequence in translation phase 6. But, on the
other hand, multibyte characters were mapped to source characters in TP1,
and the source characters were then mapped to execution character set
characters in TP5. So there are no multibyte sequences available in TP6 to
be concatenated.

There are then further problems. Consider a string literal containing a
UCN:

     L"\u8868"

At TP5 this is converted to a member of the execution character set, but
at TP7 (6.4.5#5) this literal is supposed to generate a multibyte
character that can be fed to mbstowcs. Nowhere is it explained where this
multibyte character comes from.

Finally, consider an implementation where the two byte sequence 0x95 0x5C
is the source encoding of U+8868. Look at the following literals:

     L"@\"              (@ represents the byte with value 0x95)
     L"\x95\x5C"
     L"\x95" "\\"

At TP5 the second of these is effectively converted to the first, and
after concatenation in TP6 so is the third. This means that all of these
literals generate an array of one element, holding the wide character
with value 0x955C. This is somewhat counter-intuitive, and Tanaka-san
states that it is not what users will expect or implementers will produce.

The alternative is to assume that TP1 will convert the first literal to
some internal character. But in this case TP7 lacks anything obvious to
pass to mbstowcs, and the other two cases still generate the "wrong"
answer.


Some examples of desired output
===============================

Our next step was to consider a range of examples and note what we thought
they should produce.

Example               Array type       Array contents

1:  "ABC"             (char [4])    { 0x41, 0x42, 0x43, 0x00 }
2:  "\x12" "34"       (char [4])    { 0x12, 0x33, 0x34, 0x00 }
3:  "\x95" "\\"       (char [3])    { 0x95, 0x5C, 0x00 }
4:  "@\"              (char [3])    { 0x95, 0x5C, 0x00 }
5:  "@" "\\"          (char [3])    { 0x95, 0x5C, 0x00 } OR UNDEFINED

6:  L"ABC"            (wchar_t [4]) { 0x0041, 0x0042, 0x0043, 0x0000 }
7:  L"\u8868"         (wchar_t [2]) { 0x955C, 0x0000 }
8:  L"\x95\\"         (wchar_t [3]) { 0x0095, 0x005C, 0x0000 }
9:  L"\x95" L"\\"     (wchar_t [3]) { 0x0095, 0x005C, 0x0000 }
10: "\x95" L"\\"      (wchar_t [3]) { 0x0095, 0x005C, 0x0000 }
11: L"@\"             (wchar_t [2]) { 0x955C, 0x0000 }
12: L"\x955C"         (wchar_t [2]) { 0x955C, 0x0000 }
13: L"\x95"           (wchar_t [2]) { 0x0095, 0x0000 }

14: "@\\"             UNDEFINED
15: "@" "\"           UNDEFINED

Example 14 is undefined because @\ is read as a single multibyte
character, so the following \" is an escape sequence and the literal is
unterminated.

Example 15 depends on whether @" is a valid multibyte sequence or not.
If it is, then the third " terminates the literal and the backslash causes
a syntax error. If it is not, the second literal is unterminated.

Example 5 is defined or undefined in the same way.
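
The rows of the table that do not depend on the source encoding can be
checked mechanically. The following is only a sketch: it assumes a C99 or
later compiler, an ASCII-based execution character set, and 8-bit bytes.
The names ex1, want1, check_table_examples and so on are ours and do not
come from the Standard. Examples 4, 5, 7, 11, 12, 14 and 15 are left out
because they depend on the source encoding or on the width of wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Examples 1 to 3: character string literals. */
static const char ex1[] = "ABC";
static const char ex2[] = "\x12" "34";  /* the hex escape stops at the closing quote */
static const char ex3[] = "\x95" "\\";

/* Examples 6, 8, 9, 10 and 13: wide string literals. */
static const wchar_t ex6[]  = L"ABC";
static const wchar_t ex8[]  = L"\x95\\";
static const wchar_t ex9[]  = L"\x95" L"\\";
static const wchar_t ex10[] = "\x95" L"\\";
static const wchar_t ex13[] = L"\x95";

/* Expected contents copied from the table (letter codes assume ASCII). */
static const unsigned char want1[] = { 0x41, 0x42, 0x43, 0x00 };
static const unsigned char want2[] = { 0x12, 0x33, 0x34, 0x00 };
static const unsigned char want3[] = { 0x95, 0x5C, 0x00 };
static const wchar_t want6[]  = { 0x0041, 0x0042, 0x0043, 0x0000 };
static const wchar_t want8[]  = { 0x0095, 0x005C, 0x0000 };  /* also 9 and 10 */
static const wchar_t want13[] = { 0x0095, 0x0000 };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))
#define SAME_BYTES(a, w) \
    (sizeof(a) == sizeof(w) && memcmp((a), (w), sizeof(w)) == 0)
#define SAME_WIDES(a, w) \
    (COUNT(a) == COUNT(w) && wmemcmp((a), (w), COUNT(w)) == 0)

/* Hypothetical helper: asserts that each literal matches its table row. */
void check_table_examples(void)
{
    assert(SAME_BYTES(ex1, want1));
    assert(SAME_BYTES(ex2, want2));
    assert(SAME_BYTES(ex3, want3));
    assert(SAME_WIDES(ex6, want6));
    assert(SAME_WIDES(ex8, want8));
    assert(SAME_WIDES(ex9, want8));
    assert(SAME_WIDES(ex10, want8));
    assert(SAME_WIDES(ex13, want13));
}
```

Calling check_table_examples() from a test program should pass silently
on an implementation that satisfies the assumptions above.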


Principles
==========

From consideration of various examples we can derive a set of basic
principles for string literals.

[P1] The sequences:

     L"a" L"b"
     L"a"  "b"
      "a" L"b"

are completely equivalent. The final type of a concatenated string literal
depends only on whether any of the components have an L prefix, and not on
which ones they are.
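
As an illustration of P1, the sketch below initializes three wchar_t
arrays from the three spellings and compares them. It assumes a C99 or
later compiler (mixed concatenation was not defined in C90); the array
names and check_p1 are hypothetical.

```c
#include <assert.h>
#include <wchar.h>

/* Each initializer is accepted only because the result is a wide literal. */
static const wchar_t both_wide[]   = L"a" L"b";
static const wchar_t first_wide[]  = L"a"  "b";
static const wchar_t second_wide[] =  "a" L"b";

/* Hypothetical helper: asserts that all three arrays are identical. */
void check_p1(void)
{
    assert(sizeof both_wide == sizeof first_wide);
    assert(sizeof both_wide == sizeof second_wide);
    /* Three elements each: 'a', 'b' and the terminating null. */
    assert(wmemcmp(both_wide, first_wide, 3) == 0);
    assert(wmemcmp(both_wide, second_wide, 3) == 0);
}
```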

[P2] The sequences:

     "abc"
     "ab" "c"
     "a" "bc"

are completely equivalent. The division of the string into literals does
not alter the final array. However, this applies only when the literals
consist of the same s-chars; the sequences:

     "\x1234"
     "\x12" "34"

are not equivalent because they involve different s-chars.
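
P2 can be sketched in the same way. The names below are hypothetical.
The unsplit form "\x1234" is deliberately left out, since its value does
not fit in an 8-bit char; only the split form is checked, to show that
the escape sequence ends where the first literal ends.

```c
#include <assert.h>
#include <string.h>

static const char whole[]       = "abc";
static const char split_late[]  = "ab" "c";
static const char split_early[] = "a" "bc";

/* The split ends the hex escape, so the s-chars are \x12, 3 and 4. */
static const char split_escape[] = "\x12" "34";

/* Hypothetical helper: asserts the equivalences described in P2. */
void check_p2(void)
{
    assert(sizeof whole == sizeof split_late);
    assert(memcmp(whole, split_late, sizeof whole) == 0);
    assert(sizeof whole == sizeof split_early);
    assert(memcmp(whole, split_early, sizeof whole) == 0);

    /* Four elements: the escape, two digit characters, and the null. */
    assert(sizeof split_escape == 4);
    assert(split_escape[0] == 0x12);
    assert(split_escape[1] == '3');
    assert(split_escape[2] == '4');
    assert(split_escape[3] == '\0');
}
```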

[P3] Multibyte sequences are converted to single source characters during
TP1, and so each multibyte sequence is a single s-char.

[P4] The literal "@\" contains one s-char but the literal "\x95\\"
contains two. These are not equivalent, and the latter is not merged to
form a multibyte character later on.

[P5] The two string literals:
      "abc"
     L"abc"
should be related. More precisely, applying mbstowcs to the former should
produce the latter.
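
A minimal sketch of the relationship in P5, assuming the program runs in
the default "C" locale and that the basic characters convert to the same
wide values the compiler uses for L"abc". The helper name check_p5 is
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts "abc" at run time and compares the
   result with the wide literal produced at translation time. */
void check_p5(void)
{
    wchar_t converted[4];

    /* Room for three wide characters plus the terminating null. */
    size_t n = mbstowcs(converted, "abc", 4);

    assert(n == 3);
    assert(wcscmp(converted, L"abc") == 0);
}
```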

[P6] When the final result will be a wchar_t array, each s-char in the
source generates exactly one element of the array.

[P7] When the final result will be a char array:
- a single byte source character generates exactly one byte
- an escape sequence generates exactly one byte
- a non-single byte multibyte source character generates one or more
   bytes, and:
   * mbstowcs applied to the sequence produces a single wide character;
   * where it makes sense, the byte sequence in the array is the direct
     analogue of the source multibyte character.

[P8] When the final result will be a wchar_t array, source shift sequences
are not separate s-chars and do not map to separate elements of the array.

[P9] When the final result will be a char array, source shift sequences
should appear in the array to the extent it makes sense (by analogy with
the last sub-bullet of P7).


New model
=========

Applying these principles to the processes in the Standard, we can
construct a new model.

The source character set contains the 95 required characters and the "new
line" indicator. It also contains one additional character for every valid
multibyte character (making allowance for shift states).

For example, suppose that a given encoding consists of:
- codes 1 to 96 are the required characters;
- codes 101 to 120 are always followed by a code from 1 to 100, and each
   pair represents a character;
- codes 121 to 127 each represent one of four characters depending on the
   choice of shift state;
- codes 97 to 100 select a shift state; this only affects codes 121 to 127.
Therefore the entire encoding contains 96+20*100+7*4 = 2124 characters,
and that is the size of the source character set.
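
Counting each range directly (codes 121 to 127 give seven shift-dependent
codes), the size of this hypothetical source character set can be tallied
as in the sketch below. The enumerator and function names are ours.

```c
#include <assert.h>

enum {
    SINGLE_CODES  = 96,             /* codes 1 to 96: required characters */
    LEAD_CODES    = 120 - 101 + 1,  /* codes 101 to 120: first of a pair  */
    TRAIL_CODES   = 100,            /* codes 1 to 100: second of a pair   */
    SHIFTED_CODES = 127 - 121 + 1,  /* codes 121 to 127                   */
    SHIFT_STATES  = 4               /* selected by codes 97 to 100        */
};

/* Number of distinct characters the example encoding can express. */
int source_charset_size(void)
{
    return SINGLE_CODES
         + LEAD_CODES * TRAIL_CODES
         + SHIFTED_CODES * SHIFT_STATES;
}

/* Hypothetical helper: checks the tally 96 + 2000 + 28. */
void check_source_charset_size(void)
{
    assert(source_charset_size() == 2124);
}
```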

Translation phase 1 converts all input to characters from this set. Thus
the sequence:

     1  81  81  78  46 100 122 122 101  54
     A   ?   ?   /   t       $   $       `

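is converted to the six-character source sequence A\t$$` (the trigraph
??/ becomes a backslash, code 100 only selects a shift state, and the
pair 101 54 forms a single character).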

If a source character can be generated in more than one way (e.g. through
the use of alternative shift sequences), an implementation is free to
annotate the character with this information. This annotation is used
later.

Within string literals, these sequences are parsed into s-chars during
TP3; in this case there are 5 such s-chars. Other source code also works
in terms of these source characters.

TP4 stringisation and token pasting work in terms of these source
characters.

The execution character set needs essentially the same set of characters
as the source had. At TP5 each s-char in a string literal is converted to
the corresponding execution character set character. At this point the
distinction between multibyte characters, UCNs, and other escape sequences
is lost (so \t, \x9 (or whatever), and an actual source tab all produce
the same character). At TP6 the sequences of characters are simply
concatenated without further change.
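
The loss of distinction between the spellings can be observed directly.
This sketch assumes an ASCII-based execution character set, where the
horizontal tab has the value 9; the array names and check_tab_spellings
are hypothetical.

```c
#include <assert.h>
#include <string.h>

static const char named[] = "a\tb";      /* simple escape sequence */
static const char hexed[] = "a\x9" "b";  /* split so that b is not read as a hex digit */
static const char octal[] = "a\11b";     /* b is not an octal digit, so the escape is \11 */

/* Hypothetical helper: asserts that all three spellings give one array. */
void check_tab_spellings(void)
{
    assert(sizeof named == 4);
    assert(sizeof hexed == sizeof named);
    assert(sizeof octal == sizeof named);
    assert(memcmp(named, hexed, sizeof named) == 0);
    assert(memcmp(named, octal, sizeof named) == 0);
}
```

The split in the second literal is needed for the same reason as in P2:
written as one literal, "a\x9b" would contain the single escape \x9b.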

At TP7 each character in the execution character set generates either:
- a single wide character (in a wide string literal), or
- a sequence of one or more chars (in a character string literal).
In the latter case, if the corresponding s-char came from a multibyte
character the sequence should match it if possible. The annotation
mechanism described above is one way to do this.


Proposed changes
================

The following changes to the Standard are required to put this model into
effect.

Firstly we specify this model in some detail:

     5.2.1.3 Character encoding model

     [#1] Translation phase 1 establishes the boundaries between multibyte
     characters in the source. These are converted into /source character
     encoding units/ that encode a single member of the source character
     set (any shift sequences are merged with an adjacent unit). Source
     character encoding units are never split or merged in subsequent
     translation phases.

     [#2] In translation phase 3, each source character encoding unit that
     is not a member of the basic character set will become:
     - an identifier-nondigit within an identifier or pp-number
     - an h-char or q-char in a header-name
     - a c-char within a character-constant
     - an s-char within a string-literal, or
     - a preprocessing-token on its own.

     [#3] In translation phase 5, each c-char or s-char is converted to a
     single /execution character encoding unit/ (ECEU). Each character
     constant and string literal therefore becomes a sequence of ECEUs.
     Note that there may be several representations of the same ECEU:
     - a source character encoding unit, possibly derived from a multibyte
       sequence
     - a universal character name,
     - a special escape sequence such as \t, or
     - an octal or hexadecimal escape sequence

     [#4] In translation phase 6, string literals are concatenated by
     concatenating the ECEU sequences into a single sequence; the
     total number of ECEUs involved is unchanged.

     [#5] In translation phase 7, a string literal is converted to an array
     of values by first appending an ECEU, representing the null
     character, to the ECEU sequence. If it is a character string literal,
     each ECEU then generates one or more elements of the char array; the
     precise elements generated may depend on the source character
     encoding unit that the ECEU derives from. If it is a wide string
     literal, each ECEU generates one element of the wchar_t array.

     [#6] Two character string literals or two wide string literals derived
     from the same sequence of source character encoding units shall
     generate identical arrays. A character string literal and a wide
     string literal derived from the same sequence shall generate arrays
     that correspond, as defined by the mbstowcs function with an
     implementation-defined current locale.

Next we need to make the explanation of string concatenation in 6.4.5#4
use this new model. This completely replaces the old text:

     [#4] In translation phase 6, the contents of adjacent
     character and wide string literal tokens are concatenated into
     a single token as described in 5.2.1.3. If any of the tokens
     are wide string literal tokens, the resulting token is
     a wide string literal; otherwise, it is a character string
     literal.

Finally we need to make the explanation of string literals in 6.4.5#5
also use this new model. Again, this completely replaces the old text.

     [#5] In translation phase 7, a code of value zero (representing the
     null character) is appended to each string literal. The contents of
     the literal are then used to initialize an array of static storage
     duration and length just sufficient to contain the sequence. For
     character string literals, the array elements have type char; for
     wide string literals, the array elements have type wchar_t. The
     array is initialized as described in 5.2.1.3.

[No text replaces the last sentence of the current #5, as it duplicates
a requirement in TP5.]

If it is preferred that 6.4.5#5 not contain the reference to 5.2.1.3,
an alternative way to word the former would be:

     [#5] In translation phase 7, a code of value zero (representing the
     null character) is appended to each string literal. The contents of
     the literal are then used to initialize an array of static storage
     duration and length just sufficient to contain the sequence. For
     character string literals, the array elements have type char; each
     ECEU in the string literal (taken in order) determines the value of
     one or more elements (the precise values may depend on the source
     character encoding unit(s) that the ECEU derives from). For wide
     string literals, the array elements have type wchar_t; each ECEU in
     the string literal determines the initial value for the corresponding
     element.

If so, 5.2.1.3#5 should be deleted and "in translation phase 7" should be
added to the end of the first sentence of 5.2.1.3#6.