Doc No:	N1488
Date:	2010-05-26
Reply to:	Clark Nelson `<[email protected]>`

UTF-8 string literals

Introduction

At the 2008-09 (Milpitas) meeting, N1333 was voted into the WD. That paper proposed adding UTF-8 string literals and raw string literals, from C++0X, as well as the char16_t and char32_t string forms from TR 19769 (also from C++0X). Problems arose in adding raw string literals, related to differences in translation models between C and C++. At the 2009-03 (Markham) meeting, N1333 was withdrawn (without prejudice) by a unanimous straw poll.

Since then, the string forms from TR 19769 were added to the WD. It was also decided that implementers are free to support raw strings in their C compilers, and that the motivation for adding raw strings to C1X is feeble at best, especially considering their complexity.

However, UTF-8 strings do not suffer from the same complexity. They were combined with raw strings by historical accident: they were adopted by C++ at the same time, and modified the same passage of the standard. WG14 never consciously decided not to adopt UTF-8 strings; they too were caught up in the whirlwind that is raw strings, and were dropped inadvertently.

This paper describes changes to the WD necessary to add UTF-8 strings.

Notes on terminology

In C++, an unprefixed string literal token is called an "ordinary string literal". Ordinary string literals and UTF-8 string literals are both classed as "narrow string literals", because in both cases the element type is char. However, the term "narrow string literal" is actually used only three times in two paragraphs of the C++ WD.

In C, an unprefixed string literal is called a "character string literal". In the status quo, every prefixed string literal is (some kind of) wide string literal. The term "character string literal" is used in several places, especially in the preprocessor description, where it is clear that an unprefixed string literal is exactly what is meant. So the term "UTF-8 string literal" is introduced to refer to a prefixed string literal that is nevertheless not a wide string literal.

WG14 has already rejected a proposal to drop the term "character string literal" in favor of some other term (and WG21 has rejected the term "character string literal" for an unprefixed string literal).

The amended grammar for string-literal exactly parallels that in C++, omitting only the alternative for raw strings. The new non-terminal encoding-prefix is exactly as in C++. This new term considerably simplifies the discussion of string token concatenation.

WD changes

6.4.5 is presented in its entirety, with ~~deletions~~ and insertions indicated. (Some paragraphs, including the examples, are not modified.)

6.4.5 String literals

Syntax

1 string-literal:

encoding-prefix_opt " s-char-sequence_opt "

~~L" s-char-sequence_opt "~~

~~u" s-char-sequence_opt "~~

~~U" s-char-sequence_opt "~~

encoding-prefix:

u8

u

U

L

s-char-sequence:

s-char

s-char-sequence s-char

s-char:

any member of the source character set except
the double-quote ", backslash \, or new-line character

escape-sequence

Description

2 A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz". A UTF-8 string literal is the same, except prefixed by u8. A wide string literal is the same, except prefixed by the letter L, u, or U.

3 The same considerations apply to each element of the sequence in a ~~character~~ string literal ~~or a wide string literal~~ as if it were in an integer character constant (for a character or UTF-8 string literal) or a wide character constant (for a wide string literal), except that the single-quote ' is representable either by itself or by the escape sequence \', but the double-quote " shall be represented by the escape sequence \".

Constraints

A sequence of adjacent string literal tokens shall not include both a wide string literal and a UTF-8 string literal.

NOTE: Effectively, this constraint is present in C++ as well. An unprefixed string literal is considered "generic" — it is converted to match what it is concatenated with. And there's a good chance that wchar_t is the same size as either char16_t or char32_t, so an implementation is free to accept concatenations of dissimilar kinds of wide strings without comment. But concatenation of a wide string with an explicitly narrow string is considered a gratuitous error.
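To illustrate the constraint and the note above, here is a minimal sketch, assuming a compiler that already accepts the u8 prefix as proposed in this paper. The array and function names are hypothetical. An unprefixed token is concatenated with a u8 token and the result is treated as a UTF-8 string literal; the prohibited mixture appears only in a comment, since it is required to be diagnosed.

```c
#include <assert.h>
#include <string.h>

/* An unprefixed ("generic") token adopts the u8 prefix of its neighbour. */
static const char cafe_au_lait[] = u8"caf\u00e9" " au lait";

/* Constraint violation under the proposed wording (wide mixed with UTF-8):
 *     u8"caf\u00e9" L" au lait"
 */

/* Hypothetical helper: checks the concatenated result byte by byte. */
void check_generic_concatenation(void)
{
    /* "caf" (3) + U+00E9 as two code units (2) + " au lait" (8) + null (1) */
    assert(sizeof cafe_au_lait == 14);
    assert(memcmp(cafe_au_lait, "caf\xC3\xA9 au lait", 14) == 0);
}
```

The non-ASCII character is deliberately kept in the prefixed token and written as a universal character name, so the sketch does not depend on the source or execution character set.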

Semantics

4 In translation phase 6, the multibyte character sequences specified by any sequence of adjacent character and identically-prefixed ~~wide~~ string literal tokens are concatenated into a single multibyte character sequence. If any of the tokens ~~are wide string literal tokens~~ has an encoding prefix, the resulting multibyte character sequence is treated as ~~a wide string literal with~~ having the same prefix; otherwise, it is treated as a character string literal. Whether differently-prefixed wide string literal tokens can be concatenated and, if so, the treatment of the resulting multibyte character sequence are implementation-defined.

5 In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals.^FOOTNOTE) The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char, and are initialized with the individual bytes of the multibyte character sequence. For UTF-8 string literals, the array elements have type char and are initialized with the characters of the multibyte character sequence, as encoded in UTF-8. For wide string literals prefixed by the letter L, the array elements have type wchar_t and are initialized with the sequence of wide characters corresponding to the multibyte character sequence, as defined by the mbstowcs function with an implementation-defined current locale. For wide string literals prefixed by the letter u or U, the array elements have type char16_t or char32_t, respectively, and are initialized with the sequence of wide characters corresponding to the multibyte character sequence, as defined by successive calls to the mbrtoc16, or mbrtoc32 function as appropriate for its type, with an implementation-defined current locale. The value of a string literal containing a multibyte character or escape sequence not represented in the execution character set is implementation-defined.

^FOOTNOTE) A ~~character~~ string literal need not be a string (see 7.1.1), because a null character may be embedded in it by a \0 escape sequence.
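As an informal illustration of paragraph 5 and its footnote (not part of the proposed wording), the following sketch assumes an implementation of the u8 prefix as described here. The names are hypothetical. It shows that the elements are the UTF-8 code units of the character, followed by the appended zero, and that an embedded null makes the array longer than the string it contains.

```c
#include <assert.h>
#include <string.h>

/* U+20AC EURO SIGN, spelled as a universal character name. */
static const char euro[] = u8"\u20ac";

/* A null character embedded by the \0 escape sequence. */
static const char embedded[] = u8"ab\0cd";

/* Hypothetical helper: inspects the array elements individually. */
void check_utf8_elements(void)
{
    /* Three UTF-8 code units (E2 82 AC) plus the appended zero. */
    assert(sizeof euro == 4);
    assert((unsigned char)euro[0] == 0xE2);
    assert((unsigned char)euro[1] == 0x82);
    assert((unsigned char)euro[2] == 0xAC);
    assert(euro[3] == 0);

    /* Six elements in the array, but the string ends at the first null. */
    assert(sizeof embedded == 6);
    assert(strlen(embedded) == 2);
}
```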

6 It is unspecified whether these arrays are distinct provided their elements have the appropriate values. If the program attempts to modify such an array, the behavior is undefined.

7 EXAMPLE 1 This pair of adjacent character string literals
"\x12" "3"
produces a single character string literal containing the two characters whose values are '\x12' and '3', because escape sequences are converted into single members of the execution character set just prior to adjacent string literal concatenation.

8 EXAMPLE 2 Each of the sequences of adjacent string literal tokens
"a" "b" L"c"
"a" L"b" "c"
L"a" "b" L"c"
L"a" L"b" L"c"
is equivalent to the string literal
L"abc"
Likewise, each of the sequences
"a" "b" u"c"
"a" u"b" "c"
u"a" "b" u"c"
u"a" u"b" u"c"
is equivalent to
u"abc"
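The examples themselves are not modified by this paper, but by analogy the same equivalences would be expected to hold for the u8 prefix under the amended paragraph 4. The following is a minimal sketch of that expectation, assuming an implementation of the proposed wording; the array and function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

static const char expected[] = u8"abc";

/* The four concatenation patterns of EXAMPLE 2, with u8 in place of L or u. */
static const char form1[] = "a" "b" u8"c";
static const char form2[] = "a" u8"b" "c";
static const char form3[] = u8"a" "b" u8"c";
static const char form4[] = u8"a" u8"b" u8"c";

/* Hypothetical helper: each form should match u8"abc" element for element. */
void check_utf8_example(void)
{
    assert(sizeof form1 == sizeof expected);
    assert(sizeof form2 == sizeof expected);
    assert(sizeof form3 == sizeof expected);
    assert(sizeof form4 == sizeof expected);
    assert(memcmp(form1, expected, sizeof expected) == 0);
    assert(memcmp(form2, expected, sizeof expected) == 0);
    assert(memcmp(form3, expected, sizeof expected) == 0);
    assert(memcmp(form4, expected, sizeof expected) == 0);
}
```

On an implementation whose execution character set is ASCII-based, the comparison is unremarkable; it becomes meaningful where the unprefixed encoding of these letters differs from UTF-8.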

In C++, the definition of "byte" was also augmented, along the following lines (see 3.6 of the C WD):

addressable unit of data storage large enough to hold any member of the basic character set of the execution environment, or any eight-bit code unit of the Unicode UTF-8 encoding form

I'm not 100% sure this change is necessary, given that CHAR_BIT is required to be at least 8 anyway, but it is presented here for consideration.
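The observation about CHAR_BIT can be sketched as follows. This does not settle whether the change to the definition of "byte" is needed; it only shows, under the assumption of an implementation of the u8 prefix as proposed, that each eight-bit code unit occupies one array element. The names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Already guaranteed by the limits on CHAR_BIT in <limits.h>. */
_Static_assert(CHAR_BIT >= 8, "a byte can hold an eight-bit UTF-8 code unit");

/* U+00FF encodes as the two code units C3 BF. */
static const char y_diaeresis[] = u8"\u00ff";

/* Hypothetical helper: reads one code unit as an unsigned value. */
unsigned int code_unit(size_t i)
{
    return (unsigned char)y_diaeresis[i];
}
```

The conversion to unsigned char is used because plain char may be signed, in which case a code unit above 0x7F would otherwise read back as a negative value.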

Change the appropriate bullet of 5.2.4.1p1:

4095 characters in a ~~character~~ string literal ~~or wide string literal~~ (after concatenation)

Change 6.7.9p14:

An array of character type may be initialized by a character string literal or UTF-8 string literal, optionally enclosed in braces. Successive characters of the ~~character~~ string literal (including the terminating null character if there is room or if the array is of unknown size) initialize the elements of the array.
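As an illustration of the amended 6.7.9p14, the following sketch initializes arrays of char from UTF-8 string literals. It assumes an implementation of the proposed wording, and the function and variable names are hypothetical. It covers an array of unknown size, the optional braces, an array with no room for the terminating null character, and an array larger than the literal.

```c
#include <assert.h>

/* Hypothetical helper: exercises the initialization forms of 6.7.9p14. */
void check_utf8_initializers(void)
{
    char unknown_size[] = u8"\u00e9";   /* C3 A9 plus the null: 3 elements */
    char braced[]       = { u8"hi" };   /* the braces are optional */
    char no_room[2]     = u8"\u00e9";   /* no room: the null is not stored */
    char padded[8]      = u8"hi";       /* the remaining elements are zero */

    assert(sizeof unknown_size == 3);
    assert(unknown_size[2] == '\0');
    assert(sizeof braced == 3);
    assert(sizeof no_room == 2);
    assert((unsigned char)no_room[0] == 0xC3);
    assert((unsigned char)no_room[1] == 0xA9);
    assert(padded[2] == '\0' && padded[7] == '\0');
}
```

Note that no_room is not a string, since it has no terminating null character; some compilers may warn about that form even though it is permitted.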