Doc No: | N1488 |
---|---|
Date: | 2010-05-26 |
Reply to: | Clark Nelson <[email protected]> |
At the 2008-09 (Milpitas) meeting, N1333 was voted into the WD. That paper proposed adding UTF-8 string literals and raw string literals, from C++0X, as well as the char16_t and char32_t string forms from TR 19769 (also from C++0X). Problems arose in adding raw string literals, related to differences in translation models between C and C++. At the 2009-03 meeting (Markham), N1333 was withdrawn (without prejudice) by a unanimous straw poll.
Since then, the string forms from TR 19769 were added to the WD. It was also decided that implementers can implement raw strings in their C compilers, and that the motivation for adding raw strings into C1X is feeble at best, especially considering their complexity.
However, UTF-8 strings do not suffer from the same complexity. They were combined with raw strings by historical accident: they were adopted by C++ at the same time, and modified the same passage of the standard. WG14 never consciously decided not to adopt UTF-8 strings; they too were caught up in the whirlwind that is raw strings, and were dropped inadvertently.
This paper describes changes to the WD necessary to add UTF-8 strings.
In C++, an unprefixed string literal token is called an "ordinary string literal". Ordinary string literals and UTF-8 string literals are both classed as "narrow string literals", because in both cases the element type is char. However, the term "narrow string literal" is actually used only three times in two paragraphs of the C++ WD.
In C, an unprefixed string literal is called a "character string literal". In the status quo, every prefixed string literal is (some kind of) wide string literal. The term "character string literal" is used in several places, especially in the preprocessor description, where it is clear that an unprefixed string literal is exactly what is meant. So the term "UTF-8 string literal" is introduced to refer to a prefixed string literal that is nevertheless not a wide string literal.
WG14 has already rejected a proposal to drop the term "character string literal" in favor of some other term (and WG21 has rejected the term "character string literal" for an unprefixed string literal).
The amended grammar for string-literal exactly parallels that in C++, omitting only the alternative for raw strings. The new non-terminal encoding-prefix is exactly as in C++. This new term considerably simplifies the discussion of string token concatenation.
6.4.5 is presented in its entirety, with deletions and insertions
indicated. (Some paragraphs, including the examples, are not modified.)
6.4.5 String literals
Syntax
- 1 string-literal:
  - encoding-prefixopt " s-char-sequenceopt "
  - ~~L" s-char-sequenceopt "~~
  - ~~u" s-char-sequenceopt "~~
  - ~~U" s-char-sequenceopt "~~
- encoding-prefix:
  - u8
  - u
  - U
  - L
- s-char-sequence:
  - s-char
  - s-char-sequence s-char
- s-char:
  - any member of the source character set except the double-quote ", backslash \, or new-line character
  - escape-sequence
Description
2 A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz". A UTF-8 string literal is the same, except prefixed by u8. A wide string literal is the same, except prefixed by the letter L, u, or U.
3 The same considerations apply to each element of the sequence in a ~~character~~ string literal ~~or a wide string literal~~ as if it were in an integer character constant (for a character or UTF-8 string literal) or a wide character constant (for a wide string literal), except that the single-quote ' is representable either by itself or by the escape sequence \', but the double-quote " shall be represented by the escape sequence \".
Constraints
A sequence of adjacent string literal tokens shall not include both a wide string literal and a UTF-8 string literal.
NOTE: Effectively, this constraint is present in C++ as well. An unprefixed string literal is considered "generic": it is converted to match what it is concatenated with. And there's a good chance that wchar_t is the same size as either char16_t or char32_t, so an implementation is free to accept concatenations of dissimilar kinds of wide strings without comment. But concatenation of a wide string with an explicitly narrow string is considered a gratuitous error.
Semantics
4 In translation phase 6, the multibyte character sequences specified by any sequence of adjacent character and identically-prefixed ~~wide~~ string literal tokens are concatenated into a single multibyte character sequence. If any of the tokens ~~are wide string literal tokens~~ has an encoding prefix, the resulting multibyte character sequence is treated as ~~a wide string literal with~~ having the same prefix; otherwise, it is treated as a character string literal. Whether differently-prefixed wide string literal tokens can be concatenated and, if so, the treatment of the resulting multibyte character sequence are implementation-defined.
5 In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals.FOOTNOTE) The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char, and are initialized with the individual bytes of the multibyte character sequence. For UTF-8 string literals, the array elements have type char and are initialized with the characters of the multibyte character sequence, as encoded in UTF-8. For wide string literals prefixed by the letter L, the array elements have type wchar_t and are initialized with the sequence of wide characters corresponding to the multibyte character sequence, as defined by the mbstowcs function with an implementation-defined current locale. For wide string literals prefixed by the letter u or U, the array elements have type char16_t or char32_t, respectively, and are initialized with the sequence of wide characters corresponding to the multibyte character sequence, as defined by successive calls to the mbrtoc16, or mbrtoc32 function as appropriate for its type, with an implementation-defined current locale. The value of a string literal containing a multibyte character or escape sequence not represented in the execution character set is implementation-defined.
FOOTNOTE) A ~~character~~ string literal need not be a string (see 7.1.1), because a null character may be embedded in it by a \0 escape sequence.
6 It is unspecified whether these arrays are distinct provided their elements have the appropriate values. If the program attempts to modify such an array, the behavior is undefined.
7 EXAMPLE 1 This pair of adjacent character string literals
    "\x12" "3"
produces a single character string literal containing the two characters whose values are '\x12' and '3', because escape sequences are converted into single members of the execution character set just prior to adjacent string literal concatenation.
8 EXAMPLE 2 Each of the sequences of adjacent string literal tokens
    "a" "b" L"c"
    "a" L"b" "c"
    L"a" "b" L"c"
    L"a" L"b" L"c"
is equivalent to the string literal
    L"abc"
Likewise, each of the sequences
    "a" "b" u"c"
    "a" u"b" "c"
    u"a" "b" u"c"
    u"a" u"b" u"c"
is equivalent to
    u"abc"
In C++, the definition of "byte" was also augmented, along the following lines (see 3.6 of the C WD):
addressable unit of data storage large enough to hold any member of the basic character set of the execution environment, or any eight-bit code unit of the Unicode UTF-8 encoding form
I'm not 100% sure this change is necessary, given that CHAR_BIT is required to be at least 8 anyway, but it is presented here for consideration.
Change the appropriate bullet of 5.2.4.1p1:
- 4095 characters in a ~~character~~ string literal ~~or wide string literal~~ (after concatenation)
Change 6.7.9p14:
An array of character type may be initialized by a character string literal or UTF-8 string literal, optionally enclosed in braces. Successive characters of the ~~character~~ string literal (including the terminating null character if there is room or if the array is of unknown size) initialize the elements of the array.