Nxxx

JTC1/SC22/WG14 Document Nxxx

Title: Wide character code values for members of the basic character set.
Author: Raymond Mak
Author Affiliation: IBM Corp.
E-mail Address: [email protected]

Abstract: When using the C language to process Unicode, it is natural to bind wchar_t to UCS-2 or UCS-4. However, this causes a problem for EBCDIC-based systems because Standard C imposes a restriction on wide character code values. Specifically, the standard requires ('x' == L'x') to hold true if x is a member of the basic character set. This document explains the problem and suggests an amendment to the standard to provide leeway for EBCDIC systems.

Introduction:

C99 7.17 paragraph 2 specifies in part:

"...

wchar_t

which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales; the null character shall have the code value zero and each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant."

At issue here is the last part of the above sentence.

Since the code values of the basic characters in UCS-2 and UCS-4 are based on ASCII, EBCDIC systems cannot conform to the above subclause if the encoding of wchar_t is UCS-2 or UCS-4. This makes it difficult for EBCDIC systems to use Unicode with the C language.
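The requirement can be checked directly by comparing narrow and wide literals. The following is a minimal sketch, not part of the proposal; the names literal_pair, PAIR, sample and count_mismatches are hypothetical and chosen only for illustration. On an implementation that conforms to 7.17 paragraph 2 the count is zero. On an EBCDIC implementation that bound wchar_t to UCS-2 or UCS-4 it would be nonzero.

```c
#include <stddef.h>
#include <wchar.h>

/* Pairs a narrow character constant with its wide counterpart. */
struct literal_pair {
    int narrow;    /* value of 'x'  */
    wchar_t wide;  /* value of L'x' */
};

/* Builds { 'x', L'x' } from a single character constant token. */
#define PAIR(c) { c, L##c }

/* A small sample of basic character set members (hypothetical selection). */
static const struct literal_pair sample[] = {
    PAIR('a'), PAIR('Z'), PAIR('0'), PAIR(' '), PAIR('{'), PAIR('~')
};

/*
 * Counts sample characters for which 'x' != L'x'.
 * Example of a mismatch: in EBCDIC 'a' is 0x81, while the UCS code
 * value of the same letter is 0x0061.
 */
size_t count_mismatches(void)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof sample / sizeof sample[0]; i++) {
        if ((long)sample[i].narrow != (long)sample[i].wide)
            n++;
    }
    return n;
}
```

A caller would simply test whether count_mismatches() returns zero to find out if the sampled characters satisfy the restriction.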

There is no programming situation that really requires this restriction. In fact, it can be argued that a program naturally knows the type of characters (wide or normal) it is processing, so the appropriate character literal can always be used.

Note that a program can use the functions btowc and wctob (7.24.6.1.1 and 7.24.6.1.2) to handle mixed processing of wide and normal characters. The restriction in subclause 7.17 offers little additional value.
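As an illustration of mixed processing that does not rely on ('x' == L'x'), the sketch below converts a normal character with btowc before comparing it against wide characters, and uses wctob for the reverse direction. The function names count_in_wide and round_trips are hypothetical, and the sketch assumes the character is a valid single-byte character in the current locale.

```c
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/*
 * Counts how often the normal character c occurs in the wide string ws.
 * The normal character is converted with btowc first, so the comparison
 * does not depend on 'x' and L'x' having the same code value.
 */
size_t count_in_wide(const wchar_t *ws, char c)
{
    wint_t target = btowc((unsigned char)c);
    size_t n = 0;

    if (target == WEOF)      /* c has no single-byte wide equivalent */
        return 0;
    for (; *ws != L'\0'; ws++) {
        if ((wint_t)*ws == target)
            n++;
    }
    return n;
}

/* Returns 1 if c survives a btowc/wctob round trip, otherwise 0. */
int round_trips(char c)
{
    wint_t wc = btowc((unsigned char)c);

    if (wc == WEOF)
        return 0;
    return wctob(wc) == (int)(unsigned char)c;
}
```

For instance, count_in_wide(L"banana", 'a') counts the occurrences of the letter a without ever comparing a normal character constant to a wide one.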

Suggested Change to the Standard:

Change the last part of 7.17 paragraph 2 as follows:

"...

wchar_t

which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales; the null character shall have the code value zero; each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant if an implementation does not define __STDC_NARROW_WCHAR__."

The proposed change would allow an implementation to deviate from the last part of 7.17 paragraph 2 if it defines the macro __STDC_NARROW_WCHAR__. This would not affect ASCII-based systems, but would provide leeway for EBCDIC systems to process Unicode using C.
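A program that needs to remain portable under the proposed wording could test the macro at compile time. The sketch below is only one possible use, under the assumption that the macro behaves as proposed here; the function name narrow_equals_wide is hypothetical. Where the macro is not defined, the direct comparison remains valid.

```c
#include <wchar.h>

/*
 * Returns 1 if the normal character c and the wide character wc denote
 * the same character, otherwise 0.
 */
int narrow_equals_wide(char c, wchar_t wc)
{
#ifdef __STDC_NARROW_WCHAR__
    /* Proposed macro is defined: code values may differ, so convert. */
    wint_t w = btowc((unsigned char)c);
    return w != WEOF && (wchar_t)w == wc;
#else
    /* Macro not defined: 7.17 paragraph 2 guarantees equal code values
       for members of the basic character set. */
    return (wchar_t)c == wc;
#endif
}
```

The direct comparison in the second branch is meaningful only for members of the basic character set, since that is all the quoted text guarantees.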

=====================================================================END