slightly expanded and updated the Unicode overview

git-svn-id: https://svn.wxwidgets.org/svn/wx/wxWidgets/trunk@13059 c3d73ce0-8a6f-49c7-b76d-6d57e0e08775
Vadim Zeitlin
2001-12-17 16:52:22 +00:00
parent be03c0ec26
commit 8f684821f6

@@ -20,9 +20,11 @@ characters from languages other than English.
 Starting with release 2.1 wxWindows has support for compiling in Unicode mode
 on the platforms which support it. Unicode is a standard for character
 encoding which addresses the shortcomings of the previous, 8 bit standards, by
-using 16 bit for encoding each character. This allows to have 65536 characters
-instead of the usual 256 and is sufficient to encode all of the world
-languages at once. More details about Unicode may be found at {\tt www.unicode.org}.
+using at least 16 (and possibly 32) bits for encoding each character. This
+allows to have at least 65536 characters (what is called the BMP, or basic
+multilingual plane) and possible $2^{32}$ of them instead of the usual 256 and
+is sufficient to encode all of the world languages at once. More details about
+Unicode may be found at {\tt www.unicode.org}.
 % TODO expand on it, say that Unicode extends ASCII, mention ISO8859, ...
@@ -52,6 +54,8 @@ Basically, there are only a few things to watch out for:
 \item Character type ({\tt char} or {\tt wchar\_t})
 \item Literal strings (i.e. {\tt "Hello, world!"} or {\tt '*'})
 \item String functions ({\tt strlen()}, {\tt strcpy()}, ...)
+\item Special preprocessor tokens ({\tt \_\_FILE\_\_}, {\tt \_\_DATE\_\_}
+and {\tt \_\_TIME\_\_})
 \end{itemize}
 Let's look at them in order. First of all, each character in an Unicode
@@ -59,20 +63,27 @@ program takes 2 bytes instead of usual one, so another type should be used to
 store the characters ({\tt char} only holds 1 byte usually). This type is
 called {\tt wchar\_t} which stands for {\it wide-character type}.
-Also, the string and character constants should be encoded on 2 bytes instead
-of one. This is achieved by using the standard C (and C++) way: just put the
-letter {\tt 'L'} after any string constant and it becomes a {\it long}
-constant, i.e. a wide character one. To make things a bit more readable, you
-are also allowed to prefix the constant with {\tt 'L'} instead of putting it
-after it.
+Also, the string and character constants should be encoded using wide
+characters ({\tt wchar\_t} type) which typically take $2$ or $4$ bytes instead
+of {\tt char} which only takes one. This is achieved by using the standard C
+(and C++) way: just put the letter {\tt 'L'} after any string constant and it
+becomes a {\it long} constant, i.e. a wide character one. To make things a bit
+more readable, you are also allowed to prefix the constant with {\tt 'L'}
+instead of putting it after it.
-Finally, the standard C functions don't work with {\tt wchar\_t} strings, so
-another set of functions exists which do the same thing but accept
+Of course, the usual standard C functions don't work with {\tt wchar\_t}
+strings, so another set of functions exists which do the same thing but accept
 {\tt wchar\_t *} instead of {\tt char *}. For example, a function to get the
 length of a wide-character string is called {\tt wcslen()} (compare with
 {\tt strlen()} - you see that the only difference is that the "str" prefix
-standing for "string" has been replaced with "wcs" standing for
-"wide-character string").
+standing for "string" has been replaced with "wcs" standing for "wide-character
+string").
+And finally, the standard preprocessor tokens enumerated above expand to ANSI
+strings but it is more likely that Unicode strings are wanted in the Unicode
+build. wxWindows provides the macros {\tt \_\_TFILE\_\_}, {\tt \_\_TDATE\_\_}
+and {\tt \_\_TTIME\_\_} which behave exactly as the standard ones except that
+they produce ANSI strings in ANSI build and Unicode ones in the Unicode build.
 To summarize, here is a brief example of how a program which can be compiled
 in both ANSI and Unicode modes could look like:
@@ -82,10 +93,14 @@ in both ANSI and Unicode modes could look like:
 wchar_t wch = L'*';
 const wchar_t *ws = L"Hello, world!";
 int len = wcslen(ws);
+wprintf(L"Compiled at %s\n", __TDATE__);
 #else // ANSI
 char ch = '*';
 const char *s = "Hello, world!";
 int len = strlen(s);
+printf("Compiled at %s\n", __DATE__);
 #endif // Unicode/ANSI
 \end{verbatim}