slightly expanded and updated the Unicode overview

git-svn-id: https://svn.wxwidgets.org/svn/wx/wxWidgets/trunk@13059 c3d73ce0-8a6f-49c7-b76d-6d57e0e08775
2001-12-17 16:52:22 +00:00
parent be03c0ec26
commit 8f684821f6
1 changed files with 28 additions and 13 deletions
--- a/docs/latex/wx/tunicode.tex
+++ b/docs/latex/wx/tunicode.tex
@@ -20,9 +20,11 @@ characters from languages other than English.
 Starting with release 2.1 wxWindows has support for compiling in Unicode mode
 on the platforms which support it. Unicode is a standard for character
 encoding which addresses the shortcomings of the previous, 8 bit standards, by
-using 16 bit for encoding each character. This allows to have 65536 characters
-instead of the usual 256 and is sufficient to encode all of the world
-languages at once. More details about Unicode may be found at {\tt www.unicode.org}.
+using at least 16 (and possibly 32) bits for encoding each character. This
+allows for at least 65536 characters (what is called the BMP, or basic
+multilingual plane) and possibly $2^{32}$ of them instead of the usual 256 and
+is sufficient to encode all of the world languages at once. More details about
+Unicode may be found at {\tt www.unicode.org}.
 % TODO expand on it, say that Unicode extends ASCII, mention ISO8859, ...
@@ -52,6 +54,8 @@ Basically, there are only a few things to watch out for:
 \item Character type ({\tt char} or {\tt wchar\_t})
 \item Literal strings (i.e. {\tt "Hello, world!"} or {\tt '*'})
 \item String functions ({\tt strlen()}, {\tt strcpy()}, ...)
+\item Special preprocessor tokens ({\tt \_\_FILE\_\_}, {\tt \_\_DATE\_\_}
+and {\tt \_\_TIME\_\_})
 \end{itemize}
 Let's look at them in order. First of all, each character in an Unicode
@@ -59,20 +63,27 @@ program takes 2 bytes instead of usual one, so another type should be used to
 store the characters ({\tt char} only holds 1 byte usually). This type is
 called {\tt wchar\_t} which stands for {\it wide-character type}.
-Also, the string and character constants should be encoded on 2 bytes instead
-of one. This is achieved by using the standard C (and C++) way: just put the
-letter {\tt 'L'} after any string constant and it becomes a {\it long}
-constant, i.e. a wide character one. To make things a bit more readable, you
-are also allowed to prefix the constant with {\tt 'L'} instead of putting it
-after it.
+Also, the string and character constants should be encoded using wide
+characters ({\tt wchar\_t} type) which typically take $2$ or $4$ bytes instead
+of {\tt char} which only takes one. This is achieved by using the standard C
+(and C++) way: just put the letter {\tt 'L'} before any string or character
+constant and it becomes a {\it long} constant, i.e. a wide character one. For
+example, {\tt L"Hello, world!"} is a wide-character string and {\tt L'*'} is a
+wide character.
-Finally, the standard C functions don't work with {\tt wchar\_t} strings, so
-another set of functions exists which do the same thing but accept 
+Of course, the usual standard C functions don't work with {\tt wchar\_t}
+strings, so another set of functions exists which do the same thing but accept
 {\tt wchar\_t *} instead of {\tt char *}. For example, a function to get the
 length of a wide-character string is called {\tt wcslen()} (compare with 
 {\tt strlen()} - you see that the only difference is that the "str" prefix
-standing for "string" has been replaced with "wcs" standing for
-"wide-character string").
+standing for "string" has been replaced with "wcs" standing for "wide-character
+string").
+And finally, the standard preprocessor tokens enumerated above expand to ANSI
+strings but it is more likely that Unicode strings are wanted in the Unicode
+build. wxWindows provides the macros {\tt \_\_TFILE\_\_}, {\tt \_\_TDATE\_\_}
+and {\tt \_\_TTIME\_\_} which behave exactly as the standard ones except that
+they produce ANSI strings in ANSI build and Unicode ones in the Unicode build.
 To summarize, here is a brief example of how a program which can be compiled
 in both ANSI and Unicode modes could look like:
@@ -82,10 +93,14 @@ in both ANSI and Unicode modes could look like:
    wchar_t wch = L'*';
    const wchar_t *ws = L"Hello, world!";
    int len = wcslen(ws);
+   wprintf(L"Compiled at %ls\n", __TDATE__);
 #else // ANSI
    char ch = '*';
    const char *s = "Hello, world!";
    int len = strlen(s);
+   printf("Compiled at %s\n", __DATE__);
 #endif // Unicode/ANSI
 \end{verbatim}
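
As a rough companion to the example in this diff, the following C sketch illustrates the points the overview makes: wide literals take the L prefix, wide strings are stored as wchar_t elements, and wcslen() is the wide counterpart of strlen(). The helper names (narrow_length, wide_length, wide_storage_bytes) are hypothetical and not part of wxWindows; only standard C library calls are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of characters in a narrow (char) string. */
size_t narrow_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Hypothetical helper: number of wide characters in a wchar_t string.
   wcslen() is the wide-character counterpart of strlen(). */
size_t wide_length(const wchar_t *ws)
{
    assert(ws != NULL);
    return wcslen(ws);
}

/* Hypothetical helper: bytes occupied by the characters of a wide string,
   not counting the terminating L'\0'. The result depends on the platform's
   sizeof(wchar_t), which is typically 2 or 4. */
size_t wide_storage_bytes(const wchar_t *ws)
{
    assert(ws != NULL);
    return wcslen(ws) * sizeof(wchar_t);
}
```

For instance, wide_length(L"Hello, world!") and narrow_length("Hello, world!") both give 13, while the value of wide_storage_bytes(L"Hello, world!") varies with the size of wchar_t on the platform in use.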