%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Name:        tunicode.tex
%% Purpose:     Overview of the Unicode support in wxWindows
%% Author:      Vadim Zeitlin
%% Modified by:
%% Created:     22.09.99
%% RCS-ID:      $Id$
%% Copyright:   (c) 1999 Vadim Zeitlin <zeitlin@dptmaths.ens-cachan.fr>
%% Licence:     wxWindows license
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Unicode support in wxWindows}\label{unicode}

This section briefly describes the state of Unicode support in wxWindows.
Read it if you want to know how to write programs that can work with
characters from languages other than English.

\subsection{What is Unicode?}

Starting with release 2.1, wxWindows can be compiled in Unicode mode
on the platforms which support it. Unicode is a standard for character
encoding which addresses the shortcomings of the previous 8-bit standards by
using 16 bits to encode each character. This allows for 65536 characters
instead of the usual 256 and is intended to be sufficient to encode the
characters of all of the world's languages at once. More details about
Unicode may be found at {\tt www.unicode.org}.

% TODO expand on it, say that Unicode extends ASCII, mention ISO8859, ...

As this solution is obviously preferable to the previous ones (think of
incompatible encodings for the same language, locale chaos and so on), many
modern operating systems support it. Probably the first example is Windows NT,
which has used only Unicode internally since its very first version.

Writing internationalized programs is much easier with Unicode and, as the
support for it improves, it should become more and more so. Moreover, in the
Windows NT/2000 case, even programs which use only standard ASCII can profit
from using Unicode because they will work more efficiently: there will be no
need for the system to convert all strings the program uses to/from Unicode
each time a system call is made.

\subsection{Unicode and ANSI modes}

As not all platforms supported by wxWindows support Unicode (fully) yet, in
many cases it is unwise to write a program which can only work in a Unicode
environment. A better solution is to write programs in such a way that they may
be compiled either in ANSI (traditional) mode or in Unicode mode.

This can be achieved quite simply by using the means provided by wxWindows.
Basically, there are only a few things to watch out for:

\begin{itemize}
\item Character type ({\tt char} or {\tt wchar\_t})
\item Literal strings and characters (e.g. {\tt "Hello, world!"} or {\tt '*'})
\item String functions ({\tt strlen()}, {\tt strcpy()}, ...)
\end{itemize}

Let's look at them in order. First of all, each character in a Unicode
program typically takes 2 bytes instead of the usual one (the exact size is
platform-dependent), so another type should be used to store the characters
({\tt char} usually holds only 1 byte). This type is
called {\tt wchar\_t}, which stands for {\it wide-character type}.

Also, the string and character constants should be encoded using wide
characters instead of single bytes. This is achieved in the standard C (and
C++) way: just put the letter {\tt L} immediately before any string or
character constant and it becomes a wide-character constant, for example
{\tt L"Hello, world!"} or {\tt L'*'}.

Finally, the standard C string functions don't work with {\tt wchar\_t} strings, so
another set of functions exists which do the same thing but accept
{\tt wchar\_t *} instead of {\tt char *}. For example, the function which returns the
length of a wide-character string is called {\tt wcslen()}. Compare it with
{\tt strlen()}: the only difference is that the "str" prefix
standing for "string" has been replaced with "wcs" standing for
"wide-character string".

To summarize, here is a brief example of what a program which can be compiled
in both ANSI and Unicode modes could look like:

\begin{verbatim}
#ifdef __UNICODE__
    wchar_t wch = L'*';
    const wchar_t *ws = L"Hello, world!";
    int len = wcslen(ws);
#else // ANSI
    char ch = '*';
    const char *s = "Hello, world!";
    int len = strlen(s);
#endif // Unicode/ANSI
\end{verbatim}

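For reference, the same conditional approach can be written as a self-contained
piece of standard C++. This is only a sketch: {\tt USE\_WIDE\_CHARS},
{\tt char\_type}, {\tt greeting}, {\tt text\_length()} and {\tt text\_bytes()}
are hypothetical names chosen for illustration and are not defined by
wxWindows or by any compiler. It shows that the length in characters is the
same in both modes, while the storage size depends on the character type.

\begin{verbatim}
#include <cstddef>
#include <cstring>
#include <cwchar>

// USE_WIDE_CHARS is a hypothetical symbol used only in this sketch;
// define it (e.g. with -DUSE_WIDE_CHARS) to select the wide-character branch.
#ifdef USE_WIDE_CHARS
    typedef wchar_t char_type;
    static const char_type *greeting = L"Hello, world!";
    inline std::size_t text_length(const char_type *s) { return std::wcslen(s); }
#else // ANSI
    typedef char char_type;
    static const char_type *greeting = "Hello, world!";
    inline std::size_t text_length(const char_type *s) { return std::strlen(s); }
#endif // wide/ANSI

// Storage needed for the text, excluding the terminating null character.
// sizeof(wchar_t) is platform-dependent, so the result differs between modes.
inline std::size_t text_bytes(const char_type *s)
{
    return text_length(s) * sizeof(char_type);
}
\end{verbatim}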
Of course, it would be nearly impossible to write such programs if it had to
be done this way (try to imagine the number of {\tt \#ifdef}s an average
program would have!). Luckily, there is another way, described in the next
section.

\subsection{Unicode support in wxWindows}

In wxWindows, the code fragment from above should be written instead as:

\begin{verbatim}
    wxChar ch = wxT('*');
    wxString s = wxT("Hello, world!");
    int len = s.Len();
\end{verbatim}

What happens here? First of all, you see that there are no more {\tt \#ifdef}s
at all. Instead, we define some types and macros which behave differently in
the Unicode and ANSI builds and allow us to avoid using conditional
compilation in the program itself.

We have a {\tt wxChar} type which maps to either {\tt char} or {\tt wchar\_t}
depending on the mode in which the program is being compiled. There is no need for
a separate type for strings though, because the standard
\helpref{wxString}{wxstring} supports Unicode, i.e. it stores either ANSI or
Unicode strings depending on the compile mode.

Finally, there is a special {\tt wxT()} macro which should enclose all literal
strings in the program. As is easy to see by comparing the last fragment with
the one above, this macro leaves its argument unchanged in the (usual) ANSI
mode and prefixes {\tt L} to it in the Unicode mode.

The important conclusion is that if you use {\tt wxChar} instead of
{\tt char}, avoid using C style strings and use {\tt wxString} instead, and
don't forget to enclose all string literals in the {\tt wxT()} macro, your
program automatically becomes (almost) Unicode compliant!

Let us state the rules once again:

\begin{itemize}
\item Always use {\tt wxChar} instead of {\tt char}.
\item Always enclose literal string constants in the {\tt wxT()} macro unless
they're already converted to the right representation (another standard
wxWindows macro, {\tt \_()}, does it, so there is no need for {\tt wxT()} in this
case) or you intend to pass the constant directly to an external function
which doesn't accept wide-character strings.
\item Use {\tt wxString} instead of C style strings.
\end{itemize}

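The mechanism behind a character type and a literal macro of this kind can be
sketched in plain C++ without wxWindows. All names here
({\tt SKETCH\_UNICODE}, {\tt sketch\_char}, {\tt SKETCH\_T},
{\tt sketch\_string}, {\tt greeting\_length()}) are hypothetical and exist only
in this sketch; the actual wxWindows definitions may well differ. The point is
that the conditional compilation is confined to a few definitions, and the
code using them is identical in both modes.

\begin{verbatim}
#include <cstddef>
#include <string>

// Hypothetical names, not wxWindows symbols: the only #ifdef lives here.
#ifdef SKETCH_UNICODE
    typedef wchar_t sketch_char;
    #define SKETCH_T(x) L##x    // paste the L prefix onto the literal
#else // ANSI
    typedef char sketch_char;
    #define SKETCH_T(x) x       // leave the literal unchanged
#endif

// A string type which follows the character type (a stand-in for the
// role played by wxString in the text above).
typedef std::basic_string<sketch_char> sketch_string;

// This function compiles unchanged in both modes.
inline std::size_t greeting_length()
{
    sketch_char ch = SKETCH_T('*');
    sketch_string s = SKETCH_T("Hello, world!");
    s += ch;                    // append the character to the string
    return s.length();          // length in characters, not in bytes
}
\end{verbatim}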
\subsection{Unicode and the outside world}

We have seen that it is easy to write Unicode programs using wxWindows types
and macros, but it has also been mentioned that this isn't quite enough.
Although everything works fine inside the program, things can get nasty when
it tries to communicate with the outside world which, sadly, often expects
ANSI strings (a notable exception is the entire Win32 API, which accepts either
Unicode or ANSI strings and thus makes it unnecessary to ever perform
any conversions in the program).

To get an ANSI string from a wxString, you may use the
{\tt mb\_str()} function, which always returns an ANSI
string independently of the mode, while the usual
\helpref{c\_str()}{wxstringcstr} returns a pointer to the internal
representation, which is either ANSI or Unicode. More rarely used, but still
useful, is the {\tt wc\_str()} function, which always returns
a Unicode string.

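To give an idea of what such a conversion involves, here is a minimal sketch
using only the standard C library function {\tt wcstombs()}. The helper
{\tt to\_multibyte()} is a hypothetical name introduced for this example, and
nothing here is meant to describe how {\tt mb\_str()} is actually implemented;
in a wxWindows program you would simply call {\tt mb\_str()}. Note that the
result depends on the current C locale, so characters which the locale cannot
represent make the conversion fail.

\begin{verbatim}
#include <cstddef>
#include <cstdlib>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper for this sketch only: converts a wide-character
// string to a multibyte one using the current C locale. Returns an empty
// string if some character cannot be represented.
inline std::string to_multibyte(const wchar_t *ws)
{
    // Worst case: every character needs MB_CUR_MAX bytes, plus the final NUL.
    std::vector<char> buf(std::wcslen(ws) * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(&buf[0], ws, buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::string();   // conversion failed
    return std::string(&buf[0], n);
}
\end{verbatim}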
% TODO describe fn_str(), wx_str(), wxCharBuf classes, ...
% Please remember to put a blank line at the end of each file! (Tex2RTF 'issue')
