git-svn-id: https://svn.wxwidgets.org/svn/wx/wxWidgets/trunk@43838 c3d73ce0-8a6f-49c7-b76d-6d57e0e08775
		
			
				
	
	
		
			201 lines
		
	
	
		
			9.2 KiB
		
	
	
	
		
			TeX
		
	
	
	
	
	
			
		
		
	
	
			201 lines
		
	
	
		
			9.2 KiB
		
	
	
	
		
			TeX
		
	
	
	
	
	
| %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 | |
| %% Name:        tunicode.tex
 | |
| %% Purpose:     Overview of the Unicode support in wxWidgets
 | |
| %% Author:      Vadim Zeitlin
 | |
| %% Modified by:
 | |
| %% Created:     22.09.99
 | |
| %% RCS-ID:      $Id$
 | |
| %% Copyright:   (c) 1999 Vadim Zeitlin <zeitlin@dptmaths.ens-cachan.fr>
 | |
| %% Licence:     wxWindows license
 | |
| %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 | |
| 
 | |
| \section{Unicode support in wxWidgets}\label{unicode}
 | |
| 
 | |
| This section briefly describes the state of the Unicode support in wxWidgets.
 | |
| Read it if you want to know more about how to write programs able to work with
 | |
| characters from languages other than English.
 | |
| 
 | |
| \subsection{What is Unicode?}\label{whatisunicode}
 | |
| 
 | |
| wxWidgets has support for compiling in Unicode mode
 | |
| on the platforms which support it. Unicode is a standard for character
 | |
| encoding which addresses the shortcomings of the previous, 8 bit standards, by
 | |
| using at least 16 (and possibly 32) bits for encoding each character. This
 | |
| allows to have at least 65536 characters (what is called the BMP, or basic
 | |
| multilingual plane) and possible $2^{32}$ of them instead of the usual 256 and
 | |
| is sufficient to encode all of the world languages at once. More details about
 | |
| Unicode may be found at {\tt www.unicode.org}.
 | |
| 
 | |
| % TODO expand on it, say that Unicode extends ASCII, mention ISO8859, ...
 | |
| 
 | |
| As this solution is obviously preferable to the previous ones (think of
 | |
| incompatible encodings for the same language, locale chaos and so on), many
 | |
| modern operating systems support it. The probably first example is Windows NT
 | |
| which uses only Unicode internally since its very first version.
 | |
| 
 | |
| Writing internationalized programs is much easier with Unicode and, as the
 | |
| support for it improves, it should become more and more so. Moreover, in the
 | |
| Windows NT/2000 case, even the program which uses only standard ASCII can profit
 | |
| from using Unicode because they will work more efficiently - there will be no
 | |
| need for the system to convert all strings the program uses to/from Unicode
 | |
| each time a system call is made.
 | |
| 
 | |
| \subsection{Unicode and ANSI modes}\label{unicodeandansi}
 | |
| 
 | |
| As not all platforms supported by wxWidgets support Unicode (fully) yet, in
 | |
| many cases it is unwise to write a program which can only work in Unicode
 | |
| environment. A better solution is to write programs in such way that they may
 | |
| be compiled either in ANSI (traditional) mode or in the Unicode one.
 | |
| 
 | |
| This can be achieved quite simply by using the means provided by wxWidgets.
 | |
| Basically, there are only a few things to watch out for:
 | |
| 
 | |
| \begin{itemize}
 | |
| \item Character type ({\tt char} or {\tt wchar\_t})
 | |
| \item Literal strings (i.e. {\tt "Hello, world!"} or {\tt '*'})
 | |
| \item String functions ({\tt strlen()}, {\tt strcpy()}, ...)
 | |
| \item Special preprocessor tokens ({\tt \_\_FILE\_\_}, {\tt \_\_DATE\_\_} 
 | |
| and {\tt \_\_TIME\_\_})
 | |
| \end{itemize}
 | |
| 
 | |
| Let's look at them in order. First of all, each character in an Unicode
 | |
| program takes 2 bytes instead of usual one, so another type should be used to
 | |
| store the characters ({\tt char} only holds 1 byte usually). This type is
 | |
| called {\tt wchar\_t} which stands for {\it wide-character type}.
 | |
| 
 | |
| Also, the string and character constants should be encoded using wide
 | |
| characters ({\tt wchar\_t} type) which typically take $2$ or $4$ bytes instead
 | |
| of {\tt char} which only takes one. This is achieved by using the standard C
 | |
| (and C++) way: just put the letter {\tt 'L'} after any string constant and it
 | |
| becomes a {\it long} constant, i.e. a wide character one. To make things a bit
 | |
| more readable, you are also allowed to prefix the constant with {\tt 'L'}
 | |
| instead of putting it after it.
 | |
| 
 | |
| Of course, the usual standard C functions don't work with {\tt wchar\_t}
 | |
| strings, so another set of functions exists which do the same thing but accept
 | |
| {\tt wchar\_t *} instead of {\tt char *}. For example, a function to get the
 | |
| length of a wide-character string is called {\tt wcslen()} (compare with 
 | |
| {\tt strlen()} - you see that the only difference is that the "str" prefix
 | |
| standing for "string" has been replaced with "wcs" standing for "wide-character
 | |
| string").
 | |
| 
 | |
| And finally, the standard preprocessor tokens enumerated above expand to ANSI
 | |
| strings but it is more likely that Unicode strings are wanted in the Unicode
 | |
| build. wxWidgets provides the macros {\tt \_\_TFILE\_\_}, {\tt \_\_TDATE\_\_} 
 | |
| and {\tt \_\_TTIME\_\_} which behave exactly as the standard ones except that
 | |
| they produce ANSI strings in ANSI build and Unicode ones in the Unicode build.
 | |
| 
 | |
| To summarize, here is a brief example of how a program which can be compiled
 | |
| in both ANSI and Unicode modes could look like:
 | |
| 
 | |
| \begin{verbatim}
 | |
| #ifdef __UNICODE__
 | |
|     wchar_t wch = L'*';
 | |
|     const wchar_t *ws = L"Hello, world!";
 | |
|     int len = wcslen(ws);
 | |
| 
 | |
|     wprintf(L"Compiled at %s\n", __TDATE__);
 | |
| #else // ANSI
 | |
|     char ch = '*';
 | |
|     const char *s = "Hello, world!";
 | |
|     int len = strlen(s);
 | |
| 
 | |
|     printf("Compiled at %s\n", __DATE__);
 | |
| #endif // Unicode/ANSI
 | |
| \end{verbatim}
 | |
| 
 | |
| Of course, it would be nearly impossibly to write such programs if it had to
 | |
| be done this way (try to imagine the number of {\tt \#ifdef UNICODE} an average
 | |
| program would have had!). Luckily, there is another way - see the next
 | |
| section.
 | |
| 
 | |
| \subsection{Unicode support in wxWidgets}\label{unicodeinsidewxw}
 | |
| 
 | |
| In wxWidgets, the code fragment from above should be written instead:
 | |
| 
 | |
| \begin{verbatim}
 | |
|     wxChar ch = wxT('*');
 | |
|     wxString s = wxT("Hello, world!");
 | |
|     int len = s.Len();
 | |
| \end{verbatim}
 | |
| 
 | |
| What happens here? First of all, you see that there are no more {\tt \#ifdef}s
 | |
| at all. Instead, we define some types and macros which behave differently in
 | |
| the Unicode and ANSI builds and allow us to avoid using conditional
 | |
| compilation in the program itself.
 | |
| 
 | |
| We have a {\tt wxChar} type which maps either on {\tt char} or {\tt wchar\_t} 
 | |
| depending on the mode in which program is being compiled. There is no need for
 | |
| a separate type for strings though, because the standard 
 | |
| \helpref{wxString}{wxstring} supports Unicode, i.e. it stores either ANSI or
 | |
| Unicode strings depending on the compile mode.
 | |
| 
 | |
| Finally, there is a special \helpref{wxT()}{wxt} macro which should enclose all
 | |
| literal strings in the program. As it is easy to see comparing the last
 | |
| fragment with the one above, this macro expands to nothing in the (usual) ANSI
 | |
| mode and prefixes {\tt 'L'} to its argument in the Unicode mode.
 | |
| 
 | |
| The important conclusion is that if you use {\tt wxChar} instead of 
 | |
| {\tt char}, avoid using C style strings and use {\tt wxString} instead and
 | |
| don't forget to enclose all string literals inside \helpref{wxT()}{wxt} macro, your
 | |
| program automatically becomes (almost) Unicode compliant!
 | |
| 
 | |
| Just let us state once again the rules:
 | |
| 
 | |
| \begin{itemize}
 | |
| \item Always use {\tt wxChar} instead of {\tt char}
 | |
| \item Always enclose literal string constants in \helpref{wxT()}{wxt} macro
 | |
| unless they're already converted to the right representation (another standard
 | |
| wxWidgets macro \helpref{\_()}{underscore} does it, for example, so there is no
 | |
| need for {\tt wxT()} in this case) or you intend to pass the constant directly
 | |
| to an external function which doesn't accept wide-character strings.
 | |
| \item Use {\tt wxString} instead of C style strings.
 | |
| \end{itemize}
 | |
| 
 | |
| \subsection{Unicode and the outside world}\label{unicodeoutsidewxw}
 | |
| 
 | |
| We have seen that it was easy to write Unicode programs using wxWidgets types
 | |
| and macros, but it has been also mentioned that it isn't quite enough.
 | |
| Although everything works fine inside the program, things can get nasty when
 | |
| it tries to communicate with the outside world which, sadly, often expects
 | |
| ANSI strings (a notable exception is the entire Win32 API which accepts either
 | |
| Unicode or ANSI strings and which thus makes it unnecessary to ever perform
 | |
| any conversions in the program). GTK 2.0 only accepts UTF-8 strings.
 | |
| 
 | |
| To get an ANSI string from a wxString, you may use the 
 | |
| mb\_str() function which always returns an ANSI
 | |
| string (independently of the mode - while the usual 
 | |
| \helpref{c\_str()}{wxstringcstr} returns a pointer to the internal
 | |
| representation which is either ASCII or Unicode). More rarely used, but still
 | |
| useful, is wc\_str() function which always returns
 | |
| the Unicode string.
 | |
| 
 | |
| Sometimes it is also necessary to go from ANSI strings to wxStrings.  
 | |
| In this case, you can use the converter-constructor, as follows:
 | |
|  
 | |
| \begin{verbatim}
 | |
|    const char* ascii_str = "Some text";
 | |
|    wxString str(ascii_str, wxConvUTF8);
 | |
| \end{verbatim}
 | |
| 
 | |
| This code also compiles fine under a non-Unicode build of wxWidgets,
 | |
| but in that case the converter is ignored.
 | |
| 
 | |
| For more information about converters and Unicode see
 | |
| the \helpref{wxMBConv classes overview}{mbconvclasses}.
 | |
| 
 | |
| % TODO describe fn_str(), wx_str(), wxCharBuf classes, ...
 | |
| 
 | |
| \subsection{Unicode-related compilation settings}\label{unicodesettings}
 | |
| 
 | |
| You should define {\tt wxUSE\_UNICODE} to $1$ to compile your program in
 | |
| Unicode mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you
 | |
| compile your program in ANSI mode you can still define {\tt wxUSE\_WCHAR\_T} 
 | |
| to get some limited support for {\tt wchar\_t} type.
 | |
| 
 | |
| This will allow your program to perform conversions between Unicode strings and
 | |
| ANSI ones (using \helpref{wxMBConv classes}{mbconvclasses}) 
 | |
| and construct wxString objects from Unicode strings (presumably read
 | |
| from some external file or elsewhere).
 | |
| 
 |