/////////////////////////////////////////////////////////////////////////////
// Name:        unicode.h
// Purpose:     topic overview
// Author:      wxWidgets team
// RCS-ID:      $Id$
// Licence:     wxWindows license
/////////////////////////////////////////////////////////////////////////////

/**

@page overview_unicode Unicode Support in wxWidgets

This section briefly describes the state of the Unicode support in wxWidgets.
Read it if you want to know more about how to write programs able to work with
characters from languages other than English.

@li @ref overview_unicode_what
@li @ref overview_unicode_ansi
@li @ref overview_unicode_supportin
@li @ref overview_unicode_supportout
@li @ref overview_unicode_settings
@li @ref overview_unicode_traps


<hr>


@section overview_unicode_what What is Unicode?

wxWidgets can be compiled in Unicode mode on the platforms which support it.
Unicode is a standard for character encoding which addresses the shortcomings
of the previous 8 bit standards by using at least 16 (and possibly 32) bits to
encode each character. With 16 bits, 65536 characters can be represented (this
range is called the BMP, or Basic Multilingual Plane); the full Unicode code
space extends to U+10FFFF, compared with the usual 256 characters of an 8 bit
encoding, and is sufficient to encode all of the world's languages at once.
More details about Unicode may be found at <http://www.unicode.org/>.

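To make the "16 or possibly 32 bits" distinction concrete, here is a minimal
sketch in plain C++ (it is not part of wxWidgets, and the function name
@c EncodeUtf16 is hypothetical) of how a code point can be stored in 16 bit
units: a character inside the BMP needs one unit, while a character above
U+FFFF needs a pair of surrogates.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not itself a surrogate).
std::vector<std::uint16_t> EncodeUtf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if ( cp < 0x10000 )
    {
        // Inside the BMP: a single 16 bit unit is enough.
        units.push_back(static_cast<std::uint16_t>(cp));
    }
    else
    {
        // Outside the BMP: split the remaining 20 bits over two surrogates.
        const std::uint32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
    }
    return units;
}
```

For example, U+20AC (the euro sign) produces a single unit, whereas U+1F600
produces the pair 0xD83D 0xDE00.
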
As this solution is obviously preferable to the previous ones (think of
incompatible encodings for the same language, locale chaos and so on), many
modern operating systems support it. Probably the first example is Windows NT,
which has used only Unicode internally since its very first version.

Writing internationalized programs is much easier with Unicode and, as the
support for it improves, it should become more and more so. Moreover, in the
Windows NT/2000 case, even programs which use only standard ASCII can profit
from using Unicode because they will work more efficiently: there will be no
need for the system to convert all strings the program uses to/from Unicode
each time a system call is made.


@section overview_unicode_ansi Unicode and ANSI Modes

As not all platforms supported by wxWidgets support Unicode (fully) yet, in
many cases it is unwise to write a program which can only work in a Unicode
environment. A better solution is to write programs in such a way that they may
be compiled either in ANSI (traditional) mode or in the Unicode one.

This can be achieved quite simply by using the means provided by wxWidgets.
Basically, there are only a few things to watch out for:

- Character type (@c char or @c wchar_t)
- Literal strings (e.g. @c "Hello, world!" or @c '*')
- String functions (@c strlen(), @c strcpy(), ...)
- Special preprocessor tokens (@c __FILE__, @c __DATE__ and @c __TIME__)

Let's look at them in order. First of all, each character in a Unicode program
typically takes 2 or 4 bytes instead of the usual one, so another type should
be used to store the characters (@c char usually holds only 1 byte). This type
is called @c wchar_t, which stands for @e wide-character type.

Also, the string and character constants should be encoded using wide
characters (@c wchar_t type) rather than @c char. This is achieved in the
standard C (and C++) way: just prefix any string or character constant with the
letter @c L and it becomes a @e long constant, i.e. a wide character one. Note
that the @c L must come immediately before the opening quote; it cannot be
placed after the constant.

Of course, the usual standard C functions don't work with @c wchar_t strings,
so another set of functions exists which do the same thing but accept
@c wchar_t* instead of @c char*. For example, the function returning the length
of a wide-character string is called @c wcslen() (compare with @c strlen(): the
only difference is that the "str" prefix standing for "string" has been
replaced with "wcs" standing for "wide-character string").

And finally, the standard preprocessor tokens enumerated above expand to ANSI
strings, but it is more likely that Unicode strings are wanted in the Unicode
build. wxWidgets provides the macros @c __TFILE__, @c __TDATE__ and
@c __TTIME__ which behave exactly like the standard ones except that they
produce ANSI strings in the ANSI build and Unicode ones in the Unicode build.

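As an illustration of how a wide-character version of such a token can be
obtained, here is a minimal sketch using the common two-level token pasting
idiom. The names @c MY_WIDEN, @c MY_WIDEN2, @c kWideDate and @c kWideFile are
hypothetical and this is not necessarily how wxWidgets defines its own macros.

```cpp
#include <cassert>
#include <cwchar>

// Hypothetical names: MY_WIDEN2 pastes the L prefix onto a string literal,
// MY_WIDEN adds one level of indirection so that a token such as __DATE__ is
// expanded to its string literal before the pasting takes place.
#define MY_WIDEN2(x) L ## x
#define MY_WIDEN(x)  MY_WIDEN2(x)

// Wide-character counterpart of __DATE__, which has the form "Mmm dd yyyy".
const wchar_t* const kWideDate = MY_WIDEN(__DATE__);

// Wide-character counterpart of __FILE__.
const wchar_t* const kWideFile = MY_WIDEN(__FILE__);
```

The extra macro level matters: pasting @c L directly onto @c __DATE__ would
produce the single identifier @c L__DATE__ instead of a wide string literal.
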
To summarize, here is a brief example of what a program which can be compiled
in both ANSI and Unicode modes could look like:

@code
#ifdef __UNICODE__
    wchar_t wch = L'*';
    const wchar_t *ws = L"Hello, world!";
    size_t len = wcslen(ws);

    // %ls is the standard conversion for a wide string argument
    wprintf(L"Compiled at %ls\n", __TDATE__);
#else // ANSI
    char ch = '*';
    const char *s = "Hello, world!";
    size_t len = strlen(s);

    printf("Compiled at %s\n", __DATE__);
#endif // Unicode/ANSI
@endcode

Of course, it would be nearly impossible to write such programs if it had to
be done this way (try to imagine the number of UNICODE checks an average
program would need!). Luckily, there is another way, described in the next
section.


@section overview_unicode_supportin Unicode Support in wxWidgets

In wxWidgets, the code fragment above should be written like this instead:

@code
wxChar ch = wxT('*');
wxString s = wxT("Hello, world!");
int len = s.Len();
@endcode

What happens here? First of all, you see that there are no more UNICODE checks
at all. Instead, we define some types and macros which behave differently in
the Unicode and ANSI builds and allow us to avoid using conditional compilation
in the program itself.

We have a @c wxChar type which maps to either @c char or @c wchar_t depending
on the mode in which the program is being compiled. There is no need for a
separate type for strings though, because the standard wxString supports
Unicode, i.e. it stores either ANSI or Unicode strings depending on the compile
mode.

Finally, there is a special wxT() macro which should enclose all literal
strings in the program. As is easy to see by comparing the last fragment with
the one above, this macro leaves its argument unchanged in the (usual) ANSI
mode and prefixes @c 'L' to its argument in the Unicode mode.

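The mechanism described above can be sketched in a few lines of plain C++.
All the @c MY_ and @c My names below are hypothetical stand-ins chosen for this
illustration; they are not the actual wxWidgets definitions, which are more
involved.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

// All MY_* and My* names are hypothetical, not the wxWidgets definitions.
// Define MY_UNICODE before this point to select the wide-character build.
#ifdef MY_UNICODE
    typedef wchar_t MyChar;
    #define MY_T(x) L ## x              // MY_T("abc") becomes L"abc"
    inline std::size_t MyStrlen(const MyChar* s) { return std::wcslen(s); }
#else
    typedef char MyChar;
    #define MY_T(x) x                   // MY_T("abc") stays "abc"
    inline std::size_t MyStrlen(const MyChar* s) { return std::strlen(s); }
#endif

// The same source line compiles in both builds without any #ifdef.
const MyChar* const kGreeting = MY_T("Hello, world!");
```

The conditional compilation is thus confined to one place, and the rest of the
program only ever mentions @c MyChar, @c MY_T() and @c MyStrlen().
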
The important conclusion is that if you use @c wxChar instead of @c char, use
@c wxString instead of C style strings and don't forget to enclose all string
literals inside the wxT() macro, your program automatically becomes (almost)
Unicode compliant!

To state the rules once again:

@li Always use wxChar instead of @c char
@li Always enclose literal string constants in the wxT() macro unless they're
    already converted to the right representation (another standard wxWidgets
    macro _() does it, for example, so there is no need for wxT() in this case)
    or you intend to pass the constant directly to an external function which
    doesn't accept wide-character strings.
@li Use wxString instead of C style strings.


@section overview_unicode_supportout Unicode and the Outside World

We have seen that it is easy to write Unicode programs using wxWidgets types
and macros, but it has also been mentioned that this isn't quite enough.
Although everything works fine inside the program, things can get nasty when it
tries to communicate with the outside world which, sadly, often expects ANSI
strings (a notable exception is the entire Win32 API, which accepts either
Unicode or ANSI strings and thus makes it unnecessary to ever perform any
conversions in the program). GTK 2.0 only accepts UTF-8 strings.

To get an ANSI string from a wxString, you may use the mb_str() function which
always returns an ANSI string (independently of the mode, whereas the usual
c_str() returns a pointer to the internal representation, which is either ANSI
or Unicode). More rarely used, but still useful, is the wc_str() function which
always returns the Unicode string.

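To give an idea of what such a conversion between the two representations
involves, here is a minimal sketch which uses only the standard C library
rather than wxWidgets. The helper names @c ToMultiByte and @c ToWide are
hypothetical, and the encoding used is whatever the current C locale
prescribes, which is an assumption of this sketch and not a statement about
how mb_str() or wc_str() are implemented.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: convert a wide string to a multibyte ("ANSI") string
// using the encoding of the current C locale.
std::string ToMultiByte(const std::wstring& ws)
{
    // MB_CUR_MAX is the maximum number of bytes per character in this locale.
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(buf.data(), ws.c_str(), buf.size());
    if ( n == static_cast<std::size_t>(-1) )
        return std::string();   // a character was not representable
    return std::string(buf.data(), n);
}

// Hypothetical helper: the reverse conversion, multibyte to wide.
std::wstring ToWide(const std::string& s)
{
    // A multibyte string never yields more wide characters than it has bytes.
    std::vector<wchar_t> buf(s.size() + 1);
    const std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if ( n == static_cast<std::size_t>(-1) )
        return std::wstring();  // invalid multibyte sequence
    return std::wstring(buf.data(), n);
}
```

Note that this sketch reports a failed conversion by returning an empty string,
which cannot be told apart from converting an empty input; a real program would
probably want a clearer error signal.
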
Sometimes it is also necessary to go from ANSI strings to wxStrings. In this
case, you can use the converter-constructor, as follows:

@code
const char* ascii_str = "Some text";
wxString str(ascii_str, wxConvUTF8);
@endcode

This code also compiles fine under a non-Unicode build of wxWidgets, but in
that case the converter is ignored.

For more information about converters and Unicode see the @ref overview_mbconv.


@section overview_unicode_settings Unicode Related Compilation Settings

You should define @c wxUSE_UNICODE to 1 to compile your program in Unicode
mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you compile
your program in ANSI mode you can still define @c wxUSE_WCHAR_T to get some
limited support for the @c wchar_t type.

This will allow your program to perform conversions between Unicode strings and
ANSI ones (using @ref overview_mbconv "wxMBConv") and construct wxString
objects from Unicode strings (presumably read from some external file or
elsewhere).


@section overview_unicode_traps Traps for the Unwary

@li Casting the result of c_str() to @c void* now yields a @c char*, not a
    @c wxChar*.
@li Passing the result of c_str(), mb_str() or wc_str() to variadic functions
    doesn't work.

*/