Update Unicode overview to 3.0
git-svn-id: https://svn.wxwidgets.org/svn/wx/wxWidgets/trunk@53092 c3d73ce0-8a6f-49c7-b76d-6d57e0e08775
@@ -19,8 +19,6 @@ characters from languages other than English.
|
|||||||
@li @ref overview_unicode_supportin
|
@li @ref overview_unicode_supportin
|
||||||
@li @ref overview_unicode_supportout
|
@li @ref overview_unicode_supportout
|
||||||
@li @ref overview_unicode_settings
|
@li @ref overview_unicode_settings
|
||||||
@li @ref overview_unicode_traps
|
|
||||||
|
|
||||||
|
|
||||||
<hr>
|
<hr>
|
||||||
|
|
||||||
@@ -33,127 +31,101 @@ shortcomings of the previous, 8 bit standards, by using at least 16 (and
|
|||||||
possibly 32) bits for encoding each character. This allows to have at least
|
possibly 32) bits for encoding each character. This allows to have at least
|
||||||
65536 characters (what is called the BMP, or basic multilingual plane) and
|
65536 characters (what is called the BMP, or basic multilingual plane) and
|
||||||
possible 2^32 of them instead of the usual 256 and is sufficient to encode all
|
possible 2^32 of them instead of the usual 256 and is sufficient to encode all
|
||||||
of the world languages at once. More details about Unicode may be found at
|
of the world languages at once. A different approach is to encode all
|
||||||
<http://www.unicode.org/>.
|
strings in UTF8 which does not require the use of wide characters and
|
||||||
|
additionally is backwards compatible with 7-bit ASCII. The solution to
|
||||||
|
use UTF8 is preferred under Linux and partially OS X.
|
||||||
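The ASCII compatibility mentioned above follows directly from how UTF-8 lays out its bytes. The following is a minimal sketch in standard C++, not wxWidgets code: the helper names `AppendUtf8` and `EncodeUtf8` are hypothetical, and surrogate code points are not rejected for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of wxWidgets): append the UTF-8 encoding
// of a single Unicode code point to `out`.
void AppendUtf8(std::string& out, std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);                  // highest valid code point
    if (cp < 0x80) {                         // 7-bit ASCII: one unchanged byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                 // two bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {               // three bytes (rest of the BMP)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                 // four bytes (beyond the BMP)
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convenience wrapper returning the encoding of one code point.
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string s;
    AppendUtf8(s, cp);
    return s;
}
```

With this sketch, `EncodeUtf8('A')` yields the single byte `A`, while U+20AC (the euro sign) takes three bytes. The variable length is what makes UTF-8 compact for ASCII-heavy text.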
|
|
||||||
As this solution is obviously preferable to the previous ones (think of
|
More details about Unicode may be found at <http://www.unicode.org/>.
|
||||||
incompatible encodings for the same language, locale chaos and so on), many
|
|
||||||
modern operating systems support it. The probably first example is Windows NT
|
|
||||||
which uses only Unicode internally since its very first version.
|
|
||||||
|
|
||||||
Writing internationalized programs is much easier with Unicode and, as the
|
|
||||||
support for it improves, it should become more and more so. Moreover, in the
|
|
||||||
Windows NT/2000 case, even the program which uses only standard ASCII can
|
|
||||||
profit from using Unicode because they will work more efficiently - there will
|
|
||||||
be no need for the system to convert all strings the program uses to/from
|
|
||||||
Unicode each time a system call is made.
|
|
||||||
|
|
||||||
|
Writing internationalized programs is much easier with Unicode. Moreover,
|
||||||
|
even a program which uses only standard ASCII can benefit from using Unicode
|
||||||
|
for string representation because there will be no need to convert all
|
||||||
|
strings the program uses to/from Unicode each time a system call is made.
|
||||||
|
|
||||||
@section overview_unicode_ansi Unicode and ANSI Modes
|
@section overview_unicode_ansi Unicode and ANSI Modes
|
||||||
|
|
||||||
As not all platforms supported by wxWidgets support Unicode (fully) yet, in
|
Until wxWidgets 3.0 it was possible to compile the library both in
|
||||||
many cases it is unwise to write a program which can only work in Unicode
|
ANSI (=8-bit) mode as well as in wide char mode (16-bit per character
|
||||||
environment. A better solution is to write programs in such way that they may
|
on Windows and 32-bit on most Unix versions, Linux and OS X). This
|
||||||
be compiled either in ANSI (traditional) mode or in the Unicode one.
|
has been changed in wxWidgets 3.0 with the removal of the ANSI mode.
|
||||||
|
|
||||||
This can be achieved quite simply by using the means provided by wxWidgets.
|
|
||||||
Basically, there are only a few things to watch out for:
|
|
||||||
|
|
||||||
- Character type (@c char or @c wchar_t)
|
|
||||||
- Literal strings (i.e. @c "Hello, world!" or @c '*')
|
|
||||||
- String functions (@c strlen(), @c strcpy(), ...)
|
|
||||||
- Special preprocessor tokens (@c __FILE__, @c __DATE__ and @c __TIME__)
|
|
||||||
|
|
||||||
Let's look at them in order. First of all, each character in an Unicode program
|
|
||||||
takes 2 bytes instead of usual one, so another type should be used to store the
|
|
||||||
characters (@c char only holds 1 byte usually). This type is called @c wchar_t
|
|
||||||
which stands for @e wide-character type.
|
|
||||||
|
|
||||||
Also, the string and character constants should be encoded using wide
|
|
||||||
characters (@c wchar_t type) which typically take 2 or 4 bytes instead of
|
|
||||||
@c char which only takes one. This is achieved by using the standard C (and
|
|
||||||
C++) way: just put the letter @c 'L' after any string constant and it becomes a
|
|
||||||
@e long constant, i.e. a wide character one. To make things a bit more
|
|
||||||
readable, you are also allowed to prefix the constant with @c 'L' instead of
|
|
||||||
putting it after it.
|
|
||||||
|
|
||||||
Of course, the usual standard C functions don't work with @c wchar_t strings,
|
|
||||||
so another set of functions exists which do the same thing but accept
|
|
||||||
@c wchar_t* instead of @c char*. For example, a function to get the length of a
|
|
||||||
wide-character string is called @c wcslen() (compare with @c strlen() - you see
|
|
||||||
that the only difference is that the "str" prefix standing for "string" has
|
|
||||||
been replaced with "wcs" standing for "wide-character string").
|
|
||||||
|
|
||||||
And finally, the standard preprocessor tokens enumerated above expand to ANSI
|
|
||||||
strings but it is more likely that Unicode strings are wanted in the Unicode
|
|
||||||
build. wxWidgets provides the macros @c __TFILE__, @c __TDATE__ and
|
|
||||||
@c __TTIME__ which behave exactly as the standard ones except that they produce
|
|
||||||
ANSI strings in ANSI build and Unicode ones in the Unicode build.
|
|
||||||
|
|
||||||
To summarize, here is a brief example of how a program which can be compiled
|
|
||||||
in both ANSI and Unicode modes could look like:
|
|
||||||
|
|
||||||
@code
|
|
||||||
#ifdef __UNICODE__
|
|
||||||
wchar_t wch = L'*';
|
|
||||||
const wchar_t *ws = L"Hello, world!";
|
|
||||||
int len = wcslen(ws);
|
|
||||||
|
|
||||||
wprintf(L"Compiled at %s\n", __TDATE__);
|
|
||||||
#else // ANSI
|
|
||||||
char ch = '*';
|
|
||||||
const char *s = "Hello, world!";
|
|
||||||
int len = strlen(s);
|
|
||||||
|
|
||||||
printf("Compiled at %s\n", __DATE__);
|
|
||||||
#endif // Unicode/ANSI
|
|
||||||
@endcode
|
|
||||||
|
|
||||||
Of course, it would be nearly impossible to write such programs if it had to
|
|
||||||
be done this way (try to imagine the number of UNICODE checks an average
|
|
||||||
program would have had!). Luckily, there is another way - see the next section.
|
|
||||||
|
|
||||||
|
|
||||||
@section overview_unicode_supportin Unicode Support in wxWidgets
|
@section overview_unicode_supportin Unicode Support in wxWidgets
|
||||||
|
|
||||||
In wxWidgets, the code fragment from above should be written instead:
|
Since wxWidgets 3.0, Unicode support is always enabled, meaning
|
||||||
|
that the wxString class always uses Unicode to encode its content.
|
||||||
|
Under Windows wxString uses the standard Windows encoding UCS-2
|
||||||
|
(basically an array of 16-bit wchar_t). Under Unix and OS X however,
|
||||||
|
wxString uses UTF8 to encode its content.
|
||||||
|
|
||||||
|
For the programmer, the biggest change is that iterating over
|
||||||
|
a string can be slower than before since wxString has to parse
|
||||||
|
the entire string in order to find the n-th character in a
|
||||||
|
string, meaning that iterating over a string should no longer
|
||||||
|
be done by index but using iterators. Old code will still work
|
||||||
|
but might be less efficient.
|
||||||
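Why indexing is slower with a UTF-8 representation can be sketched in plain standard C++. This illustrates the general technique only and is not how wxString is implemented; `IsContinuation`, `OffsetOfChar` and `CountChars` are hypothetical names, and well-formed UTF-8 input is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool IsContinuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Byte offset of the n-th character (0-based) in a UTF-8 string.
// Characters have variable width, so the scan must start at the
// beginning every time: the cost grows linearly with n.
// Returns s.size() if the string has fewer than n + 1 characters.
std::size_t OffsetOfChar(const std::string& s, std::size_t n)
{
    std::size_t pos = 0;
    while (pos < s.size()) {
        if (n == 0)
            return pos;
        ++pos;                                   // skip the lead byte
        while (pos < s.size() &&
               IsContinuation(static_cast<unsigned char>(s[pos])))
            ++pos;                               // skip continuation bytes
        --n;
    }
    return s.size();
}

// Visiting every character in one pass only needs a single scan,
// which is the access pattern an iterator gives you.
std::size_t CountChars(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char b : s)
        if (!IsContinuation(b))
            ++count;
    return count;
}
```

A loop that calls `OffsetOfChar(s, i)` for every `i` rescans the prefix each time and ends up quadratic in the string length, whereas a single forward pass stays linear.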
|
|
||||||
|
Old code like this:
|
||||||
|
|
||||||
@code
|
@code
|
||||||
wxChar ch = wxT('*');
|
wxString s = wxT("hello");
|
||||||
wxString s = wxT("Hello, world!");
|
size_t i;
|
||||||
|
for (i = 0; i < s.Len(); i++)
|
||||||
|
{
|
||||||
|
wxChar ch = s[i];
|
||||||
|
|
||||||
|
// do something with it
|
||||||
|
}
|
||||||
|
@endcode
|
||||||
|
|
||||||
|
should be replaced (especially in time critical places) with:
|
||||||
|
|
||||||
|
@code
|
||||||
|
wxString s = "hello";
|
||||||
|
wxString::iterator i;
|
||||||
|
for (i = s.begin(); i != s.end(); ++i)
|
||||||
|
{
|
||||||
|
wxUniChar uni_ch = *i;
|
||||||
|
wxChar ch = uni_ch;
|
||||||
|
// same as: wxChar ch = *i
|
||||||
|
|
||||||
|
// do something with it
|
||||||
|
}
|
||||||
|
@endcode
|
||||||
|
|
||||||
|
If you want to replace individual characters in the string you
|
||||||
|
need to get a reference to that character:
|
||||||
|
|
||||||
|
@code
|
||||||
|
wxString s = "hello";
|
||||||
|
wxString::iterator i;
|
||||||
|
for (i = s.begin(); i != s.end(); ++i)
|
||||||
|
{
|
||||||
|
wxUniCharRef ch = *i;
|
||||||
|
ch = 'a';
|
||||||
|
// same as: *i = 'a';
|
||||||
|
}
|
||||||
|
@endcode
|
||||||
|
|
||||||
|
which will change the content of the wxString s from "hello" to "aaaaa".
|
||||||
|
|
||||||
|
String literals are translated to Unicode when they are assigned to
|
||||||
|
a wxString object so code can be written like this:
|
||||||
|
|
||||||
|
@code
|
||||||
|
wxString s = "Hello, world!";
|
||||||
int len = s.Len();
|
int len = s.Len();
|
||||||
@endcode
|
@endcode
|
||||||
|
|
||||||
What happens here? First of all, you see that there are no more UNICODE checks
|
wxWidgets provides wrappers around most POSIX C functions (like printf())
|
||||||
at all. Instead, we define some types and macros which behave differently in
|
and the syntax has been adapted to support input with wxString, normal
|
||||||
the Unicode and ANSI builds and allow us to avoid using conditional compilation
|
C-style strings and wchar_t strings:
|
||||||
in the program itself.
|
|
||||||
|
|
||||||
We have a @c wxChar type which maps either on @c char or @c wchar_t depending
|
|
||||||
on the mode in which program is being compiled. There is no need for a separate
|
|
||||||
type for strings though, because the standard wxString supports Unicode, i.e.
|
|
||||||
it stores either ANSI or Unicode strings depending on the compile mode.
|
|
||||||
|
|
||||||
Finally, there is a special wxT() macro which should enclose all literal
|
|
||||||
strings in the program. As it is easy to see comparing the last fragment with
|
|
||||||
the one above, this macro expands to nothing in the (usual) ANSI mode and
|
|
||||||
prefixes @c 'L' to its argument in the Unicode mode.
|
|
||||||
|
|
||||||
The important conclusion is that if you use @c wxChar instead of @c char, avoid
|
|
||||||
using C style strings and use @c wxString instead and don't forget to enclose
|
|
||||||
all string literals inside wxT() macro, your program automatically becomes
|
|
||||||
(almost) Unicode compliant!
|
|
||||||
|
|
||||||
Just let us state once again the rules:
|
|
||||||
|
|
||||||
@li Always use wxChar instead of @c char
|
|
||||||
@li Always enclose literal string constants in wxT() macro unless they're
|
|
||||||
already converted to the right representation (another standard wxWidgets
|
|
||||||
macro _() does it, for example, so there is no need for wxT() in this case)
|
|
||||||
or you intend to pass the constant directly to an external function which
|
|
||||||
doesn't accept wide-character strings.
|
|
||||||
@li Use wxString instead of C style strings.
|
|
||||||
|
|
||||||
|
@code
|
||||||
|
wxString s;
|
||||||
|
s.Printf( "%s %s %s", "hello1", L"hello2", wxString("hello3") );
|
||||||
|
wxPrintf( "Three times hello %s\n", s );
|
||||||
|
@endcode
|
||||||
|
|
||||||
@section overview_unicode_supportout Unicode and the Outside World
|
@section overview_unicode_supportout Unicode and the Outside World
|
||||||
|
|
||||||
@@ -179,29 +151,14 @@ const char* ascii_str = "Some text";
|
|||||||
wxString str(ascii_str, wxConvUTF8);
|
wxString str(ascii_str, wxConvUTF8);
|
||||||
@endcode
|
@endcode
|
||||||
|
|
||||||
This code also compiles fine under a non-Unicode build of wxWidgets, but in
|
|
||||||
that case the converter is ignored.
|
|
||||||
|
|
||||||
For more information about converters and Unicode see the @ref overview_mbconv.
|
For more information about converters and Unicode see the @ref overview_mbconv.
|
||||||
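Conceptually, a UTF-8 converter such as the one passed in the snippet above has to turn a byte sequence into Unicode code points. The sketch below shows that decoding step in standard C++. It is not the wxMBConv implementation: `DecodeUtf8` is a hypothetical name, and the input is assumed to be well-formed UTF-8 (a real converter must also reject invalid sequences).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not wxWidgets API): decode well-formed UTF-8
// into a sequence of 32-bit code points.
std::u32string DecodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;                       // number of continuation bytes
        if (lead < 0x80)      { cp = lead;        extra = 0; }  // ASCII
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }

        assert(i + extra < in.size());           // sequence must not be truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out += cp;
        i += extra + 1;
    }
    return out;
}
```

For pure ASCII input such as `"Some text"` the result has one code point per byte, which is why ASCII literals pass through a UTF-8 converter unchanged.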
|
|
||||||
|
|
||||||
@section overview_unicode_settings Unicode Related Compilation Settings
|
@section overview_unicode_settings Unicode Related Compilation Settings
|
||||||
|
|
||||||
You should define @c wxUSE_UNICODE to 1 to compile your program in Unicode
|
You should define @c wxUSE_UNICODE to 1 to compile your program in Unicode
|
||||||
mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you compile
|
mode. Since wxWidgets 3.0 this is always the case. When compiled in UTF8
|
||||||
your program in ANSI mode you can still define @c wxUSE_WCHAR_T to get some
|
mode @c wxUSE_UNICODE_UTF8 is also defined.
|
||||||
limited support for @c wchar_t type.
|
|
||||||
|
|
||||||
This will allow your program to perform conversions between Unicode strings and
|
|
||||||
ANSI ones (using @ref overview_mbconv "wxMBConv") and construct wxString
|
|
||||||
objects from Unicode strings (presumably read from some external file or
|
|
||||||
elsewhere).
|
|
||||||
|
|
||||||
|
|
||||||
@section overview_unicode_traps Traps for the Unwary
|
|
||||||
|
|
||||||
@li Casting c_str() to void* is now char*, not wxChar*
|
|
||||||
@li Passing c_str(), mb_str() or wc_str() to variadic functions doesn't work.
|
|
||||||
|
|
||||||
*/
|
*/
|
||||||
|
|
||||||
|