diff --git a/docs/doxygen/overviews/string.h b/docs/doxygen/overviews/string.h
index e493eaf9dc..27adb0136f 100644
--- a/docs/doxygen/overviews/string.h
+++ b/docs/doxygen/overviews/string.h
@@ -30,22 +30,23 @@ in previous versions.
 
 @section overview_string_internal Internal wxString Encoding
 
-Since wxWidgets 3.0 wxString internally uses UTF-16 (with Unicode
-code units stored in @c wchar_t) under Windows and UTF-8 (with Unicode
-code units stored in @c char) under Unix, Linux and Mac OS X to store its content.
+Since wxWidgets 3.0 wxString may use any of @c UTF-16 (under Windows, using
+the native 16-bit @c wchar_t), @c UTF-32 (under Unix, using the native
+32-bit @c wchar_t) or @c UTF-8 (under both Windows and Unix) to store its
+content. By default, @c wchar_t is used under all platforms, but wxWidgets can
+be compiled with wxUSE_UNICODE_UTF8=1 to use UTF-8.
 
-For definitions of code units and code points terms, please
-see the @ref overview_unicode_encodings paragraph.
-
-For simplicity of implementation, wxString when wxUSE_UNICODE_WCHAR==1
-(e.g. on Windows) uses per code unit indexing instead of
-per code point indexing and doesn't know anything about surrogate pairs;
-in other words it always considers code points to be composed by 1 code unit,
-while this is really true only for characters in the @e BMP (Basic Multilingual Plane).
-Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user
-code has to take care of surrogate pairs himself.
-(Note however that Windows itself has built-in support for surrogate pairs in UTF-16,
-such as for drawing strings on screen.)
+For simplicity of implementation, wxString uses per code unit indexing
+instead of per code point indexing when using UTF-16, i.e. in the
+default wxUSE_UNICODE_WCHAR==1 build under Windows, and doesn't know
+anything about surrogate pairs. In other words it always considers code points
+to be composed of 1 code unit, while this is really true only for characters in
+the @e BMP (Basic Multilingual Plane), as explained in more detail in the @ref
+overview_unicode_encodings section. Thus when iterating over a UTF-16 string
+stored in a wxString under Windows, the user code has to take care of
+surrogate pairs itself. (Note however that Windows itself has
+built-in support for surrogate pairs in UTF-16, such as for drawing strings on
+screen.)
 
 @remarks
 Note that while the behaviour of wxString when wxUSE_UNICODE_WCHAR==1
@@ -54,10 +55,10 @@ UCS-2 encoded since you can encode code points outside the @e BMP in a wxString
 as two code units (i.e. as a surrogate pair; as already mentioned however wxString
 will "see" them as two different code points)
 
-When instead wxUSE_UNICODE_UTF8==1 (e.g. on Linux and Mac OS X)
-wxString handles UTF8 multi-bytes sequences just fine also for characters outside
-the BMP (it implements per code point indexing), so that you can use
-UTF8 in a completely transparent way:
+In the wxUSE_UNICODE_UTF8==1 case, wxString handles UTF-8 multibyte
+sequences just fine also for characters outside the BMP (it implements per
+code point indexing), so that you can use UTF-8 in a completely transparent
+way:
 
 Example:
 @code
@@ -361,17 +362,18 @@ difference the change to @c EXTRA_ALLOC makes to your program.
 
 @section overview_string_settings wxString Related Compilation Settings
 
-Much work has been done to make existing code using ANSI string literals
-work as before version 3.0.
+The main option affecting wxString is @c wxUSE_UNICODE which is now always
+defined as @c 1 by default to indicate Unicode support. You may set it to 0 to
+disable Unicode support in wxString and elsewhere in wxWidgets but this is @e
+strongly discouraged.
 
-If you nonetheless need to have a wxString that uses @c wchar_t
-on Unix and Linux, too, you can specify this on the command line with the
-@c configure @c --disable-utf8 switch or you can consider using wxUString
-or @c std::wstring instead.
+Another option affecting wxWidgets is @c wxUSE_UNICODE_WCHAR which is also 1 by
+default. You may want to set it to 0 and set @c wxUSE_UNICODE_UTF8 to 1 instead
+to use UTF-8 internally. wxString still provides the same API in this case, but
+using UTF-8 has performance implications as explained in @ref
+overview_unicode_performance, so it probably shouldn't be enabled for legacy
+code which might contain a lot of index-using loops.
 
-@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
-If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
-also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
-See also @ref page_wxusedef_important.
+See also @ref page_wxusedef_important for a few other options affecting wxString.
 
 */
diff --git a/docs/doxygen/overviews/unicode.h b/docs/doxygen/overviews/unicode.h
index 2c2ff51031..cfec156ede 100644
--- a/docs/doxygen/overviews/unicode.h
+++ b/docs/doxygen/overviews/unicode.h
@@ -58,7 +58,7 @@ Note that typically one character is assigned exactly one code point, but there
 are exceptions; the so-called precomposed characters
 (see http://en.wikipedia.org/wiki/Precomposed_character) or the ligatures.
 In these cases a single "character" may be mapped to more than one code point or
-viceversa more characters may be mapped to a single code point.
+vice versa, more than one character may be mapped to a single code point.
 
 The Unicode standard divides the space of all possible code points in planes;
 a plane is a range of 65,536 (1000016) contiguous Unicode code points.
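
The surrogate-pair caveat described in the string.h hunks (per code unit indexing in UTF-16 builds) can be illustrated without wxWidgets. The following is only a minimal sketch: it assumes a plain `std::u16string` as a stand-in for a wxString in a `wxUSE_UNICODE_WCHAR==1` build under Windows, and `countCodePoints` is a hypothetical helper, not part of the wxString API.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of wxWidgets): counts Unicode code points in
// a UTF-16 string by treating a high surrogate immediately followed by a low
// surrogate as a single code point. Unpaired surrogates count as one each.
std::size_t countCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t unit = s[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < s.size())
        {
            const char16_t next = s[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF)
                ++i; // skip the low surrogate: the pair is one code point
        }
        ++count;
    }
    return count;
}
```

For the string `u"a\U0001D11Eb"`, whose middle character (U+1D11E) lies outside the BMP, `size()` reports 4 code units while `countCodePoints` returns 3. This is the gap that user code has to bridge when indexing a UTF-16 string per code unit.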