Update wxString Unicode documentation to reflect the default wchar_t use.
And other minor fixes to Unicode-related documentation.

git-svn-id: https://svn.wxwidgets.org/svn/wx/wxWidgets/trunk@75031 c3d73ce0-8a6f-49c7-b76d-6d57e0e08775
@@ -30,22 +30,23 @@ in previous versions.
|
|||||||
|
|
||||||
@section overview_string_internal Internal wxString Encoding
|
@section overview_string_internal Internal wxString Encoding
|
||||||
|
|
||||||
Since wxWidgets 3.0 wxString internally uses <b>UTF-16</b> (with Unicode
|
Since wxWidgets 3.0 wxString may use any of @c UTF-16 (under Windows, using
|
||||||
code units stored in @c wchar_t) under Windows and <b>UTF-8</b> (with Unicode
|
the native 16 bit @c wchar_t), @c UTF-32 (under Unix, using the native 32
|
||||||
code units stored in @c char) under Unix, Linux and Mac OS X to store its content.
|
bit @c wchar_t) or @c UTF-8 (under both Windows and Unix) to store its
|
||||||
|
content. By default, @c wchar_t is used under all platforms, but wxWidgets can
|
||||||
|
be compiled with <tt>wxUSE_UNICODE_UTF8=1</tt> to use UTF-8.
|
||||||
|
|
||||||
For definitions of <em>code units</em> and <em>code points</em> terms, please
|
For simplicity of implementation, wxString uses <em>per code unit indexing</em>
|
||||||
see the @ref overview_unicode_encodings paragraph.
|
instead of <em>per code point indexing</em> when using UTF-16, i.e. in the
|
||||||
|
default <tt>wxUSE_UNICODE_WCHAR==1</tt> build under Windows and doesn't know
|
||||||
For simplicity of implementation, wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
|
anything about surrogate pairs. In other words it always considers code points
|
||||||
(e.g. on Windows) uses <em>per code unit indexing</em> instead of
|
to be composed by 1 code unit, while this is really true only for characters in
|
||||||
<em>per code point indexing</em> and doesn't know anything about surrogate pairs;
|
the @e BMP (Basic Multilingual Plane), as explained in more details in the @ref
|
||||||
in other words it always considers code points to be composed by 1 code unit,
|
overview_unicode_encodings section. Thus when iterating over a UTF-16 string
|
||||||
while this is really true only for characters in the @e BMP (Basic Multilingual Plane).
|
stored in a wxString under Windows, the user code has to take care of
|
||||||
Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user
|
<em>surrogate pairs</em> himself. (Note however that Windows itself has
|
||||||
code has to take care of <em>surrogate pairs</em> himself.
|
built-in support for surrogate pairs in UTF-16, such as for drawing strings on
|
||||||
(Note however that Windows itself has built-in support for surrogate pairs in UTF-16,
|
screen.)
|
||||||
such as for drawing strings on screen.)
|
|
||||||
|
|
||||||
@remarks
|
@remarks
|
||||||
Note that while the behaviour of wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
|
Note that while the behaviour of wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
|
||||||
@@ -54,10 +55,10 @@ UCS-2 encoded since you can encode code points outside the @e BMP in a wxString
|
|||||||
as two code units (i.e. as a surrogate pair; as already mentioned however wxString
|
as two code units (i.e. as a surrogate pair; as already mentioned however wxString
|
||||||
will "see" them as two different code points)
|
will "see" them as two different code points)
|
||||||
|
|
||||||
When instead <tt>wxUSE_UNICODE_UTF8==1</tt> (e.g. on Linux and Mac OS X)
|
In <tt>wxUSE_UNICODE_UTF8==1</tt> case, wxString handles UTF-8 multi-bytes
|
||||||
wxString handles UTF8 multi-bytes sequences just fine also for characters outside
|
sequences just fine also for characters outside the BMP (it implements <em>per
|
||||||
the BMP (it implements <em>per code point indexing</em>), so that you can use
|
code point indexing</em>), so that you can use UTF-8 in a completely transparent
|
||||||
UTF8 in a completely transparent way:
|
way:
|
||||||
|
|
||||||
Example:
|
Example:
|
||||||
@code
|
@code
|
||||||
@@ -361,17 +362,18 @@ difference the change to @c EXTRA_ALLOC makes to your program.
|
|||||||
|
|
||||||
@section overview_string_settings wxString Related Compilation Settings
|
@section overview_string_settings wxString Related Compilation Settings
|
||||||
|
|
||||||
Much work has been done to make existing code using ANSI string literals
|
The main option affecting wxString is @c wxUSE_UNICODE which is now always
|
||||||
work as before version 3.0.
|
defined as @c 1 by default to indicate Unicode support. You may set it to 0 to
|
||||||
|
disable Unicode support in wxString and elsewhere in wxWidgets but this is @e
|
||||||
|
strongly not recommended.
|
||||||
|
|
||||||
If you nonetheless need to have a wxString that uses @c wchar_t
|
Another option affecting wxWidgets is @c wxUSE_UNICODE_WCHAR which is also 1 by
|
||||||
on Unix and Linux, too, you can specify this on the command line with the
|
default. You may want to set it to 0 and set @c wxUSE_UNICODE_UTF8 to 1 instead
|
||||||
@c configure @c --disable-utf8 switch or you can consider using wxUString
|
to use UTF-8 internally. wxString still provides the same API in this case, but
|
||||||
or @c std::wstring instead.
|
using UTF-8 has performance implications as explained in @ref
|
||||||
|
overview_unicode_performance, so it probably shouldn't be enabled for legacy
|
||||||
|
code which might contain a lot of index-using loops.
|
||||||
|
|
||||||
@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
|
See also @ref page_wxusedef_important for a few other options affecting wxString.
|
||||||
If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
|
|
||||||
also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
|
|
||||||
See also @ref page_wxusedef_important.
|
|
||||||
|
|
||||||
*/
|
*/
|
||||||
|
@@ -58,7 +58,7 @@ Note that typically one character is assigned exactly one code point, but there
|
|||||||
are exceptions; the so-called <em>precomposed characters</em>
|
are exceptions; the so-called <em>precomposed characters</em>
|
||||||
(see http://en.wikipedia.org/wiki/Precomposed_character) or the <em>ligatures</em>.
|
(see http://en.wikipedia.org/wiki/Precomposed_character) or the <em>ligatures</em>.
|
||||||
In these cases a single "character" may be mapped to more than one code point or
|
In these cases a single "character" may be mapped to more than one code point or
|
||||||
viceversa more characters may be mapped to a single code point.
|
vice versa more than one character may be mapped to a single code point.
|
||||||
|
|
||||||
The Unicode standard divides the space of all possible code points in <b><em>planes</em></b>;
|
The Unicode standard divides the space of all possible code points in <b><em>planes</em></b>;
|
||||||
a plane is a range of 65,536 (10000<sub>16</sub>) contiguous Unicode code points.
|
a plane is a range of 65,536 (10000<sub>16</sub>) contiguous Unicode code points.
|
||||||
|
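The updated text says that in the default <tt>wxUSE_UNICODE_WCHAR==1</tt> build under Windows, wxString indexes per code unit and user code has to handle surrogate pairs itself. A minimal sketch of what that handling might look like follows. It uses only standard C++ (@c std::u16string) rather than wxString so that it stays self-contained. The helper name @c DecodeUtf16 is hypothetical and is not part of the wxWidgets API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a wxWidgets function): converts a sequence of
// UTF-16 code units into Unicode code points, combining surrogate pairs.
std::vector<char32_t> DecodeUtf16(const std::u16string& units)
{
    std::vector<char32_t> codePoints;
    for (std::size_t i = 0; i < units.size(); ++i)
    {
        const char32_t unit = units[i];

        // A high (leading) surrogate lies in the range D800..DBFF.
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < units.size())
        {
            const char32_t next = units[i + 1];

            // A low (trailing) surrogate lies in the range DC00..DFFF.
            if (next >= 0xDC00 && next <= 0xDFFF)
            {
                // Combine the pair into one code point outside the BMP.
                codePoints.push_back(
                    0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
                ++i; // the low surrogate has been consumed too
                continue;
            }
        }

        // BMP character, or an unpaired surrogate passed through unchanged.
        codePoints.push_back(unit);
    }
    return codePoints;
}
```

For example, the string <tt>u"a\U0001F600b"</tt> holds four UTF-16 code units but only three code points, which is the difference between per code unit and per code point indexing described in the documentation above.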