added an overview_string_binary section describing wxString's support for binary data; removed traces of UCS2 wording, which was not completely correct (see wx-dev thread 'string changes doubts and docs')
git-svn-id: https://svn.wxwidgets.org/svn/wx/wxWidgets/trunk@57204 c3d73ce0-8a6f-49c7-b76d-6d57e0e08775
Binary file not shown.
Binary file not shown.
Before Width: | Height: | Size: 16 KiB After Width: | Height: | Size: 15 KiB |
Binary file not shown.
Binary file not shown.
Before Width: | Height: | Size: 63 KiB After Width: | Height: | Size: 68 KiB |
@@ -14,6 +14,7 @@ Classes: wxString, wxArrayString, wxStringTokenizer
|
|||||||
|
|
||||||
@li @ref overview_string_intro
|
@li @ref overview_string_intro
|
||||||
@li @ref overview_string_internal
|
@li @ref overview_string_internal
|
||||||
|
@li @ref overview_string_binary
|
||||||
@li @ref overview_string_comparison
|
@li @ref overview_string_comparison
|
||||||
@li @ref overview_string_advice
|
@li @ref overview_string_advice
|
||||||
@li @ref overview_string_related
|
@li @ref overview_string_related
|
||||||
@@ -27,16 +28,12 @@ Classes: wxString, wxArrayString, wxStringTokenizer
|
|||||||
@section overview_string_intro Introduction
|
@section overview_string_intro Introduction
|
||||||
|
|
||||||
wxString is a class which represents a Unicode string of arbitrary length and
|
wxString is a class which represents a Unicode string of arbitrary length and
|
||||||
containing arbitrary characters.
|
containing arbitrary Unicode characters.
|
||||||
|
|
||||||
The @c NUL character is allowed, but be
|
|
||||||
aware that in the current string implementation some methods might not work
|
|
||||||
correctly in this case. @todo still true?
|
|
||||||
|
|
||||||
This class has all the standard operations you can expect to find in a string
|
This class has all the standard operations you can expect to find in a string
|
||||||
class: dynamic memory management (string extends to accommodate new
|
class: dynamic memory management (string extends to accommodate new
|
||||||
characters), construction from other strings, C strings, wide character C strings
|
characters), construction from other strings, compatibility with C strings and
|
||||||
and characters, assignment operators, access to individual characters, string
|
wide character C strings, assignment operators, access to individual characters, string
|
||||||
concatenation and comparison, substring extraction, case conversion, trimming and
|
concatenation and comparison, substring extraction, case conversion, trimming and
|
||||||
padding (with spaces), searching and replacing and both C-like @c printf (wxString::Printf)
|
padding (with spaces), searching and replacing and both C-like @c printf (wxString::Printf)
|
||||||
and stream-like insertion functions as well as much more - see wxString for a
|
and stream-like insertion functions as well as much more - see wxString for a
|
||||||
@@ -49,28 +46,31 @@ in previous versions.
|
|||||||
|
|
||||||
@section overview_string_internal Internal wxString encoding
|
@section overview_string_internal Internal wxString encoding
|
||||||
|
|
||||||
Since wxWidgets 3.0 wxString internally uses <b>UCS-2</b> (with Unicode
|
Since wxWidgets 3.0 wxString internally uses <b>UTF-16</b> (with Unicode
|
||||||
code units stored in @c wchar_t) under Windows and <b>UTF-8</b> (with Unicode
|
code units stored in @c wchar_t) under Windows and <b>UTF-8</b> (with Unicode
|
||||||
code units stored in @c char) under Unix, Linux and Mac OS X to store its content.
|
code units stored in @c char) under Unix, Linux and Mac OS X to store its content.
|
||||||
|
|
||||||
For definitions of <em>code units</em> and <em>code points</em> terms, please
|
For definitions of <em>code units</em> and <em>code points</em> terms, please
|
||||||
see the @ref overview_unicode_encodings paragraph.
|
see the @ref overview_unicode_encodings paragraph.
|
||||||
|
|
||||||
Note that there is a difference about UCS-2 and UTF-16: the first is a fixed-length
|
|
||||||
encoding, without <em>surrogate pairs</em>, while the latter is a
|
|
||||||
variable-length encoding. Except for this the two encodings are identical.
|
|
||||||
|
|
||||||
For simplicity of implementation, wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
|
For simplicity of implementation, wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
|
||||||
(e.g. on Windows) uses UCS-2 and thus doesn't know anything about surrogate pairs;
|
(e.g. on Windows) uses <em>per code unit indexing</em> instead of
|
||||||
it always consider 1 code unit per 1 code point, while this is really true only for
|
<em>per code point indexing</em> and doesn't know anything about surrogate pairs;
|
||||||
characters in the @e BMP (Basic Multilingual Plane).
|
in other words it always considers code points to be composed of 1 code unit,
|
||||||
|
while this is really true only for characters in the @e BMP (Basic Multilingual Plane).
|
||||||
Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user
|
Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user
|
||||||
code has to take care of <em>surrogate pair</em> handling himself.
|
code has to take care of <em>surrogate pairs</em> himself.
|
||||||
(Note however that Windows itself has built-in support for surrogate pairs in UTF-16,
|
(Note however that Windows itself has built-in support for surrogate pairs in UTF-16,
|
||||||
such as for drawing strings on screen.)
|
such as for drawing strings on screen.)
|
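The surrogate pair handling that user code has to do itself can be sketched as follows. This is only an illustration using the standard @c std::u16string rather than wxString; the helper names (@c IsHighSurrogate, @c IsLowSurrogate, @c DecodeSurrogatePair, @c CountCodePointsUtf16) are hypothetical and not part of wxWidgets.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the code unit is a high (leading) surrogate, U+D800..U+DBFF.
inline bool IsHighSurrogate(char16_t cu) { return cu >= 0xD800 && cu <= 0xDBFF; }

// True if the code unit is a low (trailing) surrogate, U+DC00..U+DFFF.
inline bool IsLowSurrogate(char16_t cu) { return cu >= 0xDC00 && cu <= 0xDFFF; }

// Combines a surrogate pair into the code point it encodes.
char32_t DecodeSurrogatePair(char16_t hi, char16_t lo)
{
    assert(IsHighSurrogate(hi) && IsLowSurrogate(lo));
    return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                   + (static_cast<char32_t>(lo) - 0xDC00);
}

// Counts code points in a UTF-16 string: a well-formed surrogate pair
// counts once; an unpaired surrogate is counted as one code point.
std::size_t CountCodePointsUtf16(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (IsHighSurrogate(s[i]) && i + 1 < s.size() && IsLowSurrogate(s[i + 1]))
            ++i; // skip the low surrogate, it belongs to the same code point
        ++count;
    }
    return count;
}
```

For a string holding one character outside the @e BMP, the number of code units is 2 while the function above reports 1 code point, which is the difference between per code unit and per code point indexing.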
||||||
|
|
||||||
|
@remarks
|
||||||
|
Note that while the behaviour of wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
|
||||||
|
resembles UCS-2 encoding, it's not completely correct to refer to wxString as
|
||||||
|
UCS-2 encoded since you can encode characters outside the @e BMP in a wxString.
|
||||||
|
|
||||||
When instead <tt>wxUSE_UNICODE_UTF8==1</tt> (e.g. on Linux and Mac OS X)
|
When instead <tt>wxUSE_UNICODE_UTF8==1</tt> (e.g. on Linux and Mac OS X)
|
||||||
wxString handles UTF8 multi-bytes sequences just fine, so that you can use
|
wxString handles UTF8 multi-byte sequences just fine, also for characters outside
|
||||||
|
the BMP (it implements <em>per code point indexing</em>), so that you can use
|
||||||
UTF8 in a completely transparent way:
|
UTF8 in a completely transparent way:
|
||||||
|
|
||||||
Example:
|
Example:
|
||||||
@@ -89,7 +89,7 @@ Example:
|
|||||||
wxPrintf("wxString reports a length of %d character(s)", test.length());
|
wxPrintf("wxString reports a length of %d character(s)", test.length());
|
||||||
// prints "wxString reports a length of 1 character(s)" on Linux
|
// prints "wxString reports a length of 1 character(s)" on Linux
|
||||||
// prints "wxString reports a length of 2 character(s)" on Windows
|
// prints "wxString reports a length of 2 character(s)" on Windows
|
||||||
// since Windows doesn't have surrogate pairs support!
|
// since wxString on Windows doesn't have surrogate pairs support!
|
||||||
|
|
||||||
|
|
||||||
// second test, this time using characters part of the Unicode BMP:
|
// second test, this time using characters part of the Unicode BMP:
|
||||||
@@ -113,16 +113,29 @@ above; it's composed by 3 characters and the final @c NULL:
|
|||||||
|
|
||||||
@image html overview_wxstring_encoding.png
|
@image html overview_wxstring_encoding.png
|
||||||
|
|
||||||
As you can see, UCS2/UTF16 encoding is straightforward (for characters in the @e BMP)
|
As you can see, UTF16 encoding is straightforward (for characters in the @e BMP)
|
||||||
and in this example the UCS2-encoded wxString takes 8 bytes.
|
and in this example the UTF16-encoded wxString takes 8 bytes.
|
||||||
UTF8 encoding is more elaborated and in this example takes 7 bytes.
|
UTF8 encoding is more elaborated and in this example takes 7 bytes.
|
||||||
|
|
||||||
The type used by wxString to store Unicode code units is called wxStringCharType.
|
|
||||||
|
|
||||||
In general, for strings containing many latin characters UTF8 provides a big
|
In general, for strings containing many latin characters UTF8 provides a big
|
||||||
advantage in memory footprint respect UTF16, but requires some more processing
|
advantage in memory footprint compared to UTF16, but requires some
|
||||||
for common operations like e.g. length calculation.
|
more processing for common operations like e.g. length calculation.
|
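To see why length calculation and indexing cost more with UTF8, here is a minimal sketch over a plain @c std::string that is assumed to hold valid UTF-8. It is not how wxString is implemented; @c Utf8SequenceLength, @c CountCodePointsUtf8 and @c ByteOffsetOfCodePoint are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by the lead byte.
std::size_t Utf8SequenceLength(unsigned char lead)
{
    assert((lead & 0xC0) != 0x80); // a continuation byte cannot start a sequence
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

// Counts code points: every byte that is not a continuation byte
// (bit pattern 10xxxxxx) starts a new code point. Needs a full scan.
std::size_t CountCodePointsUtf8(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
        if ((byte & 0xC0) != 0x80)
            ++count;
    return count;
}

// Byte offset of the code point with the given index; the string has to be
// walked from the start because sequences have variable length.
std::size_t ByteOffsetOfCodePoint(const std::string& s, std::size_t index)
{
    std::size_t offset = 0;
    while (index > 0 && offset < s.size())
    {
        offset += Utf8SequenceLength(static_cast<unsigned char>(s[offset]));
        --index;
    }
    return offset;
}
```

Both operations are linear in the string length here, whereas with fixed-size code units they would be constant time; implementations typically mitigate this with caching.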
||||||
|
|
||||||
|
Finally, note that the type used by wxString to store Unicode code units
|
||||||
|
(@c wchar_t or @c char) is always @c typedef-ined to be ::wxStringCharType.
|
||||||
|
|
||||||
|
|
||||||
|
@section overview_string_binary Using wxString to store binary data
|
||||||
|
|
||||||
|
wxString can be used to store binary data (even if it contains @c NULs) using the
|
||||||
|
functions wxString::To8BitData and wxString::From8BitData.
|
||||||
|
|
||||||
|
Beware that even though @c NUL characters are allowed, in the current string implementation
|
||||||
|
some methods might not work correctly with them.
|
||||||
|
|
||||||
|
Note however that other classes like wxMemoryBuffer are more suited to this task.
|
||||||
|
For handling binary data you may also want to look at the wxStreamBuffer,
|
||||||
|
wxMemoryOutputStream, wxMemoryInputStream classes.
|
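The kind of problem embedded @c NULs cause can be shown with a small sketch. It deliberately uses @c std::string instead of wxString, so it says nothing about which wxString methods are affected; @c FromBinary and @c CStringLength are hypothetical helpers for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Builds a string from a raw buffer using an explicit length,
// so that embedded NUL bytes are preserved.
std::string FromBinary(const char* data, std::size_t len)
{
    assert(data != nullptr);
    return std::string(data, len);
}

// Length as seen by any API that treats the buffer as a NUL-terminated
// C string: it stops at the first NUL byte.
std::size_t CStringLength(const std::string& s)
{
    return std::strlen(s.c_str());
}
```

With the 4-byte buffer <tt>{'a', '\0', 'b', 'c'}</tt> the string itself keeps all 4 bytes, but code going through the C string view only sees the first one. Any method that internally relies on NUL termination is exposed to the same truncation.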
||||||
|
|
||||||
|
|
||||||
@section overview_string_comparison Comparison to Other String Classes
|
@section overview_string_comparison Comparison to Other String Classes
|
||||||
@@ -364,11 +377,16 @@ difference the change to @c EXTRA_ALLOC makes to your program.
|
|||||||
|
|
||||||
Much work has been done to make existing code using ANSI string literals
|
Much work has been done to make existing code using ANSI string literals
|
||||||
work as before version 3.0.
|
work as before version 3.0.
|
||||||
|
|
||||||
If you nonetheless need to have a wxString that uses @c wchar_t
|
If you nonetheless need to have a wxString that uses @c wchar_t
|
||||||
on Unix and Linux, too, you can specify this on the command line with the
|
on Unix and Linux, too, you can specify this on the command line with the
|
||||||
@c configure @c --disable-utf8 switch or you can consider using wxUString
|
@c configure @c --disable-utf8 switch or you can consider using wxUString
|
||||||
or @c std::wstring instead.
|
or @c std::wstring instead.
|
||||||
|
|
||||||
|
@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
|
||||||
|
If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
|
||||||
|
also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
|
||||||
|
See also @ref page_wxusedef_important.
|
||||||
|
|
||||||
*/
|
*/
|
||||||
|
|
||||||
|
@@ -49,8 +49,8 @@ other services should be ready to deal with Unicode.
|
|||||||
|
|
||||||
When working with Unicode, it's important to define the meaning of some terms.
|
When working with Unicode, it's important to define the meaning of some terms.
|
||||||
|
|
||||||
A <b><em>glyph</em></b> is a particular image that represents a character or part
|
A <b><em>glyph</em></b> is a particular image (usually part of a font) that
|
||||||
of a character.
|
represents a character or part of a character.
|
||||||
Any character may have one or more associated glyphs; e.g. some of the possible
|
Any character may have one or more associated glyphs; e.g. some of the possible
|
||||||
glyphs for the capital letter 'A' are:
|
glyphs for the capital letter 'A' are:
|
||||||
|
|
||||||
@@ -60,7 +60,13 @@ Unicode assigns each character of almost any existing alphabet/script a number,
|
|||||||
which is called <b><em>code point</em></b>; it's typically indicated in documentation
|
which is called <b><em>code point</em></b>; it's typically indicated in documentation
|
||||||
manuals and on the Unicode website as @c U+xxxx where @c xxxx is a hexadecimal number.
|
manuals and on the Unicode website as @c U+xxxx where @c xxxx is a hexadecimal number.
|
||||||
|
|
||||||
The Unicode standard divides the space of all possible code points in @e planes;
|
Note that typically one character is assigned exactly one code point, but there
|
||||||
|
are exceptions; the so-called <em>precomposed characters</em>
|
||||||
|
(see http://en.wikipedia.org/wiki/Precomposed_character) or the <em>ligatures</em>.
|
||||||
|
In these cases a single "character" may be mapped to more than one code point or
|
||||||
|
vice versa, several characters may be mapped to a single code point.
|
||||||
|
|
||||||
|
The Unicode standard divides the space of all possible code points in <b><em>planes</em></b>;
|
||||||
a plane is a range of 65,536 (10000 in hexadecimal) contiguous Unicode code points.
|
a plane is a range of 65,536 (10000 in hexadecimal) contiguous Unicode code points.
|
||||||
Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic
|
Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic
|
||||||
Multilingual Plane.
|
Multilingual Plane.
|
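Since each plane spans 65,536 code points, the plane of a code point is simply its value divided by 65,536. A minimal sketch, with the hypothetical helper names @c PlaneOf and @c IsInBmp:

```cpp
#include <cassert>

// Returns the plane (0..16) a code point belongs to; each plane
// covers 0x10000 contiguous code points.
unsigned PlaneOf(char32_t cp)
{
    assert(cp <= 0x10FFFF); // highest valid Unicode code point
    return static_cast<unsigned>(cp >> 16);
}

// True if the code point lies in plane 0, the Basic Multilingual Plane.
bool IsInBmp(char32_t cp)
{
    return PlaneOf(cp) == 0;
}
```

For instance U+20AC (euro sign) is in the @e BMP, while U+1D11E (musical symbol G clef) is in plane 1.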
||||||
@@ -73,7 +79,7 @@ Code points are represented in computer memory as a sequence of one or more
|
|||||||
More precisely, a code unit is the minimal bit combination that can represent a
|
More precisely, a code unit is the minimal bit combination that can represent a
|
||||||
unit of encoded text for processing or interchange.
|
unit of encoded text for processing or interchange.
|
||||||
|
|
||||||
The @e UTF or Unicode Transformation Formats are algorithms mapping the Unicode
|
The <b><em>UTF</em></b> or Unicode Transformation Formats are algorithms mapping the Unicode
|
||||||
code points to code unit sequences. The simplest of them is <b>UTF-32</b> where
|
code points to code unit sequences. The simplest of them is <b>UTF-32</b> where
|
||||||
each code unit is composed of 32 bits (4 bytes) and each code point is always
|
each code unit is composed of 32 bits (4 bytes) and each code point is always
|
||||||
represented by a single code unit (fixed length encoding).
|
represented by a single code unit (fixed length encoding).
|
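The mapping from one code point to code units can be sketched for the two variable-length formats. This is a standalone illustration using standard library types, not wxWidgets conversion code; @c EncodeUtf16 and @c EncodeUtf8 are hypothetical names, and the input is assumed to be a valid code point that is not itself a surrogate.

```cpp
#include <cassert>
#include <string>

// Encodes one code point as 1 or 2 UTF-16 code units.
std::u16string EncodeUtf16(char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000)
    {
        out.push_back(static_cast<char16_t>(cp)); // BMP: a single code unit
    }
    else
    {
        cp -= 0x10000; // 20 bits remain, split into two halves of 10 bits
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}

// Encodes one code point as 1 to 4 UTF-8 code units (bytes).
std::string EncodeUtf8(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80)
    {
        out.push_back(static_cast<char>(cp));
    }
    else if (cp < 0x800)
    {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else if (cp < 0x10000)
    {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

In UTF-32 the same code point would always occupy exactly one 32-bit code unit, so no such branching is needed.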
||||||
@@ -129,7 +135,7 @@ programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME.
|
|||||||
However, unlike the Unicode build mode of the previous versions of wxWidgets, this
|
However, unlike the Unicode build mode of the previous versions of wxWidgets, this
|
||||||
support is mostly transparent: you can still continue to work with the @b narrow
|
support is mostly transparent: you can still continue to work with the @b narrow
|
||||||
(i.e. current locale-encoded @c char*) strings even if @b wide
|
(i.e. current locale-encoded @c char*) strings even if @b wide
|
||||||
(i.e. UTF16/UCS2-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
|
(i.e. UTF16-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
|
||||||
supported. Any wxWidgets function accepts arguments of either type as both
|
supported. Any wxWidgets function accepts arguments of either type as both
|
||||||
kinds of strings are implicitly converted to wxString, so both
|
kinds of strings are implicitly converted to wxString, so both
|
||||||
@code
|
@code
|
||||||
@@ -386,7 +392,7 @@ function directly.
|
|||||||
|
|
||||||
@section overview_unicode_settings Unicode Related Compilation Settings
|
@section overview_unicode_settings Unicode Related Compilation Settings
|
||||||
|
|
||||||
@c wxUSE_UNICODE is now defined as 1 by default to indicate Unicode support.
|
@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
|
||||||
If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
|
If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
|
||||||
also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
|
also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
|
||||||
|
|
||||||
|