Expand wxString overview and document some problems due to its dual nature.
Explain the possible problems with wxString due to its dual Unicode/ASCII nature. Also document the various conversions in the overview itself.

Closes #14694.

git-svn-id: https://svn.wxwidgets.org/svn/wx/wxWidgets/trunk@73905 c3d73ce0-8a6f-49c7-b76d-6d57e0e08775
@@ -10,34 +10,275 @@
|
|||||||
/**
|
/**
|
||||||
@class wxString
|
@class wxString
|
||||||
|
|
||||||
The wxString class has been completely rewritten for wxWidgets 3.0
|
String class for passing textual data to or receiving it from wxWidgets.
|
||||||
and this change was actually the main reason for calling that
|
|
||||||
version wxWidgets 3.0.
|
|
||||||
|
|
||||||
wxString is a class representing a Unicode character string.
|
@note
|
||||||
wxString uses @c std::basic_string internally (even if @c wxUSE_STL is not defined)
|
While the use of wxString is unavoidable in wxWidgets programs, you are
|
||||||
to store its content (unless this is not supported by the compiler or disabled
|
encouraged to use the standard string classes @c std::string or @c
|
||||||
specifically when building wxWidgets) and it therefore inherits
|
std::wstring in your applications and convert them to and from wxString
|
||||||
many features from @c std::basic_string. (Note that most implementations of
|
only when interacting with wxWidgets.
|
||||||
@c std::basic_string are thread-safe and don't use reference counting.)
|
|
||||||
|
|
||||||
These @c std::basic_string standard functions are only listed here, but
|
|
||||||
they are not fully documented in this manual; see the STL documentation
|
|
||||||
(http://www.cppreference.com/wiki/string/start) for more info.
|
|
||||||
The behaviour of all these functions is identical to the behaviour
|
|
||||||
described there.
|
|
||||||
|
|
||||||
You may notice that wxString sometimes has several functions which do
|
wxString is a class representing a Unicode character string but with
|
||||||
the same thing like Length(), Len() and length() which all return the
|
methods taking or returning both @c wchar_t wide characters and @c wchar_t*
|
||||||
string length. In all cases of such duplication the @c std::string
|
wide strings and traditional @c char characters and @c char* strings. The
|
||||||
compatible methods should be used.
|
dual nature of wxString API makes it simple to use in all cases and,
|
||||||
|
importantly, allows the code written for either ANSI or Unicode builds of
|
||||||
|
the previous wxWidgets versions to compile and work correctly with the
|
||||||
|
single unified Unicode build of wxWidgets 3.0. It is also mostly
|
||||||
|
transparent when using wxString with the few exceptions described below.
|
||||||
|
|
||||||
For information about the internal encoding used by wxString and
|
|
||||||
for important warnings and advice on using it, please read
|
|
||||||
the @ref overview_string.
|
|
||||||
|
|
||||||
Since wxWidgets 3.0 wxString always stores Unicode strings, so you should
|
@section string_api API overview
|
||||||
be sure to read also @ref overview_unicode.
|
|
||||||
|
wxString tries to be similar to both @c std::string and @c std::wstring and
|
||||||
|
can mostly be used as either class. It provides practically all of the
|
||||||
|
methods of these classes, which behave exactly the same as in the standard
|
||||||
|
C++, and so are not documented here (please see any standard library
|
||||||
|
documentation, for example http://en.cppreference.com/w/cpp/string for more
|
||||||
|
details).
|
||||||
|
|
||||||
|
In addition to these standard methods, wxString adds functions dealing with
|
||||||
|
the conversions between different string encodings, described below, as
|
||||||
|
well as many extra helpers such as functions for formatted output
|
||||||
|
(Printf(), Format(), ...), case conversion (MakeUpper(), Capitalize(), ...)
|
||||||
|
and various others (Trim(), StartsWith(), Matches(), ...). All of the
|
||||||
|
non-standard methods follow wxWidgets "CamelCase" naming convention and are
|
||||||
|
documented here.
|
||||||
|
|
||||||
|
Notice that some wxString methods exist in several versions for
|
||||||
|
compatibility reasons. For example all of length(), Length() and Len() are
|
||||||
|
provided. In such cases it is recommended to use the standard string-like
|
||||||
|
method, i.e. length() in this case.
|
||||||
|
|
||||||
|
|
||||||
|
@section string_conv Converting to and from wxString
|
||||||
|
|
||||||
|
wxString can be created from:
|
||||||
|
- ASCII string guaranteed to contain only 7 bit characters using
|
||||||
|
wxString::FromAscii().
|
||||||
|
- Narrow @c char* string in the current locale encoding using implicit
|
||||||
|
wxString::wxString(const char*) constructor.
|
||||||
|
- Narrow @c char* string in UTF-8 encoding using wxString::FromUTF8().
|
||||||
|
- Narrow @c char* string in the given encoding using
|
||||||
|
wxString::wxString(const char*, const wxMBConv&) constructor passing a
|
||||||
|
wxCSConv corresponding to the encoding as the second argument.
|
||||||
|
- Standard @c std::string using implicit wxString::wxString(const
|
||||||
|
std::string&) constructor. Notice that this constructor assumes that
|
||||||
|
the string contains data in the current locale encoding, use FromUTF8()
|
||||||
|
or the constructor taking wxMBConv if this is not the case.
|
||||||
|
- Wide @c wchar_t* string using implicit wxString::wxString(const
|
||||||
|
wchar_t*) constructor.
|
||||||
|
- Standard @c std::wstring using implicit wxString::wxString(const
|
||||||
|
std::wstring&) constructor.
|
||||||
|
|
||||||
|
Notice that many of the constructors are implicit, meaning that you don't
|
||||||
|
even need to write them at all to pass the existing string to some
|
||||||
|
wxWidgets function taking a wxString.
|
||||||
|
|
||||||
|
Similarly, wxString can be converted to:
|
||||||
|
- ASCII string using wxString::ToAscii(). This is a potentially
|
||||||
|
destructive operation as all non-ASCII string characters are replaced
|
||||||
|
with a placeholder character.
|
||||||
|
- String in the current locale encoding implicitly or using c_str() or
|
||||||
|
mb_str() methods. This is a potentially destructive operation as an @e
|
||||||
|
empty string is returned if the conversion fails.
|
||||||
|
- String in UTF-8 encoding using wxString::utf8_str().
|
||||||
|
- String in any given encoding using mb_str() with the appropriate
|
||||||
|
wxMBConv object. This is also a potentially destructive operation.
|
||||||
|
- Standard @c std::string using wxString::ToStdString(). The contents
|
||||||
|
of the returned string use the current locale encoding, so this
|
||||||
|
conversion is potentially destructive as well.
|
||||||
|
- Wide C string using wxString::wc_str().
|
||||||
|
- Standard @c std::wstring using wxString::ToStdWstring().
|
||||||
|
|
||||||
|
@note If you built wxWidgets with @c wxUSE_STL set to 1, the implicit
|
||||||
|
conversions to both narrow and wide C strings are disabled and replaced
|
||||||
|
with implicit conversions to @c std::string and @c std::wstring.
|
||||||
|
|
||||||
|
Please notice that the conversions marked as "potentially destructive"
|
||||||
|
above can result in loss of data if their result is not checked, so you
|
||||||
|
need to verify that converting the contents of a non-empty Unicode string
|
||||||
|
to a non-UTF-8 multibyte encoding results in non-empty string. The simplest
|
||||||
|
and best way to ensure that the conversion never fails is to always use
|
||||||
|
UTF-8.
|
||||||
|
|
||||||
|
|
||||||
|
@section string_gotchas Traps for the unwary
|
||||||
|
|
||||||
|
As mentioned above, wxString tries to be compatible with both narrow and
|
||||||
|
wide standard string classes and mostly does it transparently, but there
|
||||||
|
are some exceptions.
|
||||||
|
|
||||||
|
@subsection string_gotchas_element String element access
|
||||||
|
|
||||||
|
Some problems are caused by wxString::operator[]() which returns an object
|
||||||
|
of a special proxy class that lets you assign either a plain @c char or a @c
|
||||||
|
wchar_t to the given index. Because of this, the return type of this
|
||||||
|
operator is neither @c char nor @c wchar_t nor a reference to one of these
|
||||||
|
types but wxUniCharRef which is not a primitive type and hence can't be
|
||||||
|
used in a @c switch statement. So the following code does @e not compile:
|
||||||
|
@code
|
||||||
|
wxString s(...);
|
||||||
|
switch ( s[n] ) {
|
||||||
|
case 'A':
|
||||||
|
...
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
@endcode
|
||||||
|
and you need to use
|
||||||
|
@code
|
||||||
|
switch ( s[n].GetValue() ) {
|
||||||
|
...
|
||||||
|
}
|
||||||
|
@endcode
|
||||||
|
instead. Alternatively, you can use an explicit cast:
|
||||||
|
@code
|
||||||
|
switch ( static_cast<char>(s[n]) ) {
|
||||||
|
...
|
||||||
|
}
|
||||||
|
@endcode
|
||||||
|
but notice that this will result in an assert failure if the character at
|
||||||
|
the given position is not representable as a single @c char in the current
|
||||||
|
encoding, so you may want to cast to @c int instead if non-ASCII values can
|
||||||
|
be used.
|
||||||
|
|
||||||
|
Another consequence of this unusual return type arises when it is used with
|
||||||
|
template type deduction or the C++11 @c auto keyword. Unlike with normal
|
||||||
|
references which are deduced to be of the referenced type, the deduced type
|
||||||
|
for wxUniCharRef is wxUniCharRef itself. This results in potentially
|
||||||
|
unexpected behaviour, for example:
|
||||||
|
@code
|
||||||
|
wxString s("abc");
|
||||||
|
auto c = s[0];
|
||||||
|
c = 'x'; // Modifies the string!
|
||||||
|
wxASSERT( s == "xbc" );
|
||||||
|
@endcode
|
||||||
|
Due to this, either explicitly specify the variable type:
|
||||||
|
@code
|
||||||
|
int c = s[0];
|
||||||
|
c = 'x'; // Doesn't modify the string any more.
|
||||||
|
wxASSERT( s == "abc" );
|
||||||
|
@endcode
|
||||||
|
or explicitly convert the return value:
|
||||||
|
@code
|
||||||
|
auto c = s[0].GetValue();
|
||||||
|
c = 'x'; // Doesn't modify the string either.
|
||||||
|
wxASSERT( s == "abc" );
|
||||||
|
@endcode
|
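The same effect can be reproduced without wxWidgets, which may help to see that it is a general property of proxy references rather than something specific to wxUniCharRef. The sketch below uses @c std::vector<bool>, whose operator[]() also returns a proxy object. The function names are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <vector>

// Copying a proxy with `auto` keeps the link to the container.
inline bool auto_copy_modifies_container()
{
    std::vector<bool> flags(3, false);
    auto first = flags[0];   // deduced as std::vector<bool>::reference, a proxy
    first = true;            // writes through the proxy into flags
    return flags[0];
}

// Naming the value type explicitly makes a real copy.
inline bool explicit_type_modifies_container()
{
    std::vector<bool> flags(3, false);
    bool first = flags[0];   // the proxy is converted to a plain bool here
    first = true;            // only the local copy changes
    return flags[0];
}
```

The first function reports that the container was modified and the second one that it was not, mirroring the @c auto and @c int variants of the wxString snippets above.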
||||||
|
|
||||||
|
|
||||||
|
@subsection string_gotchas_conv Conversion to C string
|
||||||
|
|
||||||
|
A different class of problems happens due to the dual nature of the return
|
||||||
|
value of wxString::c_str() method, which is also used for implicit
|
||||||
|
conversions. The result of calls to this method is convertible to either
|
||||||
|
narrow @c char* string or wide @c wchar_t* string and so, again, has
|
||||||
|
neither the former nor the latter type. Usually, the correct type will be
|
||||||
|
chosen depending on how you use the result but sometimes the compiler can't
|
||||||
|
choose it because of an ambiguity, e.g.:
|
||||||
|
@code
|
||||||
|
// Some non-wxWidgets functions existing for both narrow and wide
|
||||||
|
// strings:
|
||||||
|
void dump_text(const char* text); // Version (1)
|
||||||
|
void dump_text(const wchar_t* text); // Version (2)
|
||||||
|
|
||||||
|
wxString s(...);
|
||||||
|
dump_text(s); // ERROR: ambiguity.
|
||||||
|
dump_text(s.c_str()); // ERROR: still ambiguous.
|
||||||
|
@endcode
|
||||||
|
In this case you need to explicitly convert to the type that you need to
|
||||||
|
use or use a different, non-ambiguous, conversion function (which is
|
||||||
|
usually the best choice):
|
||||||
|
@code
|
||||||
|
dump_text(static_cast<const char*>(s)); // OK, calls (1)
|
||||||
|
dump_text(static_cast<const wchar_t*>(s.c_str())); // OK, calls (2)
|
||||||
|
dump_text(s.mb_str()); // OK, calls (1)
|
||||||
|
dump_text(s.wc_str()); // OK, calls (2)
|
||||||
|
dump_text(s.wx_str()); // OK, calls ???
|
||||||
|
@endcode
|
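To see why the compiler cannot choose, here is a minimal sketch using only standard C++. @c DualCStr is a hypothetical stand-in for the object returned by wxString::c_str() and is not the real wxWidgets type; @c which_overload() plays the role of @c dump_text() and returns the number of the version that was called.

```cpp
#include <cassert>

// Hypothetical stand-in for the result of wxString::c_str(): it converts
// implicitly to both narrow and wide C strings.
struct DualCStr
{
    const char* narrow;
    const wchar_t* wide;

    operator const char*() const { return narrow; }
    operator const wchar_t*() const { return wide; }
};

// Overloads mirroring dump_text(); the return value records which one ran.
inline int which_overload(const char*) { return 1; }
inline int which_overload(const wchar_t*) { return 2; }

inline int call_narrow(const DualCStr& s)
{
    // which_overload(s) would not compile: both conversions are equally good.
    return which_overload(static_cast<const char*>(s));
}

inline int call_wide(const DualCStr& s)
{
    return which_overload(static_cast<const wchar_t*>(s));
}
```

With the explicit cast only one conversion operator is applicable, so overload resolution has a single candidate left, which is the same reason the casts and the mb_str() and wc_str() calls above compile.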
||||||
|
|
||||||
|
@subsection string_vararg Using wxString with vararg functions
|
||||||
|
|
||||||
|
A special subclass of the problems arising due to the polymorphic nature of
|
||||||
|
wxString::c_str() result type happens when using functions taking an
|
||||||
|
arbitrary number of arguments, such as the standard @c printf(). Due to the
|
||||||
|
rules of the C++ language, the types for the "variable" arguments of such
|
||||||
|
functions are not specified and hence the compiler cannot convert wxString
|
||||||
|
objects, or the objects returned by wxString::c_str(), to these unknown
|
||||||
|
types automatically. Hence neither wxString objects nor the results of most
|
||||||
|
of the conversion functions can be passed as vararg arguments:
|
||||||
|
@code
|
||||||
|
// ALL EXAMPLES HERE DO NOT WORK, DO NOT USE THEM!
|
||||||
|
printf("Don't do this: %s", s);
|
||||||
|
printf("Don't do that: %s", s.c_str());
|
||||||
|
printf("Nor even this: %s", s.mb_str());
|
||||||
|
wprintf(L"And even not always this: %s", s.wc_str());
|
||||||
|
@endcode
|
||||||
|
Instead you need to either explicitly cast to the needed type:
|
||||||
|
@code
|
||||||
|
// These examples work but are not the best solution, see below.
|
||||||
|
printf("You can do this: %s", static_cast<const char*>(s));
|
||||||
|
printf("Or this: %s", static_cast<const char*>(s.c_str()));
|
||||||
|
printf("And this: %s", static_cast<const char*>(s.mb_str()));
|
||||||
|
wprintf(L"Or this: %ls", static_cast<const wchar_t*>(s.wc_str()));
|
||||||
|
@endcode
|
||||||
|
But a better solution is to use wxWidgets-provided functions, if possible,
|
||||||
|
as is the case for @c printf family of functions:
|
||||||
|
@code
|
||||||
|
// This is the recommended way.
|
||||||
|
wxPrintf("You can do just this: %s", s);
|
||||||
|
wxPrintf("And this (but it is redundant): %s", s.c_str());
|
||||||
|
wxPrintf("And this (not using Unicode): %s", s.mb_str());
|
||||||
|
wxPrintf("And this (always Unicode): %s", s.wc_str());
|
||||||
|
@endcode
|
||||||
|
Notice that wxPrintf() replaces both @c printf() and @c wprintf() and
|
||||||
|
accepts wxString objects, results of c_str() calls but also @c char* and
|
||||||
|
@c wchar_t* strings directly.
|
||||||
|
|
||||||
|
wxWidgets provides wx-prefixed equivalents to all the standard vararg
|
||||||
|
functions and a few more, notably wxString::Format(), wxLogMessage(),
|
||||||
|
wxLogError() and other log functions. But if you can't use one of those
|
||||||
|
functions and need to pass wxString objects to non-wx vararg functions, you
|
||||||
|
need to use the explicit casts as explained above.
|
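The explicit cast rule is not specific to wxString: any class object passed through "..." reaches the function unconverted. The following sketch shows the cast with a standard vararg function; @c NarrowText and @c format_greeting() are hypothetical names invented for this example and only assume a class with an implicit conversion to @c const @c char*.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical string class with an implicit conversion to a C string.
class NarrowText
{
public:
    explicit NarrowText(std::string text) : m_text(std::move(text)) {}
    operator const char*() const { return m_text.c_str(); }

private:
    std::string m_text;
};

inline std::string format_greeting(const NarrowText& name)
{
    char buffer[64];
    // Passing `name` itself would hand a class object to "...", where no
    // conversion takes place; the cast makes the argument a real const char*.
    std::snprintf(buffer, sizeof buffer, "Hello, %s!",
                  static_cast<const char*>(name));
    return buffer;
}
```

For example, format_greeting(NarrowText("wx")) yields the string "Hello, wx!".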
||||||
|
|
||||||
|
|
||||||
|
@section string_performance Performance characteristics
|
||||||
|
|
||||||
|
wxString uses @c std::basic_string internally to store its content (unless
|
||||||
|
this is not supported by the compiler or disabled specifically when
|
||||||
|
building wxWidgets) and it therefore inherits many features from @c
|
||||||
|
std::basic_string. In particular, most modern implementations of @c
|
||||||
|
std::basic_string are thread-safe and don't use reference counting (making
|
||||||
|
copying large strings potentially expensive) and so wxString has the same
|
||||||
|
characteristics.
|
||||||
|
|
||||||
|
By default, wxString uses @c std::basic_string specialized for the
|
||||||
|
platform-dependent @c wchar_t type, meaning that it is not memory-efficient
|
||||||
|
for ASCII strings, especially under Unix platforms where every ASCII
|
||||||
|
character, normally fitting in a byte, is represented by a 4 byte @c
|
||||||
|
wchar_t.
|
||||||
|
|
||||||
|
It is possible to build wxWidgets with @c wxUSE_UNICODE_UTF8 set to 1 in
|
||||||
|
which case a UTF-8-encoded string representation is stored in @c
|
||||||
|
std::basic_string specialized for @c char, i.e. the usual @c std::string.
|
||||||
|
In this case the memory efficiency problem mentioned above doesn't arise
|
||||||
|
but run-time performance of many wxString methods changes dramatically, in
|
||||||
|
particular accessing the N-th character of the string becomes an operation
|
||||||
|
taking O(N) time instead of O(1), i.e. constant, time by default. Thus, if
|
||||||
|
you do use this so-called UTF-8 build, you should avoid using indices to
|
||||||
|
access the strings whenever possible and use the iterators instead. As an
|
||||||
|
example, traversing the string using iterators is an O(N), where N is the
|
||||||
|
string length, operation in both the normal ("wchar_t") and UTF-8 builds
|
||||||
|
but doing it using indices becomes O(N^2) in UTF-8 case meaning that simply
|
||||||
|
checking every character of a reasonably long (e.g. a couple of million
|
||||||
|
elements) string can take an unreasonably long time.
|
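The cost difference comes from the variable width of UTF-8 characters, and can be sketched on a plain @c std::string holding UTF-8 data. This is not how wxString is implemented; the helper names are invented for the illustration and the code assumes well-formed UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for the first byte of an encoded character, i.e. anything that is not
// a continuation byte of the form 10xxxxxx.
inline bool is_lead_byte(unsigned char byte)
{
    return (byte & 0xC0) != 0x80;
}

// Byte offset of the n-th character: the scan must start at the beginning,
// so a single lookup is O(n). Returns utf8.size() if n is out of range.
inline std::size_t offset_of_char(const std::string& utf8, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        if (is_lead_byte(static_cast<unsigned char>(utf8[i])))
        {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return utf8.size();
}

// Visiting every character in one pass is O(N), like iterator traversal.
inline std::size_t count_chars(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
        if (is_lead_byte(byte))
            ++count;
    return count;
}
```

Calling offset_of_char() once per index in a loop repeats the scan from the start every time, which is where the O(N^2) total comes from, whereas a single forward pass such as count_chars() touches each byte only once.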
||||||
|
|
||||||
|
However, if you do use iterators, the UTF-8 build can be a better choice than
|
||||||
|
the default build, especially for memory-constrained embedded systems.
|
||||||
|
Notice also that GTK+ and DirectFB use UTF-8 internally, so using this
|
||||||
|
build not only saves memory for ASCII strings but also avoids conversions
|
||||||
|
between wxWidgets and the underlying toolkit.
|
||||||
|
|
||||||
|
|
||||||
@section string_index Index of the member groups
|
@section string_index Index of the member groups
|
||||||
@@ -497,9 +738,12 @@ public:
|
|||||||
/**
|
/**
|
||||||
Converts the string to an ASCII, 7-bit string in the form of
|
Converts the string to an ASCII, 7-bit string in the form of
|
||||||
a wxCharBuffer (Unicode builds only) or a C string (ANSI builds).
|
a wxCharBuffer (Unicode builds only) or a C string (ANSI builds).
|
||||||
Note that this conversion only works if the string contains only ASCII
|
|
||||||
characters. The @ref mb_str() "mb_str" method provides more
|
Note that this conversion is only lossless if the string contains only
|
||||||
powerful means of converting wxString to C string.
|
ASCII characters as all the non-ASCII ones are replaced with the @c '_'
|
||||||
|
(underscore) character.
|
||||||
|
|
||||||
|
Use mb_str() or utf8_str() to convert to other encodings.
|
||||||
*/
|
*/
|
||||||
const char* ToAscii() const;
|
const char* ToAscii() const;
|
||||||
|
|
||||||
|