Expand wxString overview and document some problems due to its dual nature.

Explain the possible problems with wxString due to its dual Unicode/ASCII nature. Also document the various conversions in the overview itself.

Closes #14694.

git-svn-id: https://svn.wxwidgets.org/svn/wx/wxWidgets/trunk@73905 c3d73ce0-8a6f-49c7-b76d-6d57e0e08775
2013-05-02 22:08:15 +00:00
parent 4322906f97
commit e33efe4839
1 changed files with 270 additions and 26 deletions
--- a/interface/wx/string.h
+++ b/interface/wx/string.h
@@ -10,34 +10,275 @@
 /**
    @class wxString
-    The wxString class has been completely rewritten for wxWidgets 3.0
-    and this change was actually the main reason for the calling that
-    version wxWidgets 3.0.
-    wxString is a class representing a Unicode character string.
-    wxString uses @c std::basic_string internally (even if @c wxUSE_STL is not defined)
-    to store its content (unless this is not supported by the compiler or disabled
-    specifically when building wxWidgets) and it therefore inherits
-    many features from @c std::basic_string. (Note that most implementations of
-    @c std::basic_string are thread-safe and don't use reference counting.)
-    These @c std::basic_string standard functions are only listed here, but
-    they are not fully documented in this manual; see the STL documentation
-    (http://www.cppreference.com/wiki/string/start) for more info.
-    The behaviour of all these functions is identical to the behaviour
-    described there.
-    You may notice that wxString sometimes has several functions which do
-    the same thing like Length(), Len() and length() which all return the
-    string length. In all cases of such duplication the @c std::string
-    compatible methods should be used.
-    For informations about the internal encoding used by wxString and
-    for important warnings and advices for using it, please read
-    the @ref overview_string.
-    Since wxWidgets 3.0 wxString always stores Unicode strings, so you should
-    be sure to read also @ref overview_unicode.
+    String class for passing textual data to or receiving it from wxWidgets.
+    @note
+    While the use of wxString is unavoidable in wxWidgets program, you are
+    encouraged to use the standard string classes @c std::string or @c
+    std::wstring in your applications and convert them to and from wxString
+    only when interacting with wxWidgets.
+    wxString is a class representing a Unicode character string but with
+    methods taking or returning both @c wchar_t wide characters and @c wchar_t*
+    wide strings and traditional @c char characters and @c char* strings. The
+    dual nature of wxString API makes it simple to use in all cases and,
+    importantly, allows the code written for either ANSI or Unicode builds of
+    the previous wxWidgets versions to compile and work correctly with the
+    single unified Unicode build of wxWidgets 3.0. It is also mostly
+    transparent when using wxString with the few exceptions described below.
+    @section string_api API overview
+
    wxString tries to be similar to both @c std::string and @c std::wstring and
    can mostly be used as either class. It provides practically all of the
    methods of these classes, which behave exactly the same as in the standard
    C++, and so are not documented here (please see any standard library
    documentation, for example http://en.cppreference.com/w/cpp/string for more
    details).
    In addition to these standard methods, wxString adds functions dealing with
    the conversions between different string encodings, described below, as
    well as many extra helpers such as functions for formatted output
    (Printf(), Format(), ...), case conversion (MakeUpper(), Capitalize(), ...)
    and various others (Trim(), StartsWith(), Matches(), ...). All of the
    non-standard methods follow wxWidgets "CamelCase" naming convention and are
    documented here.
    Notice that some wxString methods exist in several versions for
    compatibility reasons. For example all of length(), Length() and Len() are
    provided. In such cases it is recommended to use the standard string-like
    method, i.e. length() in this case.
    @section string_conv Converting to and from wxString
    wxString can be created from:
        - ASCII string guaranteed to contain only 7 bit characters using
        wxString::FromAscii().
        - Narrow @c char* string in the current locale encoding using implicit
        wxString::wxString(const char*) constructor.
        - Narrow @c char* string in UTF-8 encoding using wxString::FromUTF8().
        - Narrow @c char* string in the given encoding using
        wxString::wxString(const char*, const wxMBConv&) constructor passing a
        wxCSConv corresponding to the encoding as the second argument.
        - Standard @c std::string using implicit wxString::wxString(const
        std::string&) constructor. Notice that this constructor supposes that
        the string contains data in the current locale encoding, use FromUTF8()
        or the constructor taking wxMBConv if this is not the case.
        - Wide @c wchar_t* string using implicit wxString::wxString(const
        wchar_t*) constructor.
        - Standard @c std::wstring using implicit wxString::wxString(const
        std::wstring&) constructor.
    Notice that many of the constructors are implicit, meaning that you don't
    even need to write them at all to pass the existing string to some
    wxWidgets function taking a wxString.
    Similarly, wxString can be converted to:
        - ASCII string using wxString::ToAscii(). This is a potentially
        destructive operation as all non-ASCII string characters are replaced
        with a placeholder character.
        - String in the current locale encoding implicitly or using c_str() or
        mb_str() methods. This is a potentially destructive operation as an @e
        empty string is returned if the conversion fails.
        - String in UTF-8 encoding using wxString::utf8_str().
        - String in any given encoding using mb_str() with the appropriate
        wxMBConv object. This is also a potentially destructive operation.
        - Standard @c std::string using wxString::ToStdString(). The contents
        of the returned string use the current locale encoding, so this
        conversion is potentially destructive as well.
        - Wide C string using wxString::wc_str().
        - Standard @c std::wstring using wxString::ToStdWstring().
    @note If you built wxWidgets with @c wxUSE_STL set to 1, the implicit
        conversions to both narrow and wide C strings are disabled and replaced
        with implicit conversions to @c std::string and @c std::wstring.
    Please notice that the conversions marked as "potentially destructive"
    above can result in loss of data if their result is not checked, so you
    need to verify that converting the contents of a non-empty Unicode string
    to a non-UTF-8 multibyte encoding results in a non-empty string. The simplest
    and best way to ensure that the conversion never fails is to always use
    UTF-8.
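    To make the notion of a "potentially destructive" conversion concrete,
    here is a minimal sketch in standard C++ only. It imitates the placeholder
    replacement described for ToAscii() above; the helpers to_ascii_lossy()
    and is_ascii_only() are hypothetical names invented for this illustration
    and are not part of wxWidgets, nor do they show how wxString implements
    the conversion.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in, NOT the wxWidgets implementation: converts a wide
// string to 7-bit ASCII, replacing every non-ASCII character with '_'.
std::string to_ascii_lossy(const std::wstring& text)
{
    std::string result;
    result.reserve(text.size());
    for (wchar_t ch : text)
    {
        const unsigned long code = static_cast<unsigned long>(ch);
        result += code < 0x80 ? static_cast<char>(code) : '_';
    }
    return result;
}

// Hypothetical check: the conversion above is lossless only if no character
// needs to be replaced.
bool is_ascii_only(const std::wstring& text)
{
    for (wchar_t ch : text)
    {
        if (static_cast<unsigned long>(ch) >= 0x80)
            return false;
    }
    return true;
}
```

    Checking is_ascii_only() before converting, or comparing the result with
    the input afterwards, is one way of noticing that data would be lost.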
    @section string_gotchas Traps for the unwary
    As mentioned above, wxString tries to be compatible with both narrow and
    wide standard string classes and mostly does it transparently, but there
    are some exceptions.
    @subsection string_gotchas_element String element access
    Some problems are caused by wxString::operator[]() which returns an object
    of a special proxy class that allows assigning either a simple @c char or a @c
    wchar_t to the given index. Because of this, the return type of this
    operator is neither @c char nor @c wchar_t nor a reference to one of these
    types but wxUniCharRef which is not a primitive type and hence can't be
    used in the @c switch statement. So the following code does @e not compile
        @code
            wxString s(...);
            switch ( s[n] ) {
                case 'A':
                    ...
                    break;
            }
        @endcode
    and you need to use
        @code
            switch ( s[n].GetValue() ) {
                ...
            }
        @endcode
    instead. Alternatively, you can use an explicit cast:
        @code
            switch ( static_cast<char>(s[n]) ) {
                ...
            }
        @endcode
    but notice that this will result in an assert failure if the character at
    the given position is not representable as a single @c char in the current
    encoding, so you may want to cast to @c int instead if non-ASCII values can
    be used.
    Another consequence of this unusual return type arises when it is used with
    template deduction or C++11 @c auto keyword. Unlike with the normal
    references which are deduced to be of the referenced type, the deduced type
    for wxUniCharRef is wxUniCharRef itself. This results in potentially
    unexpected behaviour, for example:
        @code
            wxString s("abc");
            auto c = s[0];
            c = 'x';            // Modifies the string!
            wxASSERT( s == "xbc" );
        @endcode
    Due to this, either explicitly specify the variable type:
        @code
            int c = s[0];
            c = 'x';            // Doesn't modify the string any more.
            wxASSERT( s == "abc" );
        @endcode
    or explicitly convert the return value:
        @code
            auto c = s[0].GetValue();
            c = 'x';            // Doesn't modify the string either.
            wxASSERT( s == "abc" );
        @endcode
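    The deduction behaviour described above is not specific to wxWidgets and
    can be reproduced with any proxy-returning container. The following sketch
    uses only standard C++; MiniString and CharProxy are hypothetical classes
    invented for this illustration and are much simpler than wxString and
    wxUniCharRef.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical proxy: refers to one element of the owning string.
class CharProxy
{
public:
    CharProxy(std::string& owner, std::size_t pos)
        : m_owner(owner), m_pos(pos) {}

    // Assigning to the proxy writes through to the owning string.
    CharProxy& operator=(char ch) { m_owner[m_pos] = ch; return *this; }

    // Reading the proxy yields a plain value.
    operator char() const { return m_owner[m_pos]; }

private:
    std::string& m_owner;
    std::size_t m_pos;
};

// Hypothetical string class whose operator[] returns the proxy by value.
class MiniString
{
public:
    explicit MiniString(const char* text) : m_data(text) {}
    CharProxy operator[](std::size_t pos) { return CharProxy(m_data, pos); }
    const std::string& str() const { return m_data; }

private:
    std::string m_data;
};

std::string modify_through_auto()
{
    MiniString s("abc");
    auto c = s[0];      // deduced as CharProxy, not char
    c = 'x';            // writes through to s
    return s.str();
}

std::string modify_copy()
{
    MiniString s("abc");
    char c = s[0];      // explicit type: the proxy is converted to a value
    c = 'x';            // only the local copy changes
    (void)c;
    return s.str();
}
```

    In this sketch modify_through_auto() returns the modified string while
    modify_copy() returns the original one, mirroring the two cases above.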
    @subsection string_gotchas_conv Conversion to C string
    A different class of problems happens due to the dual nature of the return
    value of wxString::c_str() method, which is also used for implicit
    conversions. The result of calls to this method is convertible to either
    narrow @c char* string or wide @c wchar_t* string and so, again, has
    neither the former nor the latter type. Usually, the correct type will be
    chosen depending on how you use the result but sometimes the compiler can't
    choose it because of an ambiguity, e.g.:
        @code
            // Some non-wxWidgets functions existing for both narrow and wide
            // strings:
            void dump_text(const char* text);       // Version (1)
            void dump_text(const wchar_t* text);    // Version (2)
            wxString s(...);
            dump_text(s);           // ERROR: ambiguity.
            dump_text(s.c_str());   // ERROR: still ambiguous.
        @endcode
    In this case you need to explicitly convert to the type that you need to
    use or use a different, non-ambiguous, conversion function (which is
    usually the best choice):
        @code
            dump_text(static_cast<const char*>(s));            // OK, calls (1)
            dump_text(static_cast<const wchar_t*>(s.c_str())); // OK, calls (2)
            dump_text(s.mb_str());                             // OK, calls (1)
            dump_text(s.wc_str());                             // OK, calls (2)
            dump_text(s.wx_str());                             // OK, calls (2) by default, (1) in UTF-8 build
        @endcode
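    The ambiguity itself can be reproduced without wxWidgets. In the sketch
    below, DualCStr is a hypothetical type invented for this illustration: it
    is convertible to both kinds of C string, as the result of c_str() is,
    but is otherwise unrelated to the real wxWidgets classes. The overloads
    return an integer only so that the chosen version can be observed.

```cpp
#include <cassert>

// Hypothetical stand-in for an object convertible to both narrow and wide
// C strings.
struct DualCStr
{
    operator const char*() const { return "text"; }
    operator const wchar_t*() const { return L"text"; }
};

int dump_text(const char*)    { return 1; }   // Version (1)
int dump_text(const wchar_t*) { return 2; }   // Version (2)

int call_narrow(const DualCStr& s)
{
    // dump_text(s) would not compile here: both conversions are equally
    // good, so the overload set is ambiguous.
    return dump_text(static_cast<const char*>(s));
}

int call_wide(const DualCStr& s)
{
    return dump_text(static_cast<const wchar_t*>(s));
}
```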
    @subsection string_vararg Using wxString with vararg functions
    A special subclass of the problems arising due to the polymorphic nature of
    wxString::c_str() result type happens when using functions taking an
    arbitrary number of arguments, such as the standard @c printf(). Due to the
    rules of the C++ language, the types for the "variable" arguments of such
    functions are not specified and hence the compiler cannot convert wxString
    objects, or the objects returned by wxString::c_str(), to these unknown
    types automatically. Hence neither wxString objects nor the results of most
    of the conversion functions can be passed as vararg arguments:
        @code
            // ALL EXAMPLES HERE DO NOT WORK, DO NOT USE THEM!
            printf("Don't do this: %s", s);
            printf("Don't do that: %s", s.c_str());
            printf("Nor even this: %s", s.mb_str());
            wprintf(L"And even not always this: %s", s.wc_str());
        @endcode
    Instead you need to either explicitly cast to the needed type:
        @code
            // These examples work but are not the best solution, see below.
            printf("You can do this: %s", static_cast<const char*>(s));
            printf("Or this: %s", static_cast<const char*>(s.c_str()));
            printf("And this: %s", static_cast<const char*>(s.mb_str()));
            wprintf(L"Or this: %ls", static_cast<const wchar_t*>(s.wc_str()));
        @endcode
    But a better solution is to use wxWidgets-provided functions, if possible,
    as is the case for @c printf family of functions:
        @code
            // This is the recommended way.
            wxPrintf("You can do just this: %s", s);
            wxPrintf("And this (but it is redundant): %s", s.c_str());
            wxPrintf("And this (not using Unicode): %s", s.mb_str());
            wxPrintf("And this (always Unicode): %s", s.wc_str());
        @endcode
    Notice that wxPrintf() replaces both @c printf() and @c wprintf() and
    accepts wxString objects, results of c_str() calls but also @c char* and
    @c wchar_t* strings directly.
    wxWidgets provides wx-prefixed equivalents to all the standard vararg
    functions and a few more, notably wxString::Format(), wxLogMessage(),
    wxLogError() and other log functions. But if you can't use one of those
    functions and need to pass wxString objects to non-wx vararg functions, you
    need to use the explicit casts as explained above.
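    As a rough illustration of why the cast is needed with non-wx vararg
    functions, consider the following standard C++ sketch. The class Text is
    hypothetical and merely has an implicit conversion to @c const @c char*;
    it is not a wxWidgets class. format_greeting() is likewise an invented
    helper.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical wrapper with an implicit conversion to a C string.
class Text
{
public:
    explicit Text(const char* s) : m_data(s) {}
    operator const char*() const { return m_data.c_str(); }

private:
    std::string m_data;
};

std::string format_greeting(const Text& name)
{
    char buffer[64];
    // Passing `name` itself would hand a class object to "...", where the
    // compiler applies no user-defined conversion; the cast makes the
    // argument a real const char* matching "%s".
    std::snprintf(buffer, sizeof buffer, "Hello, %s!",
                  static_cast<const char*>(name));
    return buffer;
}
```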
    @section string_performance Performance characteristics
    wxString uses @c std::basic_string internally to store its content (unless
    this is not supported by the compiler or disabled specifically when
    building wxWidgets) and it therefore inherits many features from @c
    std::basic_string. In particular, most modern implementations of @c
    std::basic_string are thread-safe and don't use reference counting (making
    copying large strings potentially expensive) and so wxString has the same
    characteristics.
    By default, wxString uses @c std::basic_string specialized for the
    platform-dependent @c wchar_t type, meaning that it is not memory-efficient
    for ASCII strings, especially under Unix platforms where every ASCII
    character, normally fitting in a byte, is represented by a 4 byte @c
    wchar_t.
    It is possible to build wxWidgets with @c wxUSE_UNICODE_UTF8 set to 1 in
    which case a UTF-8-encoded string representation is stored in @c
    std::basic_string specialized for @c char, i.e. the usual @c std::string.
    In this case the memory efficiency problem mentioned above doesn't arise
    but run-time performance of many wxString methods changes dramatically, in
    particular accessing the N-th character of the string becomes an operation
    taking O(N) time instead of O(1), i.e. constant, time by default. Thus, if
    you do use this so-called UTF-8 build, you should avoid using indices to
    access the strings whenever possible and use the iterators instead. As an
    example, traversing the string using iterators is an O(N), where N is the
    string length, operation in both the normal ("wchar_t") and UTF-8 builds
    but doing it using indices becomes O(N^2) in UTF-8 case meaning that simply
    checking every character of a reasonably long (e.g. a couple of million
    elements) string can take an unreasonably long time.
    However, if you do use iterators, UTF-8 build can be a better choice than
    the default build, especially for the memory-constrained embedded systems.
    Notice also that GTK+ and DirectFB use UTF-8 internally, so using this
    build not only saves memory for ASCII strings but also avoids conversions
    between wxWidgets and the underlying toolkit.
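    The cost difference between indexed and sequential access to UTF-8 data
    can be sketched in standard C++ as follows. The functions below are
    hypothetical helpers written for this illustration, operating on a plain
    @c std::string assumed to hold valid UTF-8; they do not show how wxString
    is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for the first byte of a UTF-8 sequence (anything but 10xxxxxx).
bool is_lead_byte(unsigned char byte)
{
    return (byte & 0xC0) != 0x80;
}

// Byte offset of the n-th code point: has to scan from the start, so one
// lookup is O(n). Returns utf8.size() if there is no such code point.
std::size_t offset_of_code_point(const std::string& utf8, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        if (is_lead_byte(static_cast<unsigned char>(utf8[i])))
        {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return utf8.size();
}

// Visiting every code point in a single pass is O(N) for the whole string.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if (is_lead_byte(byte))
            ++count;
    }
    return count;
}
```

    Calling offset_of_code_point() for every index in a loop repeats the scan
    each time, which is where the quadratic behaviour comes from, whereas the
    single pass in count_code_points() corresponds to iterator-based
    traversal.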
    @section string_index Index of the member groups
@@ -497,9 +738,12 @@ public:
    /**
        Converts the string to an ASCII, 7-bit string in the form of
        a wxCharBuffer (Unicode builds only) or a C string (ANSI builds).
-        Note that this conversion only works if the string contains only ASCII
-        characters. The @ref mb_str() "mb_str" method provides more
-        powerful means of converting wxString to C string.
+
+        Note that this conversion is only lossless if the string contains only
+        ASCII characters as all the non-ASCII ones are replaced with the @c '_'
+        (underscore) character.
+
+        Use mb_str() or utf8_str() to convert to other encodings.
    */
    const char* ToAscii() const;