Implement wxRegEx using PCRE

Adjust the tests and document the incompatibilities with the previously used regex syntax. In this commit the use of PCRE is conditional on wxUSE_PCRE which is never defined as 1 yet, so the new code is still disabled.
2021-07-17 17:00:19 +02:00
parent 912f4b76ac
commit fa59d5700a
5 changed files with 809 additions and 42 deletions
--- a/interface/wx/regex.h
+++ b/interface/wx/regex.h
@@ -12,13 +12,31 @@
 */
 enum
 {
-    /** Use extended regex syntax. */
+    /**
+        Use extended regex syntax.
+
+        This is the default and doesn't need to be specified.
+     */
    wxRE_EXTENDED = 0,

-    /** Use advanced RE syntax (built-in regex only). */
+    /**
+        Use advanced regex syntax.
+
+        This flag is synonym for wxRE_EXTENDED and doesn't need to be specified
+        as this is the default syntax.
+     */
    wxRE_ADVANCED = 1,

-    /** Use basic RE syntax. */
+    /**
+        Use basic regex syntax.
+
+        Use basic regular expression syntax, close to its POSIX definition,
+        but with some extensions still available.
+
+        The word start/end boundary assertions @c "\<" and @c "\>" are only
+        available when using basic syntax, use @c "[[:<:]] and @c "[[:>:]]" or
+        just more general word boundary assertion @c "\b" when not using it.
+     */
    wxRE_BASIC    = 2,

    /** Ignore case in match. */
@@ -51,7 +69,19 @@ enum
    wxRE_NOTBOL = 32,

    /** '$' doesn't match at the end of line. */
-    wxRE_NOTEOL = 64
+    wxRE_NOTEOL = 64,
+
+    /**
+        Don't accept empty string as a valid match.
+
+        If the regex matches an empty string, try alternatives, if there are
+        any, or fail.
+
+        This flag is not supported if PCRE support is turned off.
+
+        @since 3.1.6
+     */
+    wxRE_NOTEMPTY = 128
 };

 /**
@@ -60,26 +90,19 @@ enum
    wxRegEx represents a regular expression.  This class provides support
    for regular expressions matching and also replacement.

-    It is built on top of either the system library (if it has support
-    for POSIX regular expressions - which is the case of the most modern
-    Unices) or uses the built in Henry Spencer's library.  Henry Spencer
-    would appreciate being given credit in the documentation of software
-    which uses his library, but that is not a requirement.
+    In wxWidgets 3.1.6 or later, it is built on top of PCRE library
+    (https://www.pcre.org/). In the previous versions of wxWidgets, this class
+    uses Henry Spencer's library and behaved slightly differently, see below
+    for the discussion of the changes if you're upgrading from an older
+    version.

-    Regular expressions, as defined by POSIX, come in two flavours: @e extended
-    and @e basic.  The builtin library also adds a third flavour
-    of expression @ref overview_resyntax "advanced", which is not available
-    when using the system library.
+    Note that while C++11 and later provides @c std::regex and related classes,
+    this class is still useful as it provides the following important
+    advantages:

-    Unicode is fully supported only when using the builtin library.
-    When using the system library in Unicode mode, the expressions and data
-    are translated to the default 8-bit encoding before being passed to
-    the library.
-
-    On platforms where a system library is available, the default is to use
-    the builtin library for Unicode builds, and the system library otherwise.
-    It is possible to use the other if preferred by selecting it when building
-    the wxWidgets.
+    - Support for richer regular expressions syntax.
+    - Much better performance in many common cases, by a factor of 10-100.
+    - Consistent behaviour, including performance, on all platforms.

    @library{wxbase}
    @category{data}
@@ -118,6 +141,57 @@ enum
    std::cout << "text now contains " << count << " hidden addresses" << std::endl;
    std::cout << originalText << std::endl;
    @endcode
+
+
+    @section regex_pcre_changes Changes in the PCRE-based version
+
+    This section describes the difference in regex syntax in the new PCRE-based
+    wxRegEx version compared to the previously used version which implemented
+    POSIX regex support.
+
+    The main change is that both extended (::wxRE_EXTENDED) and advanced
+    (::wxRE_ADVANCED) regex syntax is now the same as PCRE syntax described at
+    https://www.pcre.org/current/doc/html/pcre2syntax.html
+
+    Basic regular expressions (::wxRE_BASIC) are still different, but their
+    use is deprecated and PCRE extensions are still accepted in them, please
+    avoid using them.
+
+    Other changes are:
+
+    - Negated character classes, i.e. @c [^....], now always match newline
+      character, regardless of whether ::wxRE_NEWLINE was used or not. The dot
+      metacharacter still has the same meaning, i.e. it matches newline by
+      default but not when ::wxRE_NEWLINE is specified.
+
+    - Previously POSIX-specified behaviour of handling unmatched right
+      parenthesis @c ')' as a literal character was implemented, but now this
+      is a (regex) compilation error.
+
+    - Empty alternation branches were previously ignored, i.e. matching @c a||b
+      worked the same as matching just @c a|b, but now actually matches an
+      empty string. The new ::wxRE_NOTEMPTY flag can be used to disable empty
+      matches.
+
+    - Using @c \U to embed Unicode code points into the pattern is not
+      supported any more, use the still supported @c \u, followed by exactly
+      four hexadecimal digits, or @c \x, followed by exactly two hexadecimal
+      digits, instead.
+
+    - POSIX collating elements inside square brackets, i.e. @c [.XXX.] and
+      @c [:XXXX:] are not supported by PCRE and result in regex compilation
+      errors.
+
+    - Backslash can be used to escape the character following it even inside
+      square brackets now, while it loses its special meaning in POSIX regexes
+      when it occurs inside square brackets.
+
+    - For completeness, PCRE syntax which previously resulted in errors, e.g.
+      @c "(?:...)" and similar constructs, are now accepted and behave as
+      expected. Other regexes syntactically invalid according to POSIX are are
+      re-interpreted as sequences of literal characters with PCRE, e.g. @c "{1"
+      is just a sequence of two literal characters now, where it previously was
+      a compilation error.
 */
 class wxRegEx
 {