Implement wxRegEx using PCRE

Adjust the tests and document the incompatibilities with the previously
used regex syntax.

In this commit the use of PCRE is conditional on wxUSE_PCRE which is
never defined as 1 yet, so the new code is still disabled.
Vadim Zeitlin
2021-07-17 17:00:19 +02:00
parent 912f4b76ac
commit fa59d5700a
5 changed files with 809 additions and 42 deletions


@@ -12,13 +12,31 @@
*/
enum
{
/** Use extended regex syntax. */
/**
Use extended regex syntax.
This is the default and doesn't need to be specified.
*/
wxRE_EXTENDED = 0,
/** Use advanced RE syntax (built-in regex only). */
/**
Use advanced regex syntax.
This flag is a synonym for wxRE_EXTENDED and doesn't need to be specified
as this is the default syntax.
*/
wxRE_ADVANCED = 1,
/** Use basic RE syntax. */
/**
Use basic regex syntax.
Use basic regular expression syntax, close to its POSIX definition,
but with some extensions still available.
The word start/end boundary assertions @c "\<" and @c "\>" are only
available when using basic syntax, use @c "[[:<:]]" and @c "[[:>:]]" or
just the more general word boundary assertion @c "\b" when not using it.
*/
wxRE_BASIC = 2,
/** Ignore case in match. */
@@ -51,7 +69,19 @@ enum
wxRE_NOTBOL = 32,
/** '$' doesn't match at the end of line. */
wxRE_NOTEOL = 64
wxRE_NOTEOL = 64,
/**
Don't accept empty string as a valid match.
If the regex matches an empty string, try alternatives, if there are
any, or fail.
This flag is not supported if PCRE support is turned off.
@since 3.1.6
*/
wxRE_NOTEMPTY = 128
};
/**
@@ -60,26 +90,19 @@ enum
wxRegEx represents a regular expression. This class provides support
for regular expressions matching and also replacement.
It is built on top of either the system library (if it has support
for POSIX regular expressions - which is the case of the most modern
Unices) or uses the built in Henry Spencer's library. Henry Spencer
would appreciate being given credit in the documentation of software
which uses his library, but that is not a requirement.
In wxWidgets 3.1.6 or later, it is built on top of the PCRE library
(https://www.pcre.org/). In previous versions of wxWidgets, this class
used Henry Spencer's library and behaved slightly differently, see below
for the discussion of the changes if you're upgrading from an older
version.
Regular expressions, as defined by POSIX, come in two flavours: @e extended
and @e basic. The builtin library also adds a third flavour
of expression @ref overview_resyntax "advanced", which is not available
when using the system library.
Note that while C++11 and later provide @c std::regex and related classes,
this class is still useful as it provides the following important
advantages:
Unicode is fully supported only when using the builtin library.
When using the system library in Unicode mode, the expressions and data
are translated to the default 8-bit encoding before being passed to
the library.
On platforms where a system library is available, the default is to use
the builtin library for Unicode builds, and the system library otherwise.
It is possible to use the other if preferred by selecting it when building
the wxWidgets.
- Support for richer regular expression syntax.
- Much better performance in many common cases, by a factor of 10-100.
- Consistent behaviour, including performance, on all platforms.
@library{wxbase}
@category{data}
@@ -118,6 +141,57 @@ enum
std::cout << "text now contains " << count << " hidden addresses" << std::endl;
std::cout << originalText << std::endl;
@endcode
@section regex_pcre_changes Changes in the PCRE-based version
This section describes the difference in regex syntax in the new PCRE-based
wxRegEx version compared to the previously used version which implemented
POSIX regex support.
The main change is that both extended (::wxRE_EXTENDED) and advanced
(::wxRE_ADVANCED) regex syntax is now the same as PCRE syntax described at
https://www.pcre.org/current/doc/html/pcre2syntax.html
Basic regular expressions (::wxRE_BASIC) still use a different syntax,
although PCRE extensions are accepted in them too. Their use is
deprecated, so please avoid them.
Other changes are:
- Negated character classes, i.e. @c [^....], now always match the newline
character, regardless of whether ::wxRE_NEWLINE was used or not. The dot
metacharacter still has the same meaning, i.e. it matches newline by
default but not when ::wxRE_NEWLINE is specified.
- An unmatched right parenthesis @c ')' was previously treated as a
literal character, as specified by POSIX, but now results in a regex
compilation error.
- Empty alternation branches were previously ignored, i.e. matching @c a||b
worked the same as matching just @c a|b, but now an empty branch actually
matches an empty string. The new ::wxRE_NOTEMPTY flag can be used to
disable empty matches.
- Using @c \U to embed Unicode code points into the pattern is not
supported any more, use the still supported @c \u, followed by exactly
four hexadecimal digits, or @c \x, followed by exactly two hexadecimal
digits, instead.
- POSIX collating elements inside square brackets, i.e. @c [.XXX.] and
@c [=XXX=], are not supported by PCRE and result in regex compilation
errors.
- Backslash can be used to escape the character following it even inside
square brackets now, while it loses its special meaning in POSIX regexes
when it occurs inside square brackets.
- For completeness, PCRE syntax which previously resulted in errors, e.g.
@c "(?:...)" and similar constructs, is now accepted and behaves as
expected. Other regexes syntactically invalid according to POSIX are
re-interpreted as sequences of literal characters with PCRE, e.g. @c "{1"
is just a sequence of two literal characters now, where it previously was
a compilation error.
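
When updating existing patterns, two of the mechanical rewrites mentioned
in this documentation can be sketched as plain string helpers. The code
below is only an illustration and is not part of wxWidgets: both function
names are hypothetical, it uses only the standard library, and it does not
parse bracket expressions, so a backslash inside @c [...] is rewritten like
any other. Note also that @c "\b" is more general than @c "\<" and @c "\>",
so the rewritten pattern is not strictly equivalent to the original one.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a wxWidgets function): spell a code point using
// the escapes described above as still supported, i.e. \x followed by
// exactly two hexadecimal digits or \u followed by exactly four.
std::string EscapeCodePoint(std::uint32_t cp)
{
    char buf[8];
    if (cp <= 0xFF)
        std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(cp));
    else if (cp <= 0xFFFF)
        std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
    else
        throw std::out_of_range("code point does not fit in four hex digits");
    return buf;
}

// Hypothetical helper (not a wxWidgets function): replace the assertions
// \< and \>, available only with basic syntax, with the more general \b.
// Every other escape, including an escaped backslash, is copied unchanged.
std::string ReplaceWordAssertions(const std::string& pattern)
{
    std::string out;
    out.reserve(pattern.size());
    for (std::size_t i = 0; i < pattern.size(); ++i)
    {
        const char c = pattern[i];
        if (c == '\\' && i + 1 < pattern.size())
        {
            // Consume the escaped character together with the backslash so
            // that the "<" in "\\<" (escaped backslash, then "<") is kept.
            const char next = pattern[++i];
            if (next == '<' || next == '>')
            {
                out += "\\b";
            }
            else
            {
                out += c;
                out += next;
            }
        }
        else
        {
            out += c;
        }
    }
    return out;
}
```

For example, @c EscapeCodePoint(0x20AC) returns the six characters
@c \u20AC, and @c ReplaceWordAssertions applied to the pattern @c \<foo\>
returns @c \bfoo\b. The results should still be checked by compiling them
with wxRegEx, as this sketch knows nothing about the rest of the syntax.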
*/
class wxRegEx
{