diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst --- a/Doc/howto/regex.rst +++ b/Doc/howto/regex.rst @@ -104,13 +104,25 @@ or ``\``, you can precede them with a backslash to remove their special meaning: ``\[`` or ``\\``. -Some of the special sequences beginning with ``'\'`` represent predefined sets -of characters that are often useful, such as the set of digits, the set of -letters, or the set of anything that isn't whitespace. The following predefined -special sequences are a subset of those available. The equivalent classes are -for bytes patterns. For a complete list of sequences and expanded class -definitions for Unicode string patterns, see the last part of -:ref:`Regular Expression Syntax `. +Some of the special sequences beginning with ``'\'`` represent +predefined sets of characters that are often useful, such as the set +of digits, the set of letters, or the set of anything that isn't +whitespace. + +Let's take an example: ``\w`` matches any alphanumeric character. If +the regex pattern is expressed in bytes, this is equivalent to the +class ``[a-zA-Z0-9_]``. If the regex pattern is a string, ``\w`` will +match all the characters marked as letters in the Unicode database +provided by the :mod:`unicodedata` module. You can use the more +restricted definition of ``\w`` in a string pattern by supplying the +:const:`re.ASCII` flag when compiling the regular expression. + +The following list of special sequences isn't complete. For a complete +list of sequences and expanded class definitions for Unicode string +patterns, see the last part of :ref:`Regular Expression Syntax +` in the Standard Library reference. In general, the +Unicode versions match any character that's in the appropriate +category in the Unicode database. ``\d`` Matches any decimal digit; this is equivalent to the class ``[0-9]``. @@ -160,9 +172,8 @@ For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``), ``caaat`` (3 ``a`` characters), and so forth. 
The RE engine has various internal limitations stemming from the size of C's ``int`` type that will -prevent it from matching over 2 billion ``a`` characters; you probably don't -have enough memory to construct a string that large, so you shouldn't run into -that limit. +prevent it from matching over 2 billion ``a`` characters; patterns +are usually not written to match that much data. Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching engine will try to repeat it as many times as possible. If later portions of the @@ -495,17 +506,13 @@ more convenient. If a program contains a lot of regular expressions, or re-uses the same ones in several locations, then it might be worthwhile to collect all the definitions in one place, in a section of code that compiles all the REs -ahead of time. To take an example from the standard library, here's an extract -from the now-defunct Python 2 standard :mod:`xmllib` module:: +ahead of time. For example, here's an excerpt from a pure-Python XML parser:: ref = re.compile( ... ) entityref = re.compile( ... ) charref = re.compile( ... ) starttagopen = re.compile( ... ) -I generally prefer to work with the compiled object, even for one-time uses, but -few people will be as much of a purist about this as I am. - Compilation Flags ----------------- @@ -524,6 +531,10 @@ +---------------------------------+--------------------------------------------+ | Flag | Meaning | +=================================+============================================+ +| :const:`ASCII`, :const:`A` | Makes several escapes like ``\w``, ``\b``, | +| | ``\s`` and ``\d`` match only on ASCII | +| | characters with the respective property. 
| ++---------------------------------+--------------------------------------------+ | :const:`DOTALL`, :const:`S` | Make ``.`` match any character, including | | | newlines | +---------------------------------+--------------------------------------------+ @@ -535,11 +546,7 @@ | | ``$`` | +---------------------------------+--------------------------------------------+ | :const:`VERBOSE`, :const:`X` | Enable verbose REs, which can be organized | -| | more cleanly and understandably. | -+---------------------------------+--------------------------------------------+ -| :const:`ASCII`, :const:`A` | Makes several escapes like ``\w``, ``\b``, | -| | ``\s`` and ``\d`` match only on ASCII | -| | characters with the respective property. | +| (for 'extended') | more cleanly and understandably. | +---------------------------------+--------------------------------------------+ @@ -558,7 +565,8 @@ LOCALE :noindex: - Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale. + Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale + instead of the Unicode database. Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you're processing French @@ -851,11 +859,10 @@ problem. Both of them use a common syntax for regular expression extensions, so we'll look at that first. -Perl 5 added several additional features to standard regular expressions, and -the Python :mod:`re` module supports most of them. It would have been -difficult to choose new single-keystroke metacharacters or new special sequences -beginning with ``\`` to represent the new features without making Perl's regular -expressions confusingly different from standard REs. If you chose ``&`` as a +Perl 5 is well-known for its powerful additions to standard regular expressions. 
+For these new features the Perl developers couldn't choose new single-keystroke metacharacters +or new special sequences beginning with ``\`` without making Perl's regular +expressions confusingly different from standard REs. If they chose ``&`` as a new metacharacter, for example, old expressions would be assuming that ``&`` was a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``. @@ -867,22 +874,15 @@ assertion) and ``(?:foo)`` is something else (a non-capturing group containing the subexpression ``foo``). -Python adds an extension syntax to Perl's extension syntax. If the first -character after the question mark is a ``P``, you know that it's an extension -that's specific to Python. Currently there are two such extensions: -``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to -a named group. If future versions of Perl 5 add similar features using a -different syntax, the :mod:`re` module will be changed to support the new -syntax, while preserving the Python-specific syntax for compatibility's sake. +Python supports several of Perl's extensions and adds an extension +syntax to Perl's extension syntax. If the first character after the +question mark is a ``P``, you know that it's an extension that's +specific to Python. -Now that we've looked at the general extension syntax, we can return to the -features that simplify working with groups in complex REs. Since groups are -numbered from left to right and a complex expression may use many groups, it can -become difficult to keep track of the correct numbering. Modifying such a -complex RE is annoying, too: insert a new group near the beginning and you -change the numbers of everything that follows it. +Now that we've looked at the general extension syntax, we can return +to the features that simplify working with groups in complex REs. 
-Sometimes you'll want to use a group to collect a part of a regular expression, +Sometimes you'll want to use a group to denote a part of a regular expression, but aren't interested in retrieving the group's contents. You can make this fact explicit by using a non-capturing group: ``(?:...)``, where you can replace the ``...`` with any other regular expression. :: @@ -908,7 +908,7 @@ The syntax for a named group is one of the Python-specific extensions: ``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups -also behave exactly like capturing groups, and additionally associate a name +behave exactly like capturing groups, and additionally associate a name with a group. The :ref:`match object ` methods that deal with capturing groups all accept either integers that refer to the group by number or strings that contain the desired group's name. Named groups are still @@ -975,9 +975,10 @@ ``.*[.].*$`` Notice that the ``.`` needs to be treated specially because it's a -metacharacter; I've put it inside a character class. Also notice the trailing -``$``; this is added to ensure that all the rest of the string must be included -in the extension. This regular expression matches ``foo.bar`` and +metacharacter, so it's inside a character class to only match that +specific character. Also notice the trailing ``$``; this is added to +ensure that all the rest of the string must be included in the +extension. This regular expression matches ``foo.bar`` and ``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``. Now, consider complicating the problem a bit; what if you want to match @@ -1051,7 +1052,7 @@ The :meth:`split` method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. 
It's similar to the :meth:`split` method of strings but provides much more generality in the -delimiters that you can split by; :meth:`split` only supports splitting by +delimiters that you can split by; string :meth:`split` only supports splitting by whitespace or by a fixed string. As you'd expect, there's a module-level :func:`re.split` function, too. @@ -1106,7 +1107,6 @@ with a different string. The :meth:`sub` method takes a replacement value, which can be either a string or a function, and the string to be processed. - .. method:: .sub(replacement, string[, count=0]) :noindex: @@ -1362,4 +1362,3 @@ reference for programming in Python. (The first edition covered Python's now-removed :mod:`regex` module, which won't help you much.) Consider checking it out from your library. -