Message 321963 - Python tracker

Author	zwol
Recipients	docs@python, ezio.melotti, mrabarnett, zwol
Date	2018-07-19.19:03:20
Message-id	<1532027000.32.0.56676864532.issue34156@psf.upfronthosting.co.za>

Content
The documentation of the semantics of range expressions in regular expression character classes is not precise enough. All it says is:

    Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter [... more examples, none involving non-ASCII characters]

In testing, the behavior appears to be simply to expand the range to a set of characters by numeric code point, e.g. '[ᄀ-ፚ]' will match any single character whose ord() is between ord('ᄀ') and ord('ፚ') (inclusive). If that is the intended behavior, I would like the documentation to say so explicitly. If that is _not_ the intended behavior, I would like to know what the intended behavior actually is, and for both the code and the documentation to be changed to reflect the intent.
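A minimal sketch of the kind of test described above, for anyone who wants to reproduce the observation. The helper name in_range_by_code_point and the choice of scan window are illustrative assumptions, not part of the re module or of the original report; the endpoints are the two characters from the example, written as escapes.

```python
import re

# Endpoints of the example range, written as escapes so the source stays ASCII.
lo, hi = "\u1100", "\u135a"  # 'ᄀ' and 'ፚ'
pattern = re.compile(f"[{lo}-{hi}]")


def in_range_by_code_point(ch):
    """Hypothetical reference model: membership decided purely by ord()."""
    return ord(lo) <= ord(ch) <= ord(hi)


# Scan a window of code points around the range (window chosen arbitrarily)
# and collect any code point where the regex and the ord() model disagree.
mismatches = [
    cp
    for cp in range(0x1000, 0x1500)
    if bool(pattern.fullmatch(chr(cp))) != in_range_by_code_point(chr(cp))
]

print(mismatches)
```

An empty list here is consistent with expansion by code point, but it only describes what the implementation does on the window tested, not what is documented or guaranteed.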

(I think expansion by numeric code point makes sense and is probably what most existing programs want, but this is a contended issue in the context of POSIX regular expressions: some C libraries try (not always successfully) to make [0-9] match all of the characters that Python's \d matches, so the right behavior is not "obvious".)
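To make the [0-9] versus \d distinction concrete on the Python side, here is a small sketch. The sample character (ARABIC-INDIC DIGIT THREE) and the variable names are arbitrary choices for illustration; the sketch assumes str patterns, where \d matches Unicode decimal digits unless re.ASCII is given.

```python
import re

# A decimal digit outside ASCII (Unicode category Nd), chosen arbitrarily.
arabic_indic_three = "\u0663"

# For str patterns, \d matches any Unicode decimal digit by default.
matches_backslash_d = re.fullmatch(r"\d", arabic_indic_three) is not None

# The explicit range only covers the code points from '0' through '9'.
matches_ascii_range = re.fullmatch(r"[0-9]", arabic_indic_three) is not None

# With re.ASCII, \d is restricted to the same ten ASCII digits.
matches_ascii_flag = re.fullmatch(r"\d", arabic_indic_three, re.ASCII) is not None

print(matches_backslash_d, matches_ascii_range, matches_ascii_flag)
```

So in Python the two spellings are not interchangeable for str patterns, which is the reverse of what those C libraries attempt.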
