This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: string.Template should use re.ASCII flag
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: barry, ezio.melotti, methane, mrabarnett, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2017-10-03 11:15 by methane, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 3872 merged methane, 2017-10-03 16:27
PR 3982 merged methane, 2017-10-13 07:33
PR 3990 merged methane, 2017-10-14 05:45
PR 4483 merged barry, 2017-11-21 01:12
PR 5099 merged serhiy.storchaka, 2018-01-04 14:35
Messages (23)
msg303596 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-10-03 11:15
Currently, strings.Template uses re.IGNORECASE without re.ASCII:

    idpattern = r'[_a-z][_a-z0-9]*'
    flags = _re.IGNORECASE

[a-z] matches against 'ı' (0x131, DOTLESS I) and 'ſ' (0x17f, LONG S).
It is not intentional, and it makes re.compile slower.
msg303618 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-03 13:51
Technically it *is* an API change since `flags` is a part of the public API.  The documentation says:

    $identifier names a substitution placeholder matching a mapping key of 
    "identifier". By default, "identifier" is restricted to any case-
    insensitive ASCII alphanumeric string (including underscores) that starts 
    with an underscore or ASCII letter. The first non-identifier character 
    after the $ character terminates this placeholder specification.

This means if someone does subclass string.Template and changes the pattern to accept Unicode identifiers, then with this change they will also have to modify flags, whereas before they didn't.

It really wasn't ever the intention to allow non-ASCII identifiers, so this is probably safe in practice.  OTOH, making the change for performance reasons might be questionable, given that the regular expressions are compiled by the Template's metaclass, so unlikely to contribute significantly to overall performance wins.
msg303627 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-10-03 16:31
> This means if someone does subclass string.Template and changes the pattern to accept Unicode identifiers, then with this change they will also have to modify flags, whereas before they didn't.

Thank you for pointing it out.
I removed re.A flag after compilation.

> OTOH, making the change for performance reasons might be questionable, given that the regular expressions are compiled by the Template's metaclass, so unlikely to contribute significantly to overall performance wins.

original: import time:      2310 |       9589 | string
patched:  import time:      1420 |       8702 | string

We can save about 900 us.
msg303670 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-10-04 05:34
When optimizing, please don't make API changes.
msg303671 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-04 05:55
Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.
msg303673 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-10-04 06:24
> When optimizing, please don't make API changes.

This is not only optimization, but bugfix.

Document of string.Template says:

> By default, "identifier" is restricted to any case-insensitive ASCII  alphanumeric string (including underscores) that starts with an underscore or ASCII letter.

So, missing re.ASCII flag is bug because non-ASCII alphabet can be matched.
msg303674 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-10-04 06:29
> Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.

https://docs.python.org/3.6/library/re.html#regular-expression-syntax says
`(?imsx-imsx:...)`

Anyway, Template.idpattern is documented public API too.
msg303675 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-10-04 06:37
Current pull request override `Template.flags = re.I` after class creation for backward compatibility, without any API change.

But I'm not sure it's right approach.
How many people who subclass string.Template expect non-ASCII match?
If this change is bugfix for 99% of subclasses, why should we keep pitfall?
msg303692 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-04 13:57
On Oct 4, 2017, at 02:29, INADA Naoki <report@bugs.python.org> wrote:
> 
> INADA Naoki <songofacandy@gmail.com> added the comment:
> 
>> Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.
> 
> https://docs.python.org/3.6/library/re.html#regular-expression-syntax says
> `(?imsx-imsx:...)`
> 
> Anyway, Template.idpattern is documented public API too.

Too bad, because I like that approach.  How hard would it be to add support for ‘a’ as a local flag?  (I’m kind of surprised that it isn’t already supported - it seems like it would be useful.)

I would like this better than hacking Template.flags after the fact.  It seems like the right way to align the code with the documentation (i.e. fix a bug).
msg303694 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-04 14:05
See issue31690. But this solution can be used only in 3.7.
msg303696 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-04 14:08
On Oct 4, 2017, at 10:05, Serhiy Storchaka <report@bugs.python.org> wrote:
> 
> See issue31690. But this solution can be used only in 3.7.

That’s fine.  I don’t think this is important enough to backport.
msg303844 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-06 19:04
Another solution (works in 3.6 too): set idpattern to r'(?-i:[_a-zA-Z][_a-zA-Z0-9]*)'.

This looks pretty weird, setting the re.IGNORECASE flag and then unsetting it. But it works, and don't break the user code that changes idpattern without changing flags.
msg304052 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-10 15:25
On Oct 6, 2017, at 15:04, Serhiy Storchaka <report@bugs.python.org> wrote:
> Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:
> 
> Another solution (works in 3.6 too): set idpattern to r'(?-i:[_a-zA-Z][_a-zA-Z0-9]*)'.
> 
> This looks pretty weird, setting the re.IGNORECASE flag and then unsetting it. But it works, and don't break the user code that changes idpattern without changing flags.

Oh, I think I like that :)
msg304053 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-10 15:25
With some comments to clarify of course.
msg304313 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-10-13 07:02
New changeset b22273ec5d1992b0cbe078b887427ae9977dfb78 by INADA Naoki in branch 'master':
bpo-31672: Fix string.Template accidentally matched non-ASCII identifiers (GH-3872)
https://github.com/python/cpython/commit/b22273ec5d1992b0cbe078b887427ae9977dfb78
msg304376 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-10-14 05:22
New changeset 7060380d577690a40ebc201c0725076349e977cd by INADA Naoki in branch '3.6':
bpo-31672: Fix string.Template accidentally matched non-ASCII identifiers (GH-3872)
https://github.com/python/cpython/commit/7060380d577690a40ebc201c0725076349e977cd
msg304392 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-10-14 12:22
New changeset 073150db39408c1800e4b9e895ad0b0e195f1056 by INADA Naoki in branch 'master':
bpo-31672: doc: Remove one sentence from library/string.rst (GH-3990)
https://github.com/python/cpython/commit/073150db39408c1800e4b9e895ad0b0e195f1056
msg306350 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-16 10:40
Is something left to do here?

In 3.7 r'(?-i:[_a-zA-Z][_a-zA-Z0-9]*)' can be replaced with r'(?a:[_a-z][_a-z0-9]*)' if it will add clarity.
msg306607 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-11-21 01:17
@inada.naoki - I think changing this to ?a will make things more clear.  I submitted a PR for that, and then once it lands, we can close this issue.
msg306651 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-11-21 15:28
New changeset e256b408889eba867e1d90e5e1a0904843256255 by Barry Warsaw in branch 'master':
bpo-31672 - Add one last minor clarification for idpattern (#4483)
https://github.com/python/cpython/commit/e256b408889eba867e1d90e5e1a0904843256255
msg309468 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-04 14:39
PR 5099 reverts in 3.7 a subtle behavior change made by the fix for this issue. Before merging the fix setting flags to 0 in a Template subclass made the default pattern matching only lower case letters. Currently setting flags to 0 doesn't have any effect.
msg309472 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-04 17:20
New changeset 87be28f4a1c5b76926c71a3d9f92503f9eb82d51 by Serhiy Storchaka in branch 'master':
bpo-31672: Restore the former behavior when override flags in Template. (#5099)
https://github.com/python/cpython/commit/87be28f4a1c5b76926c71a3d9f92503f9eb82d51
msg309475 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-04 17:40
Results.

There are 4 cases:

1) default idpattern and flags
2) overridden only idpattern
3) overridden only flags
4) overridden both idpattern and flags

The case 1 was the one that was broken when this issue was opened. The initial Inada's version fixed the case 1, but broke the case 2. His final version (applied also to 3.6) fixed the case 1, but broke the case 3. This is a win, because cases 1 and 2 look much more common than the case 3. And finally PR 5099 has fixed also the case 3. The case 4 obviously is not affected by any changes of default values.

Now all four cases are correct in 3.7 and the only broken case in 3.6 is the uncommon case 3.
History
Date User Action Args
2022-04-11 14:58:53adminsetgithub: 75853
2018-01-04 17:40:09serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg309475

stage: resolved
2018-01-04 17:20:14serhiy.storchakasetmessages: + msg309472
2018-01-04 14:39:54serhiy.storchakasetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg309468

stage: resolved -> (no value)
2018-01-04 14:35:12serhiy.storchakasetpull_requests: + pull_request4967
2017-11-21 15:28:31barrysetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2017-11-21 15:28:16barrysetmessages: + msg306651
2017-11-21 01:17:16barrysetmessages: + msg306607
2017-11-21 01:12:45barrysetpull_requests: + pull_request4420
2017-11-16 10:40:13serhiy.storchakasetmessages: + msg306350
2017-10-14 12:22:45methanesetmessages: + msg304392
2017-10-14 05:45:03methanesetpull_requests: + pull_request3966
2017-10-14 05:22:05methanesetmessages: + msg304376
2017-10-13 07:33:11methanesetpull_requests: + pull_request3958
2017-10-13 07:02:27methanesetmessages: + msg304313
2017-10-10 15:25:52barrysetmessages: + msg304053
2017-10-10 15:25:28barrysetmessages: + msg304052
2017-10-06 19:04:41serhiy.storchakasetmessages: + msg303844
2017-10-04 14:08:51barrysetmessages: + msg303696
2017-10-04 14:05:39serhiy.storchakasetmessages: + msg303694
2017-10-04 13:57:11barrysetmessages: + msg303692
2017-10-04 06:37:11methanesetmessages: + msg303675
2017-10-04 06:29:15methanesetmessages: + msg303674
2017-10-04 06:24:18methanesetmessages: + msg303673
2017-10-04 05:55:22serhiy.storchakasetmessages: + msg303671
2017-10-04 05:34:38rhettingersetnosy: + rhettinger
messages: + msg303670
2017-10-03 16:32:21methanesettype: behavior
stage: patch review
2017-10-03 16:31:54methanesetmessages: + msg303627
stage: patch review -> (no value)
2017-10-03 16:27:54methanesetkeywords: + patch
stage: patch review
pull_requests: + pull_request3851
2017-10-03 13:51:50barrysetmessages: + msg303618
2017-10-03 12:54:43serhiy.storchakasetnosy: + barry, serhiy.storchaka
2017-10-03 12:50:19methanelinkissue31669 superseder
2017-10-03 11:15:40methanecreate