classification
Title: string.Template should use re.ASCII flag
Type: behavior Stage: patch review
Components: Regular Expressions Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, ezio.melotti, inada.naoki, mrabarnett, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2017-10-03 11:15 by inada.naoki, last changed 2017-10-14 12:22 by inada.naoki.

Pull Requests
URL Status Linked Edit
PR 3872 merged inada.naoki, 2017-10-03 16:27
PR 3982 merged inada.naoki, 2017-10-13 07:33
PR 3990 merged inada.naoki, 2017-10-14 05:45
Messages (17)
msg303596 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-03 11:15
Currently, strings.Template uses re.IGNORECASE without re.ASCII:

    idpattern = r'[_a-z][_a-z0-9]*'
    flags = _re.IGNORECASE

[a-z] matches against 'ı' (0x131, DOTLESS I) and 'ſ' (0x17f, LONG S).
It is not intentional, and it makes re.compile slower.
msg303618 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-03 13:51
Technically it *is* an API change since `flags` is a part of the public API.  The documentation says:

    $identifier names a substitution placeholder matching a mapping key of 
    "identifier". By default, "identifier" is restricted to any case-
    insensitive ASCII alphanumeric string (including underscores) that starts 
    with an underscore or ASCII letter. The first non-identifier character 
    after the $ character terminates this placeholder specification.

This means if someone does subclass string.Template and changes the pattern to accept Unicode identifiers, then with this change they will also have to modify flags, whereas before they didn't.

It really wasn't ever the intention to allow non-ASCII identifiers, so this is probably safe in practice.  OTOH, making the change for performance reasons might be questionable, given that the regular expressions are compiled by the Template's metaclass, so unlikely to contribute significantly to overall performance wins.
msg303627 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-03 16:31
> This means if someone does subclass string.Template and changes the pattern to accept Unicode identifiers, then with this change they will also have to modify flags, whereas before they didn't.

Thank you for pointing it out.
I removed re.A flag after compilation.

> OTOH, making the change for performance reasons might be questionable, given that the regular expressions are compiled by the Template's metaclass, so unlikely to contribute significantly to overall performance wins.

original: import time:      2310 |       9589 | string
patched:  import time:      1420 |       8702 | string

We can save about 900 us.
msg303670 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-10-04 05:34
When optimizing, please don't make API changes.
msg303671 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-04 05:55
Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.
msg303673 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-04 06:24
> When optimizing, please don't make API changes.

This is not only optimization, but bugfix.

Document of string.Template says:

> By default, "identifier" is restricted to any case-insensitive ASCII  alphanumeric string (including underscores) that starts with an underscore or ASCII letter.

So, missing re.ASCII flag is bug because non-ASCII alphabet can be matched.
msg303674 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-04 06:29
> Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.

https://docs.python.org/3.6/library/re.html#regular-expression-syntax says
`(?imsx-imsx:...)`

Anyway, Template.idpattern is documented public API too.
msg303675 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-04 06:37
Current pull request override `Template.flags = re.I` after class creation for backward compatibility, without any API change.

But I'm not sure it's right approach.
How many people who subclass string.Template expect non-ASCII match?
If this change is bugfix for 99% of subclasses, why should we keep pitfall?
msg303692 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-04 13:57
On Oct 4, 2017, at 02:29, INADA Naoki <report@bugs.python.org> wrote:
> 
> INADA Naoki <songofacandy@gmail.com> added the comment:
> 
>> Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.
> 
> https://docs.python.org/3.6/library/re.html#regular-expression-syntax says
> `(?imsx-imsx:...)`
> 
> Anyway, Template.idpattern is documented public API too.

Too bad, because I like that approach.  How hard would it be to add support for ‘a’ as a local flag?  (I’m kind of surprised that it isn’t already supported - it seems like it would be useful.)

I would like this better than hacking Template.flags after the fact.  It seems like the right way to align the code with the documentation (i.e. fix a bug).
msg303694 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-04 14:05
See issue31690. But this solution can be used only in 3.7.
msg303696 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-04 14:08
On Oct 4, 2017, at 10:05, Serhiy Storchaka <report@bugs.python.org> wrote:
> 
> See issue31690. But this solution can be used only in 3.7.

That’s fine.  I don’t think this is important enough to backport.
msg303844 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-06 19:04
Another solution (works in 3.6 too): set idpattern to r'(?-i:[_a-zA-Z][_a-zA-Z0-9]*)'.

This looks pretty weird, setting the re.IGNORECASE flag and then unsetting it. But it works, and don't break the user code that changes idpattern without changing flags.
msg304052 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-10 15:25
On Oct 6, 2017, at 15:04, Serhiy Storchaka <report@bugs.python.org> wrote:
> Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:
> 
> Another solution (works in 3.6 too): set idpattern to r'(?-i:[_a-zA-Z][_a-zA-Z0-9]*)'.
> 
> This looks pretty weird, setting the re.IGNORECASE flag and then unsetting it. But it works, and don't break the user code that changes idpattern without changing flags.

Oh, I think I like that :)
msg304053 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-10 15:25
With some comments to clarify of course.
msg304313 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-13 07:02
New changeset b22273ec5d1992b0cbe078b887427ae9977dfb78 by INADA Naoki in branch 'master':
bpo-31672: Fix string.Template accidentally matched non-ASCII identifiers (GH-3872)
https://github.com/python/cpython/commit/b22273ec5d1992b0cbe078b887427ae9977dfb78
msg304376 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-14 05:22
New changeset 7060380d577690a40ebc201c0725076349e977cd by INADA Naoki in branch '3.6':
bpo-31672: Fix string.Template accidentally matched non-ASCII identifiers (GH-3872)
https://github.com/python/cpython/commit/7060380d577690a40ebc201c0725076349e977cd
msg304392 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-14 12:22
New changeset 073150db39408c1800e4b9e895ad0b0e195f1056 by INADA Naoki in branch 'master':
bpo-31672: doc: Remove one sentence from library/string.rst (GH-3990)
https://github.com/python/cpython/commit/073150db39408c1800e4b9e895ad0b0e195f1056
History
Date User Action Args
2017-10-14 12:22:45inada.naokisetmessages: + msg304392
2017-10-14 05:45:03inada.naokisetpull_requests: + pull_request3966
2017-10-14 05:22:05inada.naokisetmessages: + msg304376
2017-10-13 07:33:11inada.naokisetpull_requests: + pull_request3958
2017-10-13 07:02:27inada.naokisetmessages: + msg304313
2017-10-10 15:25:52barrysetmessages: + msg304053
2017-10-10 15:25:28barrysetmessages: + msg304052
2017-10-06 19:04:41serhiy.storchakasetmessages: + msg303844
2017-10-04 14:08:51barrysetmessages: + msg303696
2017-10-04 14:05:39serhiy.storchakasetmessages: + msg303694
2017-10-04 13:57:11barrysetmessages: + msg303692
2017-10-04 06:37:11inada.naokisetmessages: + msg303675
2017-10-04 06:29:15inada.naokisetmessages: + msg303674
2017-10-04 06:24:18inada.naokisetmessages: + msg303673
2017-10-04 05:55:22serhiy.storchakasetmessages: + msg303671
2017-10-04 05:34:38rhettingersetnosy: + rhettinger
messages: + msg303670
2017-10-03 16:32:21inada.naokisettype: behavior
stage: patch review
2017-10-03 16:31:54inada.naokisetmessages: + msg303627
stage: patch review -> (no value)
2017-10-03 16:27:54inada.naokisetkeywords: + patch
stage: patch review
pull_requests: + pull_request3851
2017-10-03 13:51:50barrysetmessages: + msg303618
2017-10-03 12:54:43serhiy.storchakasetnosy: + barry, serhiy.storchaka
2017-10-03 12:50:19inada.naokilinkissue31669 superseder
2017-10-03 11:15:40inada.naokicreate