Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

string.Template should use re.ASCII flag #75853

Closed
methane opened this issue Oct 3, 2017 · 23 comments
Closed

string.Template should use re.ASCII flag #75853

methane opened this issue Oct 3, 2017 · 23 comments
Labels
3.7 (EOL) end of life topic-regex type-bug An unexpected behavior, bug, or error

Comments

@methane
Copy link
Member

methane commented Oct 3, 2017

BPO 31672
Nosy @warsaw, @rhettinger, @ezio-melotti, @methane, @serhiy-storchaka
PRs
  • bpo-31672: string: Use re.A | re.I flag for identifier pattern #3872
  • [3.6] bpo-31672: Fix string.Template accidentally matched non-ASCII identifiers (GH-3872) #3982
  • bpo-31672: doc: Remove one sentence from library/string.rst #3990
  • bpo-31672 - Add one last minor clarification for idpattern #4483
  • bpo-31672: Restore the former behavior whe override flags in Template. #5099
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-01-04.17:40:09.131>
    created_at = <Date 2017-10-03.11:15:40.009>
    labels = ['expert-regex', 'type-bug', '3.7']
    title = 'string.Template should use re.ASCII flag'
    updated_at = <Date 2018-01-04.17:40:09.130>
    user = 'https://github.com/methane'

    bugs.python.org fields:

    activity = <Date 2018-01-04.17:40:09.130>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-01-04.17:40:09.131>
    closer = 'serhiy.storchaka'
    components = ['Regular Expressions']
    creation = <Date 2017-10-03.11:15:40.009>
    creator = 'methane'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 31672
    keywords = ['patch']
    message_count = 23.0
    messages = ['303596', '303618', '303627', '303670', '303671', '303673', '303674', '303675', '303692', '303694', '303696', '303844', '304052', '304053', '304313', '304376', '304392', '306350', '306607', '306651', '309468', '309472', '309475']
    nosy_count = 6.0
    nosy_names = ['barry', 'rhettinger', 'ezio.melotti', 'mrabarnett', 'methane', 'serhiy.storchaka']
    pr_nums = ['3872', '3982', '3990', '4483', '5099']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue31672'
    versions = ['Python 3.7']

    @methane
    Copy link
    Member Author

    methane commented Oct 3, 2017

    Currently, strings.Template uses re.IGNORECASE without re.ASCII:

        idpattern = r'[_a-z][_a-z0-9]*'
        flags = _re.IGNORECASE

    [a-z] matches against 'ı' (0x131, DOTLESS I) and 'ſ' (0x17f, LONG S).
    It is not intentional, and it makes re.compile slower.

    @warsaw
    Copy link
    Member

    warsaw commented Oct 3, 2017

    Technically it is an API change since flags is a part of the public API. The documentation says:

    $identifier names a substitution placeholder matching a mapping key of 
    "identifier". By default, "identifier" is restricted to any case-
    insensitive ASCII alphanumeric string (including underscores) that starts 
    with an underscore or ASCII letter. The first non-identifier character 
    after the $ character terminates this placeholder specification.
    

    This means if someone does subclass string.Template and changes the pattern to accept Unicode identifiers, then with this change they will also have to modify flags, whereas before they didn't.

    It really wasn't ever the intention to allow non-ASCII identifiers, so this is probably safe in practice. OTOH, making the change for performance reasons might be questionable, given that the regular expressions are compiled by the Template's metaclass, so unlikely to contribute significantly to overall performance wins.

    @methane
    Copy link
    Member Author

    methane commented Oct 3, 2017

    This means if someone does subclass string.Template and changes the pattern to accept Unicode identifiers, then with this change they will also have to modify flags, whereas before they didn't.

    Thank you for pointing it out.
    I removed re.A flag after compilation.

    OTOH, making the change for performance reasons might be questionable, given that the regular expressions are compiled by the Template's metaclass, so unlikely to contribute significantly to overall performance wins.

    original: import time: 2310 | 9589 | string
    patched: import time: 1420 | 8702 | string

    We can save about 900 us.

    @methane methane added the type-bug An unexpected behavior, bug, or error label Oct 3, 2017
    @rhettinger
    Copy link
    Contributor

    When optimizing, please don't make API changes.

    @serhiy-storchaka
    Copy link
    Member

    Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.

    @methane
    Copy link
    Member Author

    methane commented Oct 4, 2017

    When optimizing, please don't make API changes.

    This is not only optimization, but bugfix.

    Document of string.Template says:

    By default, "identifier" is restricted to any case-insensitive ASCII alphanumeric string (including underscores) that starts with an underscore or ASCII letter.

    So, missing re.ASCII flag is bug because non-ASCII alphabet can be matched.

    @methane
    Copy link
    Member Author

    methane commented Oct 4, 2017

    Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.

    https://docs.python.org/3.6/library/re.html#regular-expression-syntax says
    (?imsx-imsx:...)

    Anyway, Template.idpattern is documented public API too.

    @methane
    Copy link
    Member Author

    methane commented Oct 4, 2017

    Current pull request override Template.flags = re.I after class creation for backward compatibility, without any API change.

    But I'm not sure it's right approach.
    How many people who subclass string.Template expect non-ASCII match?
    If this change is bugfix for 99% of subclasses, why should we keep pitfall?

    @warsaw
    Copy link
    Member

    warsaw commented Oct 4, 2017

    On Oct 4, 2017, at 02:29, INADA Naoki report@bugs.python.org wrote:

    INADA Naoki songofacandy@gmail.com added the comment:

    Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.

    https://docs.python.org/3.6/library/re.html#regular-expression-syntax says
    (?imsx-imsx:...)

    Anyway, Template.idpattern is documented public API too.

    Too bad, because I like that approach. How hard would it be to add support for ‘a’ as a local flag? (I’m kind of surprised that it isn’t already supported - it seems like it would be useful.)

    I would like this better than hacking Template.flags after the fact. It seems like the right way to align the code with the documentation (i.e. fix a bug).

    @serhiy-storchaka
    Copy link
    Member

    See bpo-31690. But this solution can be used only in 3.7.

    @warsaw
    Copy link
    Member

    warsaw commented Oct 4, 2017

    On Oct 4, 2017, at 10:05, Serhiy Storchaka <report@bugs.python.org> wrote:

    See bpo-31690. But this solution can be used only in 3.7.

    That’s fine. I don’t think this is important enough to backport.

    @serhiy-storchaka
    Copy link
    Member

    Another solution (works in 3.6 too): set idpattern to r'(?-i:[_a-zA-Z][_a-zA-Z0-9]*)'.

    This looks pretty weird, setting the re.IGNORECASE flag and then unsetting it. But it works, and don't break the user code that changes idpattern without changing flags.

    @warsaw
    Copy link
    Member

    warsaw commented Oct 10, 2017

    On Oct 6, 2017, at 15:04, Serhiy Storchaka <report@bugs.python.org> wrote:

    Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:

    Another solution (works in 3.6 too): set idpattern to r'(?-i:[_a-zA-Z][_a-zA-Z0-9]*)'.

    This looks pretty weird, setting the re.IGNORECASE flag and then unsetting it. But it works, and don't break the user code that changes idpattern without changing flags.

    Oh, I think I like that :)

    @warsaw
    Copy link
    Member

    warsaw commented Oct 10, 2017

    With some comments to clarify of course.

    @methane
    Copy link
    Member Author

    methane commented Oct 13, 2017

    New changeset b22273e by INADA Naoki in branch 'master':
    bpo-31672: Fix string.Template accidentally matched non-ASCII identifiers (GH-3872)
    b22273e

    @methane
    Copy link
    Member Author

    methane commented Oct 14, 2017

    New changeset 7060380 by INADA Naoki in branch '3.6':
    bpo-31672: Fix string.Template accidentally matched non-ASCII identifiers (GH-3872)
    7060380

    @methane
    Copy link
    Member Author

    methane commented Oct 14, 2017

    New changeset 073150d by INADA Naoki in branch 'master':
    bpo-31672: doc: Remove one sentence from library/string.rst (GH-3990)
    073150d

    @serhiy-storchaka
    Copy link
    Member

    Is something left to do here?

    In 3.7 r'(?-i:[_a-zA-Z][_a-zA-Z0-9])' can be replaced with r'(?a:[_a-z][_a-z0-9])' if it will add clarity.

    @warsaw
    Copy link
    Member

    warsaw commented Nov 21, 2017

    @inada.naoki - I think changing this to ?a will make things more clear. I submitted a PR for that, and then once it lands, we can close this issue.

    @warsaw
    Copy link
    Member

    warsaw commented Nov 21, 2017

    New changeset e256b40 by Barry Warsaw in branch 'master':
    bpo-31672 - Add one last minor clarification for idpattern (bpo-4483)
    e256b40

    @warsaw warsaw closed this as completed Nov 21, 2017
    @serhiy-storchaka
    Copy link
    Member

    PR 5099 reverts in 3.7 a subtle behavior change made by the fix for this issue. Before merging the fix setting flags to 0 in a Template subclass made the default pattern matching only lower case letters. Currently setting flags to 0 doesn't have any effect.

    @serhiy-storchaka
    Copy link
    Member

    New changeset 87be28f by Serhiy Storchaka in branch 'master':
    bpo-31672: Restore the former behavior when override flags in Template. (bpo-5099)
    87be28f

    @serhiy-storchaka
    Copy link
    Member

    Results.

    There are 4 cases:

    1. default idpattern and flags
    2. overridden only idpattern
    3. overridden only flags
    4. overridden both idpattern and flags

    The case 1 was the one that was broken when this issue was opened. The initial Inada's version fixed the case 1, but broke the case 2. His final version (applied also to 3.6) fixed the case 1, but broke the case 3. This is a win, because cases 1 and 2 look much more common than the case 3. And finally PR 5099 has fixed also the case 3. The case 4 obviously is not affected by any changes of default values.

    Now all four cases are correct in 3.7 and the only broken case in 3.6 is the uncommon case 3.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life topic-regex type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants