Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PEP 515: Tokenizer: allow underscores for grouping in numeric literals #70519

Closed
birkenfeld opened this issue Feb 10, 2016 · 41 comments
Closed
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-feature A feature request or enhancement

Comments

@birkenfeld
Copy link
Member

BPO 26331
Nosy @brettcannon, @birkenfeld, @rhettinger, @mdickinson, @scoder, @ericvsmith, @ned-deily, @encukou, @skrah, @ethanfurman, @serhiy-storchaka, @1st1, @zhangyangyu
Files
  • numeric_underscores_strict.patch
  • numeric_underscores_v4_full.diff
  • numeric_underscores_final_v6.diff
  • numeric_underscores_final_v7.diff
  • numeric_underscores_final_v8.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/brettcannon'
    closed_at = <Date 2016-09-09.21:58:37.474>
    created_at = <Date 2016-02-10.17:50:24.454>
    labels = ['interpreter-core', 'type-feature', 'release-blocker']
    title = 'PEP 515: Tokenizer: allow underscores for grouping in numeric literals'
    updated_at = <Date 2016-09-11.16:20:07.708>
    user = 'https://github.com/birkenfeld'

    bugs.python.org fields:

    activity = <Date 2016-09-11.16:20:07.708>
    actor = 'georg.brandl'
    assignee = 'brett.cannon'
    closed = True
    closed_date = <Date 2016-09-09.21:58:37.474>
    closer = 'brett.cannon'
    components = ['Interpreter Core']
    creation = <Date 2016-02-10.17:50:24.454>
    creator = 'georg.brandl'
    dependencies = []
    files = ['41893', '41896', '42854', '42887', '42939']
    hgrepos = []
    issue_num = 26331
    keywords = ['patch']
    message_count = 41.0
    messages = ['260026', '260029', '260031', '260033', '260036', '260037', '260046', '260057', '260077', '260081', '260093', '260102', '260229', '260230', '260231', '260232', '260233', '260235', '260239', '260240', '260241', '262037', '262043', '262050', '265587', '265589', '265616', '265804', '265821', '265825', '266024', '266058', '272834', '273384', '275148', '275262', '275364', '275461', '275463', '275551', '275804']
    nosy_count = 14.0
    nosy_names = ['brett.cannon', 'georg.brandl', 'rhettinger', 'mark.dickinson', 'scoder', 'eric.smith', 'ned.deily', 'petr.viktorin', 'skrah', 'ethan.furman', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'xiang.zhang']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue26331'
    versions = ['Python 3.6']

    @birkenfeld
    Copy link
    Member Author

    As discussed on python-ideas: https://mail.python.org/pipermail/python-ideas/2016-February/038354.html

    The rules are:
    Underscores are allowed anywhere in numeric literals, except:

    • at the beginning of a literal (obviously)
    • at the end of a literal
    • directly after a dot (since the underscore could start an attribute name)
    • directly after a sign in exponents (for consistency with leading signs)
    • in the middle of the "0x", "0o" or "0b" base specifiers

    Currently this only touches literals, not the inputs of int() or float(). Whether they should accept this syntax is debatable (I'd vote no).

    Otherwise missing: doc updates.

    Review question: is PyMem_RawStrdup/RawFree the right API to use here?

    @birkenfeld birkenfeld added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Feb 10, 2016
    @serhiy-storchaka
    Copy link
    Member

    I prefer simpler and more strict rule:

    • Underscores are allowed only between digits in numeric literals.

    Thus 1__2, 12_, 1_.2, 1_e2, 1e_2, 1_j, 0x_12 are not allowed.

    It is easier to make the rule more lenient later if it will be needed.

    @birkenfeld
    Copy link
    Member Author

    It sure is more strict, but I don't think it's simpler (and it's definitely not simpler to implement).

    (Also 1_j is pretty nice, I wouldn't want to lose that.)

    We can also check what other languages do.

    • Rust: very much like this, but trailing underscores allowed.

    • Perl 5: same as here, but underscores after dot and trailing underscores allowed.

    • Ruby: only between digits.

    • Swift: the grammar productions say it's basically the same as Rust. The textual description says "between digits".

    @serhiy-storchaka
    Copy link
    Member

    • Java: only between digits. [1]
    • Julia: only between digits. [2] (not well specified)
    • C# 7.0 (proposal): only between digits, but adjacent underscores allowed. [3]
    • Ada: only between digits. [4] (strong but very simple rules)
    • D: very much like proposed patch, but trailing underscores allowed. [5]
    • Perl 5: only between digits as documented (23__500 is not legal), but actually more lenient. [6]

    [1] https://docs.oracle.com/javase/7/docs/technotes/guides/language/underscores-literals.html
    [2] http://docs.julialang.org/en/release-0.4/manual/integers-and-floating-point-numbers/
    [3] dotnet/roslyn#216
    [4] http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#2.4
    [5] http://dlang.org/spec/lex.html#integerliteral
    [6] http://perldoc.perl.org/perldata.html#Scalar-value-constructors

    @1st1
    Copy link
    Member

    1st1 commented Feb 10, 2016

    I prefer simpler and more strict rule:

    • Underscores are allowed only between digits in numeric literals.

    +1. But in any case we need a PEP for this change.

    @serhiy-storchaka
    Copy link
    Member

    C++14 uses the same strict rule as Ada, but uses apostrphes instead of underscores. [1]

    Thus there are two groups of languages, implementing strict or lenient rules:

    • Strict: Ada, C++, Java, C#, Ruby, Julia, Perl (as documented), Swift (textual description).
    • Lenient: D, Rust, Perl (actually), Swift (grammar productions).

    [1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3499.html

    @birkenfeld
    Copy link
    Member Author

    PEP-515 is written up and posted to python-dev.

    @encukou
    Copy link
    Member

    encukou commented Feb 10, 2016

    Regarding the patch: if trailing underscores are not allowed, 0 if 1_____else 1 should be illegal.

    @birkenfeld
    Copy link
    Member Author

    New patch matching revision of PEP.

    @serhiy-storchaka
    Copy link
    Member

    Proposed patch implements strict underscore rules. The implementation is not more complex.

    @birkenfeld
    Copy link
    Member Author

    This patch includes int(), float(), complex() operations, as well as _pydecimal.

    @birkenfeld
    Copy link
    Member Author

    New patch with minimal doc updates.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 13, 2016

    I like the feature for literals, but I'm not sure about conversions from string. It slows down the conversion for (IMO) a very small benefit.

    Other languages allow it, but I've never attempted to use the feature:

    $ ocaml
            OCaml version 4.02.1

    # float_of_string "__1____2____.___e___101_";;

    • : float = 1.2e+102

    @birkenfeld
    Copy link
    Member Author

    It's mostly for consistency. For example, int(x, 0) is defined by the docs as "interpret x as in a literal". Other bases have special cases as well, e.g. "0x" is accepted by base 16.

    In the current version of the conversions, the string is scanned for "_" before doing the more expensive allocation+copy.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 13, 2016

    If the string conversions stay, may I suggest two functions:

    1. PyUnicode_NumericAsAscii()
    2. PyUnicode_NumericAsAsciiWS()

    The first one eliminates only underscores, the second one both
    underscores and leading/trailing whitespace.

    Decimal must support both:

    https://hg.python.org/cpython/file/default/Modules/_decimal/_decimal.c#l1890

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 13, 2016

    Correction: The explanation of the functions should be reversed.

    @birkenfeld
    Copy link
    Member Author

    Thanks, I hadn't looked at cdecimal yet - I was planning to ask you to do the necessary changes there :)

    But there are a few versions of this (e.g. converting unicode digits to ASCII) scattered throughout the codebase, it would make sense to consolidate on this occasion.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 13, 2016

    Georg Brandl added the comment:

    Thanks, I hadn't looked at cdecimal yet - I was planning to ask you to do the necessary changes there :)

    Oh, well. :)

    But there are a few versions of this (e.g. converting unicode digits to ASCII) scattered throughout the codebase, it would make sense to consolidate on this occasion.

    Yes, actually I have to look at the _decimal version again, it contains
    some optimizations that may only work for _decimal:

    https://hg.python.org/cpython/file/default/Modules/_decimal/_decimal.c#l1943

    I *did* optimize it for speed at the time, I hope general functions won't be
    slower.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Feb 13, 2016

    I still wonder about the complexity of all this for decimal. We now have two grammars on top of each other, this being the actual one for decimal:

    http://speleotrove.com/decimal/daconvs.html

    For string conversions I'd prefer a lax way (similar to OCaml) that would somehow be specified in terms of preprocessing, same as the leading/trailing whitespace removal. Short of "ignore all underscores" it isn't easy though.

    @birkenfeld
    Copy link
    Member Author

    Hm. On the one hand there is a spec, so it can be argued that underscores don't belong to Decimal.

    On the other hand, if we get Decimal literals at one point, there will be a strong argument for allowing underscores in them as in all other number literals.

    Although supporting them in strings can also be added at that time.

    @birkenfeld
    Copy link
    Member Author

    Raymond, you've also worked on Decimal - do you have an opinion on allowing underscores in Decimal(string) conversions?

    @scoder
    Copy link
    Contributor

    scoder commented Mar 19, 2016

    Nice one. While reimplementing it for Cython, I noticed that the grammar described in the PEP isn't exactly as it's implemented, though. The grammar says

    digit (["_"] digit)*
    

    whereas the latest patch (v4) says

    `digit` (`digit` | "_")*
    

    and also implements it that way. The former doesn't allow underscores at the end of a literal.

    And the regexes in tokenize.py seem happy to accept "0x___", for example. Is that intended?

    @birkenfeld
    Copy link
    Member Author

    The last patch isn't up to date with the PEP; Serhiy's patch is the closest one.

    @scoder
    Copy link
    Contributor

    scoder commented Mar 19, 2016

    Ah, thanks. Here's my implementation then:

    https://github.com/cython/cython/pull/499/files

    It seems that tests for valid complex literals are missing. I've added these to the end of the list:

    '1_00_00.5j',
    '1_00_00.5e5',
    '1_00_00j',
    '1_00_00e5_1',
    '1e1_0',
    '.1_4',
    '.1_4e1',
    '.1_4j',
    

    @birkenfeld
    Copy link
    Member Author

    New patch; implements the accepted version of the PEP. I added the additional tests, thanks Stefan!

    @birkenfeld
    Copy link
    Member Author

    Note: the changes for format()ting ("_" as thousands separator) are still missing. Eric, would you consider doing this part?

    @ericvsmith
    Copy link
    Member

    Yes, I'll read PEP-515 and work on the formatting.

    @birkenfeld
    Copy link
    Member Author

    Thanks Eric!

    Serhiy, do you want to do a review? The v6/v7 patches are based on your "strict" patch with the constructor changes adapted from v4.

    New version v7 addresses the review comments from Stefan and Martin.

    @serhiy-storchaka
    Copy link
    Member

    Added comments on Rietveld.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented May 18, 2016

    Thanks, Georg! The decimal parts look good to me. I understand that
    people wonder about the relaxed rules for Decimal -- we have discussed
    that here:

    https://mail.python.org/pipermail/python-dev/2016-March/143557.html
    

    I don't think that it will be a problem in practice.

    @ericvsmith
    Copy link
    Member

    I've created bpo-27080 to track the formatting part of this.

    @birkenfeld
    Copy link
    Member Author

    Thanks for the detailed review, Serhiy! Next try incoming.

    @birkenfeld
    Copy link
    Member Author

    @Serhiy/anyone: can I get another review, so that we can commit this in time for beta? Thanks!

    @vstinner vstinner changed the title Tokenizer: allow underscores for grouping in numeric literals PEP 515: Tokenizer: allow underscores for grouping in numeric literals Aug 17, 2016
    @zhangyangyu
    Copy link
    Member

    Hi Georg, I left several comments on Rietveld. Hope it helps.

    @brettcannon
    Copy link
    Member

    Georg, do you think you will be able to get this in for 3.6b1? If not I can commit it while I'm at the core sprint. I'll wait until tomorrow to see if you reply, otherwise I'm just going to address patch comments and then commit it on your behalf.

    @birkenfeld
    Copy link
    Member Author

    Please go ahead. Thanks for taking care of this!

    @brettcannon
    Copy link
    Member

    I'll get this committed today (patch still applies and passes the tests, so it should only take addressing the review comments and the What's New entry).

    @brettcannon brettcannon self-assigned this Sep 9, 2016
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 9, 2016

    New changeset 8a881dafe335 by Brett Cannon in branch 'default':
    Issue bpo-26331: Implement the parsing part of PEP-515.
    https://hg.python.org/cpython/rev/8a881dafe335

    @brettcannon
    Copy link
    Member

    All applied! And Eric said he will handle the patch for format() which should cover the other half of PEP-515. Once Eric's side is done I guess we can mark PEP-515 as final.

    @ericvsmith
    Copy link
    Member

    I'm done with the formatting (bpo-27080), so PEP-515 can be marked as final.

    @birkenfeld
    Copy link
    Member Author

    Thanks Brett!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    8 participants