Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor makeunicodedata.py: dedupe parsing, use dataclass #81941

Closed
gnprice opened this issue Aug 5, 2019 · 26 comments
Closed

Refactor makeunicodedata.py: dedupe parsing, use dataclass #81941

gnprice opened this issue Aug 5, 2019 · 26 comments
Labels
3.9 only security fixes topic-unicode type-feature A feature request or enhancement

Comments

@gnprice
Copy link
Contributor

gnprice commented Aug 5, 2019

BPO 37760
Nosy @malemburg, @benjaminp, @ezio-melotti, @serhiy-storchaka, @gnprice, @miss-islington
PRs
  • bpo-37760: Avoid cluttering work tree with downloaded Unicode files. #15128
  • bpo-37760: Constant-fold some old options in makeunicodedata. #15129
  • bpo-37760: Factor out the basic UCD parsing logic of makeunicodedata. #15130
  • bpo-37760: Mark all generated Unicode data headers as generated. #15171
  • [3.8] bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171) #15245
  • [3.8] bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171) #15246
  • bpo-37760: Factor out standard range-expanding logic in makeunicodedata. #15248
  • bpo-37760: Convert from length-18 lists to a dataclass, in makeunicodedata. #15265
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-09-12.09:24:26.122>
    created_at = <Date 2019-08-05.03:55:35.555>
    labels = ['type-feature', '3.9', 'expert-unicode']
    title = 'Refactor makeunicodedata.py: dedupe parsing, use dataclass'
    updated_at = <Date 2019-09-18.06:11:56.188>
    user = 'https://github.com/gnprice'

    bugs.python.org fields:

    activity = <Date 2019-09-18.06:11:56.188>
    actor = 'Greg Price'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-09-12.09:24:26.122>
    closer = 'benjamin.peterson'
    components = ['Unicode']
    creation = <Date 2019-08-05.03:55:35.555>
    creator = 'Greg Price'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 37760
    keywords = ['patch']
    message_count = 26.0
    messages = ['349020', '349023', '349518', '349519', '349520', '349524', '349525', '349526', '349548', '349549', '349550', '349588', '349592', '349599', '349602', '349603', '349613', '349630', '349631', '349641', '349671', '349783', '349785', '350053', '352068', '352698']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'benjamin.peterson', 'ezio.melotti', 'serhiy.storchaka', 'Greg Price', 'miss-islington']
    pr_nums = ['15128', '15129', '15130', '15171', '15245', '15246', '15248', '15265']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue37760'
    versions = ['Python 3.9']

    @gnprice
    Copy link
    Contributor Author

    gnprice commented Aug 5, 2019

    I spent some time yesterday on bpo-18236, and I have a patch for it.

    Most of that work happens in the script Tools/unicode/makeunicode.py , and along the way I made several changes there that I found made it somewhat nicer to work on, and I think will help other people reading that script too. I'd like to try to merge those improvements first.

    The main changes are:

    • As the script has grown over the years, it's gained many copies and reimplementations of logic to parse the standard format of the Unicode character database. I factored those out into a single place, which makes the parsing code shorter and the interesting parts stand out more easily.

    • The main per-character record type in the script's data structures is a length-18 tuple. Using the magic of dataclasses, I converted this so that e.g. the code says record.numeric_value instead of record[8].

    There's no radical restructuring or rewrite here; this script has served us well. I've kept these changes focused where there's a high ratio of value, in future ease of development, to cost, in a reviewer's effort as well as mine.

    I'll send PRs of my changes shortly.

    @gnprice gnprice added 3.9 only security fixes topic-unicode type-feature A feature request or enhancement labels Aug 5, 2019
    @gnprice
    Copy link
    Contributor Author

    gnprice commented Aug 5, 2019

    Just posted three PRs:

    Two remaining patches go on top of #59335. Here are drafts, in case they're helpful for reference:

    I figure they may be easiest to review after the PR they depend on is merged, so my plan is to send PRs for them each in turn.

    @benjaminp
    Copy link
    Contributor

    Thanks for working this. In your interested in doing some more hacking on Unicode data, there's bpo-32771.

    @benjaminp
    Copy link
    Contributor

    New changeset ef2af1a by Benjamin Peterson (Greg Price) in branch 'master':
    bpo-37760: Factor out the basic UCD parsing logic of makeunicodedata. (GH-15130)
    ef2af1a

    @miss-islington
    Copy link
    Contributor

    New changeset 4e3dfcc by Miss Islington (bot) (Greg Price) in branch 'master':
    bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171)
    4e3dfcc

    @benjaminp
    Copy link
    Contributor

    New changeset 99d208e by Benjamin Peterson (Greg Price) in branch 'master':
    bpo-37760: Constant-fold some old options in makeunicodedata. (GH-15129)
    99d208e

    @miss-islington
    Copy link
    Contributor

    New changeset c2b9d9f by Miss Islington (bot) in branch '3.8':
    bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171)
    c2b9d9f

    @miss-islington
    Copy link
    Contributor

    New changeset b02e148 by Miss Islington (bot) in branch '3.8':
    bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171)
    b02e148

    @vstinner
    Copy link
    Member

    $ find -name "*db.h"
    ./Modules/unicodedata_db.h
    ./Modules/unicodename_db.h
    ./Objects/unicodetype_db.h

    For .gitattributes, I would prefer to use unicode*_db.h pattern rather than "a generic" *_db.h, to avoid masking other potential file called "something_db.h" in the future in GitHub. But it's not really an issue right now ;-)

    @vstinner
    Copy link
    Member

    "from typing import *"

    I like to run pyflakes time to time on the Python code base. Please avoid "import *" since it prevents pyflakes (and other code analyzers) to find bugs. Example:

    $ pyflakes Tools/unicode/makeunicodedata.py
    Tools/unicode/makeunicodedata.py:35: 'from typing import *' used; unable to detect undefined names
    Tools/unicode/makeunicodedata.py:921: 'Iterator' may be undefined, or defined from star imports: typing
    Tools/unicode/makeunicodedata.py:921: 'List' may be undefined, or defined from star imports: typing
    Tools/unicode/makeunicodedata.py:921: 'Iterator' may be undefined, or defined from star imports: typing
    ...

    @malemburg
    Copy link
    Member

    BTW: Since when do we use type annotations in Python's stdlib ?

    @gnprice
    Copy link
    Contributor Author

    gnprice commented Aug 13, 2019

    I like to run pyflakes time to time on the Python code base. Please avoid "import *" since it prevents pyflakes (and other code analyzers) to find bugs.

    Ah fair enough, thanks!

    Pushed that change to the next/current PR, #59453.

    @gnprice
    Copy link
    Contributor Author

    gnprice commented Aug 13, 2019

    BTW: Since when do we use type annotations in Python's stdlib ?

    Hmm, interesting question!

    At a quick grep, it's in a handful of places in the stdlib: asyncio, functools, importlib. The earliest it appeared was in 3.7.0a4.

    It's in more places in the test suite, which I think is a closer parallel to this maintainer script in Tools/.

    The typing module itself is in the stdlib, so I don't see any obstacle to using it more widely. I imagine the main reason it doesn't appear more widely already is simply that it's new, and most of the stdlib is quite stable.

    @serhiy-storchaka
    Copy link
    Member

    What is the minimal Python version for developing CPython? The system Python 3 on current Ubuntu LTS (18.04) is 3.6, so I think it should not be larger.

    @gnprice
    Copy link
    Contributor Author

    gnprice commented Aug 13, 2019

    What is the minimal Python version for developing CPython? The system Python 3 on current Ubuntu LTS (18.04) is 3.6, so I think it should not be larger.

    Ah, I think my previous message had an ambiguous parse: the earliest that *uses* of the typing module appeared in the stdlib was 3.7. The typing module has been around longer than that.

    I just checked and python3.6 [Tools/unicode/makeunicodedata.py](https://github.com/python/cpython/blob/main/Tools/unicode/makeunicodedata.py) works fine, both at master and with #59453.

    I think it would be OK for doing development on CPython to require the latest minor version (i.e. 3.7) -- after all, if you're doing development, you're already building it, so you can always get a newer version than your system provides if needed. But happily the question is moot here, so I guess the place to discuss that further would be a new thread.

    @serhiy-storchaka
    Copy link
    Member

    I just checked and python3.6 [Tools/unicode/makeunicodedata.py](https://github.com/python/cpython/blob/main/Tools/unicode/makeunicodedata.py) works fine, both at master and with #59453.

    This is good. But the title mentioned dataclasses, and they are 3.7+.

    @gnprice
    Copy link
    Contributor Author

    gnprice commented Aug 13, 2019

    This is good. But the title mentioned dataclasses, and they are 3.7+.

    Ahh, sorry, I think now I understand you. :-)

    Indeed, when I switch to the branch with that change (gnprice@2b4aec4dd -- it comes after the patch that's #59453, so I haven't yet sent it as a PR), then python3.6 [Tools/unicode/makeunicodedata.py](https://github.com/python/cpython/blob/main/Tools/unicode/makeunicodedata.py) no longer works.

    I think this is fine. Most of all that's because this always works:

    ./python [Tools/unicode/makeunicodedata.py](https://github.com/python/cpython/blob/main/Tools/unicode/makeunicodedata.py)
    

    Anyone who's going to be running that script will want to build a ./python right afterward, in order to at least run the tests. So it doesn't seem like much trouble to do the build first and then run the script (and then a quick rebuild for the handful of changed files), if indeed the person doesn't already have a ./python lying around.

    In fact ./python is exactly what I used most of the time to run this script when I was developing these changes, simply because it seemed like the natural thing to do.

    @benjaminp
    Copy link
    Contributor

    From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.

    @benjaminp
    Copy link
    Contributor

    New changeset c03e698 by Benjamin Peterson (Greg Price) in branch 'master':
    bpo-37760: Factor out standard range-expanding logic in makeunicodedata. (GH-15248)
    c03e698

    @gnprice
    Copy link
    Contributor Author

    gnprice commented Aug 14, 2019

    From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.

    Yeah, fair concern. In fact I think I'm on video (from PyCon 2018) warning everyone not to do that in their codebases, because what you really don't want is a bunch of annotations that have gradually accumulated falsehoods as the code has changed around them.

    Still, I think from "some annotations + no checking" the good equilibrium to land in "some annotations + checking", not "no annotations + no checking". (I do mean "some" -- I don't predict we'll ever go sweep all over adding them.) And I think the highest-probability way to get there is to let them continue to accumulate where people occasionally add them in new/revised code... because that holds a door open for someone to step up to start checking them, and then to do the work to make that part of CI. (That someone might even be me! But I can think of plenty of other likely folks to do it.)

    If we accumulated quite a lot of them and nobody had yet stepped up to make checking happen, I'd worry. But with the smattering we currently have, I think that point is far off.

    @vstinner
    Copy link
    Member

    From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.

    Even if unchecked, type annotations can serve as builtin documentation, as docstrings (even when docstrings are not checked ;-)).

    Well, I don't have a strong opinion on type annotations in the Python stdlib. But here we are talking about a tool which is not installed, so I don't think that it matters much. It's not really part of the "stdlib".

    @benjaminp
    Copy link
    Contributor

    New changeset 3e4498d by Benjamin Peterson (Greg Price) in branch 'master':
    bpo-37760: Avoid cluttering work tree with downloaded Unicode files. (GH-15128)
    3e4498d

    @benjaminp
    Copy link
    Contributor

    On Wed, Aug 14, 2019, at 03:25, STINNER Victor wrote:

    STINNER Victor <vstinner@redhat.com> added the comment:

    > From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.

    Even if unchecked, type annotations can serve as builtin documentation,
    as docstrings (even when docstrings are not checked ;-)).

    Yes, but no one has the expectation that docstrings are automatically verified in any way. :)

    @gnprice
    Copy link
    Contributor Author

    gnprice commented Aug 21, 2019

    (A bit easy to miss in the way this thread gets displayed, so to highlight in a comment: #59470 is up, following the 5 other patches which have now all been merged. That's the one that replaces the length-18 tuples with a dataclass.)

    @benjaminp
    Copy link
    Contributor

    New changeset a65678c by Benjamin Peterson (Greg Price) in branch 'master':
    bpo-37760: Convert from length-18 lists to a dataclass, in makeunicodedata. (GH-15265)
    a65678c

    @gnprice
    Copy link
    Contributor Author

    gnprice commented Sep 18, 2019

    Thanks Benjamin for reviewing and merging this series!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes topic-unicode type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants