classification
Title: Refactor makeunicodedata.py: dedupe parsing, use dataclass
Type: enhancement Stage: resolved
Components: Unicode Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Greg Price, benjamin.peterson, ezio.melotti, lemburg, miss-islington, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2019-08-05 03:55 by Greg Price, last changed 2019-09-18 06:11 by Greg Price. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 15128 merged Greg Price, 2019-08-05 04:00
PR 15129 merged Greg Price, 2019-08-05 04:04
PR 15130 merged Greg Price, 2019-08-05 04:07
PR 15171 merged Greg Price, 2019-08-08 04:54
PR 15245 closed miss-islington, 2019-08-13 05:23
PR 15246 merged miss-islington, 2019-08-13 05:48
PR 15248 merged Greg Price, 2019-08-13 07:52
PR 15265 merged Greg Price, 2019-08-14 05:00
Messages (26)
msg349020 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-05 03:55
I spent some time yesterday on #18236, and I have a patch for it.

Most of that work happens in the script Tools/unicode/makeunicode.py , and along the way I made several changes there that I found made it somewhat nicer to work on, and I think will help other people reading that script too.  I'd like to try to merge those improvements first.

The main changes are:

 * As the script has grown over the years, it's gained many copies and reimplementations of logic to parse the standard format of the Unicode character database.  I factored those out into a single place, which makes the parsing code shorter and the interesting parts stand out more easily.

 * The main per-character record type in the script's data structures is a length-18 tuple.  Using the magic of dataclasses, I converted this so that e.g. the code says `record.numeric_value` instead of `record[8]`.

There's no radical restructuring or rewrite here; this script has served us well.  I've kept these changes focused where there's a high ratio of value, in future ease of development, to cost, in a reviewer's effort as well as mine.

I'll send PRs of my changes shortly.
msg349023 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-05 04:25
Just posted three PRs:
 * GH-15128 and GH-15129 are both quite small
 * GH-15130 is the first of two patches factoring out common parsing logic.

Two remaining patches go on top of GH-15130.  Here are drafts, in case they're helpful for reference:
* Patch 2/2 factoring out common parsing logic: https://github.com/gnprice/cpython/commit/0a32a7111
* Patch converting the big tuples to a dataclass: https://github.com/gnprice/cpython/commit/6d8103bbc

I figure they may be easiest to review after the PR they depend on is merged, so my plan is to send PRs for them each in turn.
msg349518 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-13 05:10
Thanks for working this. In your interested in doing some more hacking on Unicode data, there's #32771.
msg349519 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-13 05:21
New changeset ef2af1ad44be0542a47270d5173a0b920c3a450d by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37760: Factor out the basic UCD parsing logic of makeunicodedata. (GH-15130)
https://github.com/python/cpython/commit/ef2af1ad44be0542a47270d5173a0b920c3a450d
msg349520 - (view) Author: miss-islington (miss-islington) Date: 2019-08-13 05:23
New changeset 4e3dfcc4b987e683476a1b16456e57d3c9f581cb by Miss Islington (bot) (Greg Price) in branch 'master':
bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171)
https://github.com/python/cpython/commit/4e3dfcc4b987e683476a1b16456e57d3c9f581cb
msg349524 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-13 05:59
New changeset 99d208efed97e02d813e8166925b998bbd0d3993 by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37760: Constant-fold some old options in makeunicodedata. (GH-15129)
https://github.com/python/cpython/commit/99d208efed97e02d813e8166925b998bbd0d3993
msg349525 - (view) Author: miss-islington (miss-islington) Date: 2019-08-13 06:04
New changeset c2b9d9f202e4a99fc0800b7a0f0944ac4c2382e3 by Miss Islington (bot) in branch '3.8':
bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171)
https://github.com/python/cpython/commit/c2b9d9f202e4a99fc0800b7a0f0944ac4c2382e3
msg349526 - (view) Author: miss-islington (miss-islington) Date: 2019-08-13 06:06
New changeset b02e148a0d6e7a11df93a09ea5f4e1b0ad9b77b8 by Miss Islington (bot) in branch '3.8':
bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171)
https://github.com/python/cpython/commit/b02e148a0d6e7a11df93a09ea5f4e1b0ad9b77b8
msg349548 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-13 12:33
$ find -name "*db.h"
./Modules/unicodedata_db.h
./Modules/unicodename_db.h
./Objects/unicodetype_db.h

For .gitattributes, I would prefer to use unicode*_db.h pattern rather than "a generic" *_db.h, to avoid masking other potential file called "something_db.h" in the future in GitHub. But it's not really an issue right now ;-)
msg349549 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-13 12:35
"from typing import *"

I like to run pyflakes time to time on the Python code base. Please avoid "import *" since it prevents pyflakes (and other code analyzers) to find bugs. Example:

$ pyflakes Tools/unicode/makeunicodedata.py
Tools/unicode/makeunicodedata.py:35: 'from typing import *' used; unable to detect undefined names
Tools/unicode/makeunicodedata.py:921: 'Iterator' may be undefined, or defined from star imports: typing
Tools/unicode/makeunicodedata.py:921: 'List' may be undefined, or defined from star imports: typing
Tools/unicode/makeunicodedata.py:921: 'Iterator' may be undefined, or defined from star imports: typing
...
msg349550 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2019-08-13 12:49
BTW: Since when do we use type annotations in Python's stdlib ?
msg349588 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-13 17:33
> I like to run pyflakes time to time on the Python code base. Please avoid "import *" since it prevents pyflakes (and other code analyzers) to find bugs.

Ah fair enough, thanks!

Pushed that change to the next/current PR, GH-15248.
msg349592 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-13 17:46
> BTW: Since when do we use type annotations in Python's stdlib ?

Hmm, interesting question!

At a quick grep, it's in a handful of places in the stdlib: asyncio, functools, importlib. The earliest it appeared was in 3.7.0a4.

It's in more places in the test suite, which I think is a closer parallel to this maintainer script in Tools/.

The typing module itself is in the stdlib, so I don't see any obstacle to using it more widely. I imagine the main reason it doesn't appear more widely already is simply that it's new, and most of the stdlib is quite stable.
msg349599 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-08-13 18:22
What is the minimal Python version for developing CPython? The system Python 3 on current Ubuntu LTS (18.04) is 3.6, so I think it should not be larger.
msg349602 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-13 18:48
> What is the minimal Python version for developing CPython? The system Python 3 on current Ubuntu LTS (18.04) is 3.6, so I think it should not be larger.

Ah, I think my previous message had an ambiguous parse: the earliest that *uses* of the typing module appeared in the stdlib was 3.7. The typing module has been around longer than that.

I just checked and `python3.6 Tools/unicode/makeunicodedata.py` works fine, both at master and with GH-15248.

I think it would be OK for doing development on CPython to require the latest minor version (i.e. 3.7) -- after all, if you're doing development, you're already building it, so you can always get a newer version than your system provides if needed.  But happily the question is moot here, so I guess the place to discuss that further would be a new thread.
msg349603 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-08-13 18:56
> I just checked and `python3.6 Tools/unicode/makeunicodedata.py` works fine, both at master and with GH-15248.

This is good. But the title mentioned dataclasses, and they are 3.7+.
msg349613 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-13 20:02
> This is good. But the title mentioned dataclasses, and they are 3.7+.

Ahh, sorry, I think now I understand you. :-)

Indeed, when I switch to the branch with that change (https://github.com/gnprice/cpython/commit/2b4aec4dd -- it comes after the patch that's GH-15248, so I haven't yet sent it as a PR), then `python3.6 Tools/unicode/makeunicodedata.py` no longer works.

I think this is fine. Most of all that's because this always works:

    ./python Tools/unicode/makeunicodedata.py

Anyone who's going to be running that script will want to build a `./python` right afterward, in order to at least run the tests. So it doesn't seem like much trouble to do the build first and then run the script (and then a quick rebuild for the handful of changed files), if indeed the person doesn't already have a `./python` lying around.

In fact `./python` is exactly what I used most of the time to run this script when I was developing these changes, simply because it seemed like the natural thing to do.
msg349630 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-14 02:27
From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.
msg349631 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-14 02:28
New changeset c03e698c344dfc557555b6b07a3ee2702e45f6ee by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37760: Factor out standard range-expanding logic in makeunicodedata. (GH-15248)
https://github.com/python/cpython/commit/c03e698c344dfc557555b6b07a3ee2702e45f6ee
msg349641 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-14 04:55
> From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.

Yeah, fair concern. In fact I think I'm on video (from PyCon 2018) warning everyone not to do that in their codebases, because what you really don't want is a bunch of annotations that have gradually accumulated falsehoods as the code has changed around them.

Still, I think from "some annotations + no checking" the good equilibrium to land in "some annotations + checking", not "no annotations + no checking". (I do mean "some" -- I don't predict we'll ever go sweep all over adding them.) And I think the highest-probability way to get there is to let them continue to accumulate where people occasionally add them in new/revised code... because that holds a door open for someone to step up to start checking them, and then to do the work to make that part of CI. (That someone might even be me! But I can think of plenty of other likely folks to do it.)

If we accumulated quite a lot of them and nobody had yet stepped up to make checking happen, I'd worry.  But with the smattering we currently have, I think that point is far off.
msg349671 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-14 10:25
> From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.

Even if unchecked, type annotations can serve as builtin documentation, as docstrings (even when docstrings are not checked ;-)).

Well, I don't have a strong opinion on type annotations in the Python stdlib. But here we are talking about a tool which is not installed, so I don't think that it matters much. It's not really part of the "stdlib".
msg349783 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-15 01:18
New changeset 3e4498d35c34aeaf4a9c3d57509b0d3277048ac6 by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37760: Avoid cluttering work tree with downloaded Unicode files. (GH-15128)
https://github.com/python/cpython/commit/3e4498d35c34aeaf4a9c3d57509b0d3277048ac6
msg349785 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-15 01:24
On Wed, Aug 14, 2019, at 03:25, STINNER Victor wrote:
> 
> STINNER Victor <vstinner@redhat.com> added the comment:
> 
> > From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.
> 
> Even if unchecked, type annotations can serve as builtin documentation, 
> as docstrings (even when docstrings are not checked ;-)).

Yes, but no one has the expectation that docstrings are automatically verified in any way. :)
msg350053 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-21 05:48
(A bit easy to miss in the way this thread gets displayed, so to highlight in a comment: GH-15265 is up, following the 5 other patches which have now all been merged.  That's the one that replaces the length-18 tuples with a dataclass.)
msg352068 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-09-12 09:23
New changeset a65678c5c90002c5e40fa82746de07e6217df625 by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37760: Convert from length-18 lists to a dataclass, in makeunicodedata. (GH-15265)
https://github.com/python/cpython/commit/a65678c5c90002c5e40fa82746de07e6217df625
msg352698 - (view) Author: Greg Price (Greg Price) * Date: 2019-09-18 06:11
Thanks Benjamin for reviewing and merging this series!
History
Date User Action Args
2019-09-18 06:11:56Greg Pricesetmessages: + msg352698
2019-09-12 16:11:12vstinnersetnosy: - vstinner
2019-09-12 09:24:26benjamin.petersonsetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2019-09-12 09:23:46benjamin.petersonsetmessages: + msg352068
2019-08-21 05:48:50Greg Pricesetmessages: + msg350053
2019-08-15 01:24:14benjamin.petersonsetmessages: + msg349785
2019-08-15 01:18:59benjamin.petersonsetmessages: + msg349783
2019-08-14 10:25:36vstinnersetmessages: + msg349671
2019-08-14 05:00:06Greg Pricesetpull_requests: + pull_request14985
2019-08-14 04:55:24Greg Pricesetmessages: + msg349641
2019-08-14 02:28:45benjamin.petersonsetmessages: + msg349631
2019-08-14 02:27:14benjamin.petersonsetmessages: + msg349630
2019-08-13 20:02:05Greg Pricesetmessages: + msg349613
2019-08-13 18:56:02serhiy.storchakasetmessages: + msg349603
2019-08-13 18:48:18Greg Pricesetmessages: + msg349602
2019-08-13 18:22:50serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg349599
2019-08-13 17:46:47Greg Pricesetmessages: + msg349592
2019-08-13 17:33:32Greg Pricesetmessages: + msg349588
2019-08-13 12:49:27lemburgsetnosy: + lemburg
messages: + msg349550
2019-08-13 12:35:15vstinnersetmessages: + msg349549
2019-08-13 12:33:28vstinnersetmessages: + msg349548
2019-08-13 07:52:27Greg Pricesetpull_requests: + pull_request14969
2019-08-13 06:06:19miss-islingtonsetmessages: + msg349526
2019-08-13 06:04:24miss-islingtonsetmessages: + msg349525
2019-08-13 05:59:33benjamin.petersonsetmessages: + msg349524
2019-08-13 05:48:10miss-islingtonsetpull_requests: + pull_request14967
2019-08-13 05:23:51miss-islingtonsetpull_requests: + pull_request14966
2019-08-13 05:23:45miss-islingtonsetnosy: + miss-islington
messages: + msg349520
2019-08-13 05:21:06benjamin.petersonsetmessages: + msg349519
2019-08-13 05:10:33benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg349518
2019-08-08 04:54:07Greg Pricesetpull_requests: + pull_request14903
2019-08-05 04:25:24Greg Pricesetmessages: + msg349023
2019-08-05 04:07:35Greg Pricesetpull_requests: + pull_request14870
2019-08-05 04:04:27Greg Pricesetpull_requests: + pull_request14869
2019-08-05 04:00:50Greg Pricesetkeywords: + patch
stage: patch review
pull_requests: + pull_request14868
2019-08-05 03:55:35Greg Pricecreate