Issue 37760: Refactor makeunicodedata.py: dedupe parsing, use dataclass

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/81941

classification

Title:	Refactor makeunicodedata.py: dedupe parsing, use dataclass
Type:	enhancement	Stage:	resolved
Components:	Unicode	Versions:	Python 3.9

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Greg Price, benjamin.peterson, ezio.melotti, lemburg, miss-islington, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2019-08-05 03:55 by Greg Price, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 15128	merged	Greg Price, 2019-08-05 04:00
PR 15129	merged	Greg Price, 2019-08-05 04:04
PR 15130	merged	Greg Price, 2019-08-05 04:07
PR 15171	merged	Greg Price, 2019-08-08 04:54
PR 15245	closed	miss-islington, 2019-08-13 05:23
PR 15246	merged	miss-islington, 2019-08-13 05:48
PR 15248	merged	Greg Price, 2019-08-13 07:52
PR 15265	merged	Greg Price, 2019-08-14 05:00

Messages (26)
msg349020 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-08-05 03:55
I spent some time yesterday on #18236, and I have a patch for it. Most of that work happens in the script Tools/unicode/makeunicode.py , and along the way I made several changes there that I found made it somewhat nicer to work on, and I think will help other people reading that script too. I'd like to try to merge those improvements first. The main changes are: * As the script has grown over the years, it's gained many copies and reimplementations of logic to parse the standard format of the Unicode character database. I factored those out into a single place, which makes the parsing code shorter and the interesting parts stand out more easily. * The main per-character record type in the script's data structures is a length-18 tuple. Using the magic of dataclasses, I converted this so that e.g. the code says `record.numeric_value` instead of `record[8]`. There's no radical restructuring or rewrite here; this script has served us well. I've kept these changes focused where there's a high ratio of value, in future ease of development, to cost, in a reviewer's effort as well as mine. I'll send PRs of my changes shortly.
msg349023 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-08-05 04:25
Just posted three PRs: * GH-15128 and GH-15129 are both quite small * GH-15130 is the first of two patches factoring out common parsing logic. Two remaining patches go on top of GH-15130. Here are drafts, in case they're helpful for reference: * Patch 2/2 factoring out common parsing logic: https://github.com/gnprice/cpython/commit/0a32a7111 * Patch converting the big tuples to a dataclass: https://github.com/gnprice/cpython/commit/6d8103bbc I figure they may be easiest to review after the PR they depend on is merged, so my plan is to send PRs for them each in turn.
msg349518 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2019-08-13 05:10
Thanks for working this. In your interested in doing some more hacking on Unicode data, there's #32771.
msg349519 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2019-08-13 05:21
New changeset ef2af1ad44be0542a47270d5173a0b920c3a450d by Benjamin Peterson (Greg Price) in branch 'master': bpo-37760: Factor out the basic UCD parsing logic of makeunicodedata. (GH-15130) https://github.com/python/cpython/commit/ef2af1ad44be0542a47270d5173a0b920c3a450d
msg349520 - (view)	Author: miss-islington (miss-islington)	Date: 2019-08-13 05:23
New changeset 4e3dfcc4b987e683476a1b16456e57d3c9f581cb by Miss Islington (bot) (Greg Price) in branch 'master': bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171) https://github.com/python/cpython/commit/4e3dfcc4b987e683476a1b16456e57d3c9f581cb
msg349524 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2019-08-13 05:59
New changeset 99d208efed97e02d813e8166925b998bbd0d3993 by Benjamin Peterson (Greg Price) in branch 'master': bpo-37760: Constant-fold some old options in makeunicodedata. (GH-15129) https://github.com/python/cpython/commit/99d208efed97e02d813e8166925b998bbd0d3993
msg349525 - (view)	Author: miss-islington (miss-islington)	Date: 2019-08-13 06:04
New changeset c2b9d9f202e4a99fc0800b7a0f0944ac4c2382e3 by Miss Islington (bot) in branch '3.8': bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171) https://github.com/python/cpython/commit/c2b9d9f202e4a99fc0800b7a0f0944ac4c2382e3
msg349526 - (view)	Author: miss-islington (miss-islington)	Date: 2019-08-13 06:06
New changeset b02e148a0d6e7a11df93a09ea5f4e1b0ad9b77b8 by Miss Islington (bot) in branch '3.8': bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171) https://github.com/python/cpython/commit/b02e148a0d6e7a11df93a09ea5f4e1b0ad9b77b8
msg349548 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-08-13 12:33
$ find -name "db.h" ./Modules/unicodedata_db.h ./Modules/unicodename_db.h ./Objects/unicodetype_db.h For .gitattributes, I would prefer to use unicode_db.h pattern rather than "a generic" *_db.h, to avoid masking other potential file called "something_db.h" in the future in GitHub. But it's not really an issue right now ;-)
msg349549 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-08-13 12:35
"from typing import " I like to run pyflakes time to time on the Python code base. Please avoid "import " since it prevents pyflakes (and other code analyzers) to find bugs. Example: $ pyflakes Tools/unicode/makeunicodedata.py Tools/unicode/makeunicodedata.py:35: 'from typing import *' used; unable to detect undefined names Tools/unicode/makeunicodedata.py:921: 'Iterator' may be undefined, or defined from star imports: typing Tools/unicode/makeunicodedata.py:921: 'List' may be undefined, or defined from star imports: typing Tools/unicode/makeunicodedata.py:921: 'Iterator' may be undefined, or defined from star imports: typing ...
msg349550 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2019-08-13 12:49
BTW: Since when do we use type annotations in Python's stdlib ?
msg349588 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-08-13 17:33
> I like to run pyflakes time to time on the Python code base. Please avoid "import *" since it prevents pyflakes (and other code analyzers) to find bugs. Ah fair enough, thanks! Pushed that change to the next/current PR, GH-15248.
msg349592 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-08-13 17:46
> BTW: Since when do we use type annotations in Python's stdlib ? Hmm, interesting question! At a quick grep, it's in a handful of places in the stdlib: asyncio, functools, importlib. The earliest it appeared was in 3.7.0a4. It's in more places in the test suite, which I think is a closer parallel to this maintainer script in Tools/. The typing module itself is in the stdlib, so I don't see any obstacle to using it more widely. I imagine the main reason it doesn't appear more widely already is simply that it's new, and most of the stdlib is quite stable.
msg349599 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-08-13 18:22
What is the minimal Python version for developing CPython? The system Python 3 on current Ubuntu LTS (18.04) is 3.6, so I think it should not be larger.
msg349602 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-08-13 18:48
> What is the minimal Python version for developing CPython? The system Python 3 on current Ubuntu LTS (18.04) is 3.6, so I think it should not be larger. Ah, I think my previous message had an ambiguous parse: the earliest that uses of the typing module appeared in the stdlib was 3.7. The typing module has been around longer than that. I just checked and `python3.6 Tools/unicode/makeunicodedata.py` works fine, both at master and with GH-15248. I think it would be OK for doing development on CPython to require the latest minor version (i.e. 3.7) -- after all, if you're doing development, you're already building it, so you can always get a newer version than your system provides if needed. But happily the question is moot here, so I guess the place to discuss that further would be a new thread.
msg349603 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-08-13 18:56
> I just checked and `python3.6 Tools/unicode/makeunicodedata.py` works fine, both at master and with GH-15248. This is good. But the title mentioned dataclasses, and they are 3.7+.
msg349613 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-08-13 20:02
> This is good. But the title mentioned dataclasses, and they are 3.7+. Ahh, sorry, I think now I understand you. :-) Indeed, when I switch to the branch with that change (https://github.com/gnprice/cpython/commit/2b4aec4dd -- it comes after the patch that's GH-15248, so I haven't yet sent it as a PR), then `python3.6 Tools/unicode/makeunicodedata.py` no longer works. I think this is fine. Most of all that's because this always works: ./python Tools/unicode/makeunicodedata.py Anyone who's going to be running that script will want to build a `./python` right afterward, in order to at least run the tests. So it doesn't seem like much trouble to do the build first and then run the script (and then a quick rebuild for the handful of changed files), if indeed the person doesn't already have a `./python` lying around. In fact `./python` is exactly what I used most of the time to run this script when I was developing these changes, simply because it seemed like the natural thing to do.
msg349630 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2019-08-14 02:27
From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.
msg349631 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2019-08-14 02:28
New changeset c03e698c344dfc557555b6b07a3ee2702e45f6ee by Benjamin Peterson (Greg Price) in branch 'master': bpo-37760: Factor out standard range-expanding logic in makeunicodedata. (GH-15248) https://github.com/python/cpython/commit/c03e698c344dfc557555b6b07a3ee2702e45f6ee
msg349641 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-08-14 04:55
> From my perspective, the main problem with using type annotations is that there's nothing checking them in CI. Yeah, fair concern. In fact I think I'm on video (from PyCon 2018) warning everyone not to do that in their codebases, because what you really don't want is a bunch of annotations that have gradually accumulated falsehoods as the code has changed around them. Still, I think from "some annotations + no checking" the good equilibrium to land in "some annotations + checking", not "no annotations + no checking". (I do mean "some" -- I don't predict we'll ever go sweep all over adding them.) And I think the highest-probability way to get there is to let them continue to accumulate where people occasionally add them in new/revised code... because that holds a door open for someone to step up to start checking them, and then to do the work to make that part of CI. (That someone might even be me! But I can think of plenty of other likely folks to do it.) If we accumulated quite a lot of them and nobody had yet stepped up to make checking happen, I'd worry. But with the smattering we currently have, I think that point is far off.
msg349671 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-08-14 10:25
> From my perspective, the main problem with using type annotations is that there's nothing checking them in CI. Even if unchecked, type annotations can serve as builtin documentation, as docstrings (even when docstrings are not checked ;-)). Well, I don't have a strong opinion on type annotations in the Python stdlib. But here we are talking about a tool which is not installed, so I don't think that it matters much. It's not really part of the "stdlib".
msg349783 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2019-08-15 01:18
New changeset 3e4498d35c34aeaf4a9c3d57509b0d3277048ac6 by Benjamin Peterson (Greg Price) in branch 'master': bpo-37760: Avoid cluttering work tree with downloaded Unicode files. (GH-15128) https://github.com/python/cpython/commit/3e4498d35c34aeaf4a9c3d57509b0d3277048ac6
msg349785 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2019-08-15 01:24
On Wed, Aug 14, 2019, at 03:25, STINNER Victor wrote: > > STINNER Victor <vstinner@redhat.com> added the comment: > > > From my perspective, the main problem with using type annotations is that there's nothing checking them in CI. > > Even if unchecked, type annotations can serve as builtin documentation, > as docstrings (even when docstrings are not checked ;-)). Yes, but no one has the expectation that docstrings are automatically verified in any way. :)
msg350053 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-08-21 05:48
(A bit easy to miss in the way this thread gets displayed, so to highlight in a comment: GH-15265 is up, following the 5 other patches which have now all been merged. That's the one that replaces the length-18 tuples with a dataclass.)
msg352068 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2019-09-12 09:23
New changeset a65678c5c90002c5e40fa82746de07e6217df625 by Benjamin Peterson (Greg Price) in branch 'master': bpo-37760: Convert from length-18 lists to a dataclass, in makeunicodedata. (GH-15265) https://github.com/python/cpython/commit/a65678c5c90002c5e40fa82746de07e6217df625
msg352698 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-09-18 06:11
Thanks Benjamin for reviewing and merging this series!

History
Date	User	Action	Args
2022-04-11 14:59:18	admin	set	github: 81941
2019-09-18 06:11:56	Greg Price	set	messages: + msg352698
2019-09-12 16:11:12	vstinner	set	nosy: - vstinner
2019-09-12 09:24:26	benjamin.peterson	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2019-09-12 09:23:46	benjamin.peterson	set	messages: + msg352068
2019-08-21 05:48:50	Greg Price	set	messages: + msg350053
2019-08-15 01:24:14	benjamin.peterson	set	messages: + msg349785
2019-08-15 01:18:59	benjamin.peterson	set	messages: + msg349783
2019-08-14 10:25:36	vstinner	set	messages: + msg349671
2019-08-14 05:00:06	Greg Price	set	pull_requests: + pull_request14985
2019-08-14 04:55:24	Greg Price	set	messages: + msg349641
2019-08-14 02:28:45	benjamin.peterson	set	messages: + msg349631
2019-08-14 02:27:14	benjamin.peterson	set	messages: + msg349630
2019-08-13 20:02:05	Greg Price	set	messages: + msg349613
2019-08-13 18:56:02	serhiy.storchaka	set	messages: + msg349603
2019-08-13 18:48:18	Greg Price	set	messages: + msg349602
2019-08-13 18:22:50	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg349599
2019-08-13 17:46:47	Greg Price	set	messages: + msg349592
2019-08-13 17:33:32	Greg Price	set	messages: + msg349588
2019-08-13 12:49:27	lemburg	set	nosy: + lemburg messages: + msg349550
2019-08-13 12:35:15	vstinner	set	messages: + msg349549
2019-08-13 12:33:28	vstinner	set	messages: + msg349548
2019-08-13 07:52:27	Greg Price	set	pull_requests: + pull_request14969
2019-08-13 06:06:19	miss-islington	set	messages: + msg349526
2019-08-13 06:04:24	miss-islington	set	messages: + msg349525
2019-08-13 05:59:33	benjamin.peterson	set	messages: + msg349524
2019-08-13 05:48:10	miss-islington	set	pull_requests: + pull_request14967
2019-08-13 05:23:51	miss-islington	set	pull_requests: + pull_request14966
2019-08-13 05:23:45	miss-islington	set	nosy: + miss-islington messages: + msg349520
2019-08-13 05:21:06	benjamin.peterson	set	messages: + msg349519
2019-08-13 05:10:33	benjamin.peterson	set	nosy: + benjamin.peterson messages: + msg349518
2019-08-08 04:54:07	Greg Price	set	pull_requests: + pull_request14903
2019-08-05 04:25:24	Greg Price	set	messages: + msg349023
2019-08-05 04:07:35	Greg Price	set	pull_requests: + pull_request14870
2019-08-05 04:04:27	Greg Price	set	pull_requests: + pull_request14869
2019-08-05 04:00:50	Greg Price	set	keywords: + patch stage: patch review pull_requests: + pull_request14868
2019-08-05 03:55:35	Greg Price	create