Refactor makeunicodedata.py: dedupe parsing, use dataclass #81941

gnprice · 2019-08-05T03:55:36Z

BPO	37760
Nosy	@malemburg, @benjaminp, @ezio-melotti, @serhiy-storchaka, @gnprice, @miss-islington
PRs	bpo-37760: Avoid cluttering work tree with downloaded Unicode files. #15128 bpo-37760: Constant-fold some old options in makeunicodedata. #15129 bpo-37760: Factor out the basic UCD parsing logic of makeunicodedata. #15130 bpo-37760: Mark all generated Unicode data headers as generated. #15171 [3.8] bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171) #15245 [3.8] bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171) #15246 bpo-37760: Factor out standard range-expanding logic in makeunicodedata. #15248 bpo-37760: Convert from length-18 lists to a dataclass, in makeunicodedata. #15265

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2019-09-12.09:24:26.122>
created_at = <Date 2019-08-05.03:55:35.555>
labels = ['type-feature', '3.9', 'expert-unicode']
title = 'Refactor makeunicodedata.py: dedupe parsing, use dataclass'
updated_at = <Date 2019-09-18.06:11:56.188>
user = 'https://github.com/gnprice'

bugs.python.org fields:

activity = <Date 2019-09-18.06:11:56.188>
actor = 'Greg Price'
assignee = 'none'
closed = True
closed_date = <Date 2019-09-12.09:24:26.122>
closer = 'benjamin.peterson'
components = ['Unicode']
creation = <Date 2019-08-05.03:55:35.555>
creator = 'Greg Price'
dependencies = []
files = []
hgrepos = []
issue_num = 37760
keywords = ['patch']
message_count = 26.0
messages = ['349020', '349023', '349518', '349519', '349520', '349524', '349525', '349526', '349548', '349549', '349550', '349588', '349592', '349599', '349602', '349603', '349613', '349630', '349631', '349641', '349671', '349783', '349785', '350053', '352068', '352698']
nosy_count = 6.0
nosy_names = ['lemburg', 'benjamin.peterson', 'ezio.melotti', 'serhiy.storchaka', 'Greg Price', 'miss-islington']
pr_nums = ['15128', '15129', '15130', '15171', '15245', '15246', '15248', '15265']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue37760'
versions = ['Python 3.9']

gnprice · 2019-08-05T03:55:35Z

I spent some time yesterday on bpo-18236, and I have a patch for it.

Most of that work happens in the script Tools/unicode/makeunicode.py , and along the way I made several changes there that I found made it somewhat nicer to work on, and I think will help other people reading that script too. I'd like to try to merge those improvements first.

The main changes are:

As the script has grown over the years, it's gained many copies and reimplementations of logic to parse the standard format of the Unicode character database. I factored those out into a single place, which makes the parsing code shorter and the interesting parts stand out more easily.
The main per-character record type in the script's data structures is a length-18 tuple. Using the magic of dataclasses, I converted this so that e.g. the code says record.numeric_value instead of record[8].

There's no radical restructuring or rewrite here; this script has served us well. I've kept these changes focused where there's a high ratio of value, in future ease of development, to cost, in a reviewer's effort as well as mine.

I'll send PRs of my changes shortly.

gnprice · 2019-08-05T04:25:24Z

Just posted three PRs:

inspect raises exception when frames are misleading about source line numbers #59333 and file.readline() cannot read weird ascii character in file #59334 are both quite small
remove redundant paragraph in socket howto #59335 is the first of two patches factoring out common parsing logic.

Two remaining patches go on top of #59335. Here are drafts, in case they're helpful for reference:

Patch 2/2 factoring out common parsing logic: gnprice@0a32a7111
Patch converting the big tuples to a dataclass: gnprice@6d8103bbc

I figure they may be easiest to review after the PR they depend on is merged, so my plan is to send PRs for them each in turn.

benjaminp · 2019-08-13T05:10:34Z

Thanks for working this. In your interested in doing some more hacking on Unicode data, there's bpo-32771.

benjaminp · 2019-08-13T05:21:07Z

New changeset ef2af1a by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37760: Factor out the basic UCD parsing logic of makeunicodedata. (GH-15130)
ef2af1a

miss-islington · 2019-08-13T05:23:45Z

New changeset 4e3dfcc by Miss Islington (bot) (Greg Price) in branch 'master':
bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171)
4e3dfcc

benjaminp · 2019-08-13T05:59:34Z

New changeset 99d208e by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37760: Constant-fold some old options in makeunicodedata. (GH-15129)
99d208e

miss-islington · 2019-08-13T06:04:24Z

New changeset c2b9d9f by Miss Islington (bot) in branch '3.8':
bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171)
c2b9d9f

miss-islington · 2019-08-13T06:06:20Z

New changeset b02e148 by Miss Islington (bot) in branch '3.8':
bpo-37760: Mark all generated Unicode data headers as generated. (GH-15171)
b02e148

vstinner · 2019-08-13T12:33:28Z

$ find -name "*db.h"
./Modules/unicodedata_db.h
./Modules/unicodename_db.h
./Objects/unicodetype_db.h

For .gitattributes, I would prefer to use unicode*_db.h pattern rather than "a generic" *_db.h, to avoid masking other potential file called "something_db.h" in the future in GitHub. But it's not really an issue right now ;-)

vstinner · 2019-08-13T12:35:15Z

"from typing import *"

I like to run pyflakes time to time on the Python code base. Please avoid "import *" since it prevents pyflakes (and other code analyzers) to find bugs. Example:

$ pyflakes Tools/unicode/makeunicodedata.py
Tools/unicode/makeunicodedata.py:35: 'from typing import *' used; unable to detect undefined names
Tools/unicode/makeunicodedata.py:921: 'Iterator' may be undefined, or defined from star imports: typing
Tools/unicode/makeunicodedata.py:921: 'List' may be undefined, or defined from star imports: typing
Tools/unicode/makeunicodedata.py:921: 'Iterator' may be undefined, or defined from star imports: typing
...

malemburg · 2019-08-13T12:49:27Z

BTW: Since when do we use type annotations in Python's stdlib ?

gnprice · 2019-08-13T17:33:32Z

I like to run pyflakes time to time on the Python code base. Please avoid "import *" since it prevents pyflakes (and other code analyzers) to find bugs.

Ah fair enough, thanks!

Pushed that change to the next/current PR, #59453.

gnprice · 2019-08-13T17:46:47Z

BTW: Since when do we use type annotations in Python's stdlib ?

Hmm, interesting question!

At a quick grep, it's in a handful of places in the stdlib: asyncio, functools, importlib. The earliest it appeared was in 3.7.0a4.

It's in more places in the test suite, which I think is a closer parallel to this maintainer script in Tools/.

The typing module itself is in the stdlib, so I don't see any obstacle to using it more widely. I imagine the main reason it doesn't appear more widely already is simply that it's new, and most of the stdlib is quite stable.

serhiy-storchaka · 2019-08-13T18:22:50Z

What is the minimal Python version for developing CPython? The system Python 3 on current Ubuntu LTS (18.04) is 3.6, so I think it should not be larger.

gnprice · 2019-08-13T18:48:18Z

What is the minimal Python version for developing CPython? The system Python 3 on current Ubuntu LTS (18.04) is 3.6, so I think it should not be larger.

Ah, I think my previous message had an ambiguous parse: the earliest that *uses* of the typing module appeared in the stdlib was 3.7. The typing module has been around longer than that.

I just checked and python3.6 [Tools/unicode/makeunicodedata.py](https://github.com/python/cpython/blob/main/Tools/unicode/makeunicodedata.py) works fine, both at master and with #59453.

I think it would be OK for doing development on CPython to require the latest minor version (i.e. 3.7) -- after all, if you're doing development, you're already building it, so you can always get a newer version than your system provides if needed. But happily the question is moot here, so I guess the place to discuss that further would be a new thread.

serhiy-storchaka · 2019-08-13T18:56:03Z

I just checked and python3.6 [Tools/unicode/makeunicodedata.py](https://github.com/python/cpython/blob/main/Tools/unicode/makeunicodedata.py) works fine, both at master and with #59453.

This is good. But the title mentioned dataclasses, and they are 3.7+.

gnprice · 2019-08-13T20:02:05Z

This is good. But the title mentioned dataclasses, and they are 3.7+.

Ahh, sorry, I think now I understand you. :-)

Indeed, when I switch to the branch with that change (gnprice@2b4aec4dd -- it comes after the patch that's #59453, so I haven't yet sent it as a PR), then python3.6 [Tools/unicode/makeunicodedata.py](https://github.com/python/cpython/blob/main/Tools/unicode/makeunicodedata.py) no longer works.

I think this is fine. Most of all that's because this always works:

./python [Tools/unicode/makeunicodedata.py](https://github.com/python/cpython/blob/main/Tools/unicode/makeunicodedata.py)

Anyone who's going to be running that script will want to build a ./python right afterward, in order to at least run the tests. So it doesn't seem like much trouble to do the build first and then run the script (and then a quick rebuild for the handful of changed files), if indeed the person doesn't already have a ./python lying around.

In fact ./python is exactly what I used most of the time to run this script when I was developing these changes, simply because it seemed like the natural thing to do.

benjaminp · 2019-08-14T02:27:15Z

From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.

benjaminp · 2019-08-14T02:28:45Z

New changeset c03e698 by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37760: Factor out standard range-expanding logic in makeunicodedata. (GH-15248)
c03e698

gnprice · 2019-08-14T04:55:24Z

From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.

Yeah, fair concern. In fact I think I'm on video (from PyCon 2018) warning everyone not to do that in their codebases, because what you really don't want is a bunch of annotations that have gradually accumulated falsehoods as the code has changed around them.

Still, I think from "some annotations + no checking" the good equilibrium to land in "some annotations + checking", not "no annotations + no checking". (I do mean "some" -- I don't predict we'll ever go sweep all over adding them.) And I think the highest-probability way to get there is to let them continue to accumulate where people occasionally add them in new/revised code... because that holds a door open for someone to step up to start checking them, and then to do the work to make that part of CI. (That someone might even be me! But I can think of plenty of other likely folks to do it.)

If we accumulated quite a lot of them and nobody had yet stepped up to make checking happen, I'd worry. But with the smattering we currently have, I think that point is far off.

vstinner · 2019-08-14T10:25:37Z

From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.

Even if unchecked, type annotations can serve as builtin documentation, as docstrings (even when docstrings are not checked ;-)).

Well, I don't have a strong opinion on type annotations in the Python stdlib. But here we are talking about a tool which is not installed, so I don't think that it matters much. It's not really part of the "stdlib".

benjaminp · 2019-08-15T01:19:00Z

New changeset 3e4498d by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37760: Avoid cluttering work tree with downloaded Unicode files. (GH-15128)
3e4498d

benjaminp · 2019-08-15T01:24:14Z

On Wed, Aug 14, 2019, at 03:25, STINNER Victor wrote:

STINNER Victor <vstinner@redhat.com> added the comment:

> From my perspective, the main problem with using type annotations is that there's nothing checking them in CI.

Even if unchecked, type annotations can serve as builtin documentation,
as docstrings (even when docstrings are not checked ;-)).

Yes, but no one has the expectation that docstrings are automatically verified in any way. :)

gnprice · 2019-08-21T05:48:50Z

(A bit easy to miss in the way this thread gets displayed, so to highlight in a comment: #59470 is up, following the 5 other patches which have now all been merged. That's the one that replaces the length-18 tuples with a dataclass.)

benjaminp · 2019-09-12T09:23:46Z

New changeset a65678c by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37760: Convert from length-18 lists to a dataclass, in makeunicodedata. (GH-15265)
a65678c

gnprice · 2019-09-18T06:11:56Z

Thanks Benjamin for reviewing and merging this series!

gnprice added 3.9 only security fixes topic-unicode type-feature A feature request or enhancement labels Aug 5, 2019

benjaminp closed this as completed Sep 12, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor makeunicodedata.py: dedupe parsing, use dataclass #81941

Refactor makeunicodedata.py: dedupe parsing, use dataclass #81941

gnprice commented Aug 5, 2019

gnprice commented Aug 5, 2019

gnprice commented Aug 5, 2019

benjaminp commented Aug 13, 2019

benjaminp commented Aug 13, 2019

miss-islington commented Aug 13, 2019

benjaminp commented Aug 13, 2019

miss-islington commented Aug 13, 2019

miss-islington commented Aug 13, 2019

vstinner commented Aug 13, 2019

vstinner commented Aug 13, 2019

malemburg commented Aug 13, 2019

gnprice commented Aug 13, 2019

gnprice commented Aug 13, 2019

serhiy-storchaka commented Aug 13, 2019

gnprice commented Aug 13, 2019

serhiy-storchaka commented Aug 13, 2019

gnprice commented Aug 13, 2019

benjaminp commented Aug 14, 2019

benjaminp commented Aug 14, 2019

gnprice commented Aug 14, 2019

vstinner commented Aug 14, 2019

benjaminp commented Aug 15, 2019

benjaminp commented Aug 15, 2019

gnprice commented Aug 21, 2019

benjaminp commented Sep 12, 2019

gnprice commented Sep 18, 2019

Refactor makeunicodedata.py: dedupe parsing, use dataclass #81941

Refactor makeunicodedata.py: dedupe parsing, use dataclass #81941

Comments

gnprice commented Aug 5, 2019

gnprice commented Aug 5, 2019

gnprice commented Aug 5, 2019

benjaminp commented Aug 13, 2019

benjaminp commented Aug 13, 2019

miss-islington commented Aug 13, 2019

benjaminp commented Aug 13, 2019

miss-islington commented Aug 13, 2019

miss-islington commented Aug 13, 2019

vstinner commented Aug 13, 2019

vstinner commented Aug 13, 2019

malemburg commented Aug 13, 2019

gnprice commented Aug 13, 2019

gnprice commented Aug 13, 2019

serhiy-storchaka commented Aug 13, 2019

gnprice commented Aug 13, 2019

serhiy-storchaka commented Aug 13, 2019

gnprice commented Aug 13, 2019

benjaminp commented Aug 14, 2019

benjaminp commented Aug 14, 2019

gnprice commented Aug 14, 2019

vstinner commented Aug 14, 2019

benjaminp commented Aug 15, 2019

benjaminp commented Aug 15, 2019

gnprice commented Aug 21, 2019

benjaminp commented Sep 12, 2019

gnprice commented Sep 18, 2019