Title: unicodedata checksum-tests only test 1/17th of Unicode's codepoints
Type: enhancement Stage: resolved
Components: Tests Versions: Python 3.9
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Greg Price, benjamin.peterson, twouters, vstinner
Priority: normal Keywords: patch

Created on 2019-08-05 01:06 by Greg Price, last changed 2019-09-12 09:25 by benjamin.peterson. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 15125 merged Greg Price, 2019-08-05 01:09
PR 15126 merged Greg Price, 2019-08-05 01:15
PR 15302 merged Greg Price, 2019-08-15 04:52
Messages (5)
msg349014 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-05 01:06
The unicodedata module has two test cases which run through the database and make a hash of its visible outputs for all codepoints, comparing the hash against a checksum.  These are helpful regression tests for making sure the behavior isn't changed by patches that didn't intend to change it.

But Unicode has grown since Python first gained support for it, when Unicode itself was still rather new.  These test cases were added in commit 6a20ee7de back in 2000, and they haven't needed to change much since then... but they should be changed to look beyond the Basic Multilingual Plane (`range(0x10000)`) and cover all 17 planes of Unicode's final form.

Spotted in discussion on GH-15019.  I have a patch for this, which I'll send shortly.
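
To make the gap concrete, here is a minimal sketch of a checksum-style test loop. It is not the code in test_unicodedata: the helper name `checksum`, the choice of SHA-1, and the particular properties hashed are assumptions made for illustration. It shows only how the range bound decides how much of Unicode gets covered.

```python
import hashlib
import sys
import unicodedata

BMP_SIZE = 0x10000                    # plane 0 only: range(0x10000)
ALL_CODEPOINTS = sys.maxunicode + 1   # 0x110000, i.e. all 17 planes


def checksum(num_codepoints):
    """Hash a few unicodedata properties for codepoints 0..num_codepoints-1.

    Hypothetical helper: the properties and hash used here are
    illustrative, not the ones the real test uses.
    """
    h = hashlib.sha1()
    for i in range(num_codepoints):
        char = chr(i)
        data = (
            unicodedata.category(char),
            unicodedata.bidirectional(char),
            str(unicodedata.combining(char)),
            str(unicodedata.mirrored(char)),
            unicodedata.name(char, ""),   # unnamed codepoints contribute ""
        )
        h.update("".join(data).encode("ascii"))
    return h.hexdigest()


# The old bound covers exactly one plane out of 17.
assert ALL_CODEPOINTS == 17 * BMP_SIZE
```

With a loop of this shape, `checksum(BMP_SIZE)` never looks at anything above U+FFFF, so a regression in the supplementary planes would go unnoticed. `checksum(ALL_CODEPOINTS)` covers the whole range, at the cost of iterating 17 times as many codepoints.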
msg349016 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-05 01:22
Sent two small PRs!

The first one, GH-15125, makes the substantive test change I described above.

The second one, GH-15126, is a small pure refactor to that test file, just cleaning out some bits that made sense when it was first written (as a script) but are confusing now that it's a `unittest` test module.  Took me a couple of minutes to sort those out when I first dug into this file, and I figure it'd be kind to the next person to save them the same effort.
msg349523 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-13 05:58
New changeset def97c988be8340f33869b57942a30d10fc3a1f9 by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37758: Clean out vestigial script-bits from test_unicodedata. (GH-15126)
msg351499 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2019-09-09 15:20
New changeset 3cbc23aa229bc5ec04845053df78eae5f54e0497 by T. Wouters (Greg Price) in branch 'master':
bpo-37758: Cut always-constant conditionals on sys.maxunicode. (GH-15302)
msg352069 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-09-12 09:25
New changeset 6954be815a16fad11d1d66be576865bbbeb2b97d by Benjamin Peterson (Greg Price) in branch 'master':
closes bpo-37758: Extend unicodedata checksum tests to cover all of Unicode. (GH-15125)
Date | User | Action | Args
2019-09-12 09:25:28 | benjamin.peterson | set | status: open -> closed; resolution: fixed; messages: + msg352069; stage: patch review -> resolved
2019-09-09 15:20:43 | twouters | set | nosy: + twouters; messages: + msg351499
2019-08-15 04:52:54 | Greg Price | set | pull_requests: + pull_request15027
2019-08-15 04:12:13 | Greg Price | set | nosy: + vstinner
2019-08-13 05:58:04 | benjamin.peterson | set | nosy: + benjamin.peterson; messages: + msg349523
2019-08-05 01:22:25 | Greg Price | set | messages: + msg349016
2019-08-05 01:15:10 | Greg Price | set | pull_requests: + pull_request14866
2019-08-05 01:09:02 | Greg Price | set | keywords: + patch; stage: patch review; pull_requests: + pull_request14865
2019-08-05 01:06:02 | Greg Price | create |