classification
Title: unicodedata checksum-tests only test 1/17th of Unicode's codepoints
Type: enhancement
Stage: patch review
Components: Tests
Versions: Python 3.9
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Greg Price, benjamin.peterson, vstinner
Priority: normal
Keywords: patch

Created on 2019-08-05 01:06 by Greg Price, last changed 2019-08-15 04:52 by Greg Price.

Pull Requests
URL Status Linked
PR 15125 open Greg Price, 2019-08-05 01:09
PR 15126 merged Greg Price, 2019-08-05 01:15
PR 15302 open Greg Price, 2019-08-15 04:52
Messages (3)
msg349014 - Author: Greg Price (Greg Price) Date: 2019-08-05 01:06
The unicodedata module has two test cases which run through the database and make a hash of its visible outputs for all codepoints, comparing the hash against a checksum.  These are helpful regression tests for making sure the behavior isn't changed by patches that didn't intend to change it.

But Unicode has grown since Python first gained support for it, when Unicode itself was still rather new.  These test cases were added in commit 6a20ee7de back in 2000, and they haven't needed to change much since then... but they still only look at the Basic Multilingual Plane (`range(0x10000)`).  They should be changed to look beyond it and cover all 17 planes of Unicode's final form (`range(0x110000)`).
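
To make the gap concrete, here is a minimal sketch of the idea. It is not the actual code in test_unicodedata: the helper name `checksum_properties`, the constants `BMP` and `ALL_PLANES`, the choice of SHA-1, and the particular set of properties hashed are all assumptions made for illustration. It hashes a handful of `unicodedata` outputs per code point, once over plane 0 only and once over the whole code space.

```python
import hashlib
import sys
import unicodedata


def checksum_properties(codepoints):
    """Hash several unicodedata outputs for each code point (illustrative helper)."""
    h = hashlib.sha1()  # assumption: any stable hash would do for a regression check
    for cp in codepoints:
        char = chr(cp)
        data = [
            unicodedata.category(char),
            unicodedata.bidirectional(char),
            str(unicodedata.combining(char)),
            str(unicodedata.mirrored(char)),
            unicodedata.decomposition(char),
            unicodedata.east_asian_width(char),
            unicodedata.name(char, ""),  # empty default for unnamed code points
        ]
        h.update("".join(data).encode("ascii"))
    return h.hexdigest()


BMP = range(0x10000)                    # plane 0 only
ALL_PLANES = range(sys.maxunicode + 1)  # all 17 planes, 0x110000 code points

bmp_sum = checksum_properties(BMP)
full_sum = checksum_properties(ALL_PLANES)
```

A checksum computed over `BMP` stays the same no matter what happens to data in planes 1 through 16, so a regression there would go unnoticed; only the checksum over `ALL_PLANES` would catch it. The full pass visits 17 times as many code points, so it is somewhat slower.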

Spotted in discussion on GH-15019 (https://github.com/python/cpython/pull/15019#discussion_r308947884 ).  I have a patch for this which I'll send shortly.
msg349016 - Author: Greg Price (Greg Price) Date: 2019-08-05 01:22
Sent two small PRs!

The first one, GH-15125, makes the substantive test change I described above.

The second one, GH-15126, is a small pure refactor to that test file, just cleaning out some bits that made sense when it was first written (as a script) but are confusing now that it's a `unittest` test module.  Took me a couple of minutes to sort those out when I first dug into this file, and I figure it'd be kind to the next person to save them the same effort.
msg349523 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2019-08-13 05:58
New changeset def97c988be8340f33869b57942a30d10fc3a1f9 by Benjamin Peterson (Greg Price) in branch 'master':
bpo-37758: Clean out vestigial script-bits from test_unicodedata. (GH-15126)
https://github.com/python/cpython/commit/def97c988be8340f33869b57942a30d10fc3a1f9
History
Date  User  Action  Args
2019-08-15 04:52:54  Greg Price  set  pull_requests: + pull_request15027
2019-08-15 04:12:13  Greg Price  set  nosy: + vstinner
2019-08-13 05:58:04  benjamin.peterson  set  nosy: + benjamin.peterson; messages: + msg349523
2019-08-05 01:22:25  Greg Price  set  messages: + msg349016
2019-08-05 01:15:10  Greg Price  set  pull_requests: + pull_request14866
2019-08-05 01:09:02  Greg Price  set  keywords: + patch; stage: patch review; pull_requests: + pull_request14865
2019-08-05 01:06:02  Greg Price  create