Issue 47243: Duplicate entry in 'Objects/unicodetype_db.h'



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/91399

classification

Title:	Duplicate entry in 'Objects/unicodetype_db.h'
Type:	enhancement	Stage:	patch review
Components:	Unicode	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	LiarPrincess, ezio.melotti, vstinner
Priority:	normal	Keywords:	patch

Created on 2022-04-06 17:23 by LiarPrincess, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 32376	open	LiarPrincess, 2022-04-06 17:28

Messages (2)
msg416889 - (view)	Author: LiarPrincess (LiarPrincess) *	Date: 2022-04-06 17:23
This one is so tiny that I'm not really sure we want to merge it…

=== Problem ===

`Objects/unicodetype_db.h` starts as follows:

```c
/* a list of unique character type descriptors */
const _PyUnicode_TypeRecord _PyUnicode_TypeRecords[] = {
    {0, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0, 32},
    {0, 0, 0, 0, 0, 48},
    …
```

The 1st record (`{0, 0, 0, 0, 0, 0}`) is duplicated. This is not a problem, since the 1st occurrence is never used, but if we wanted to remove it, then this is the ticket for it.

=== Detailed description ===

`Objects/unicodetype_db.h` is generated by `Tools/unicode/makeunicodedata.py` (I removed irrelevant lines):

```py
def makeunicodetype(unicode, trace):
    dummy = (0, 0, 0, 0, 0, 0)
    table = [dummy]     # (1)
    cache = {0: dummy}  # (2)

    for char in unicode.chars:
        # Things…
        item = (upper, lower, title, decimal, digit, flags)
        i = cache.get(item)  # (3)
        if i is None:
            cache[item] = i = len(table)
            table.append(item)
        index[char] = i
```

- (1) - list which contains unique character properties (as `(upper, lower, title, decimal, digit, flags)` tuples)
- (2) - intended as a mapping from character properties to index in `table`, but improperly initialized as a mapping from index to character properties
- (3) - we check if the current tuple is in `cache`

=== Result ===

The first time we get to a character that has `(0, 0, 0, 0, 0, 0)` properties (which is code point 0 - `NULL`), we check if it is in `cache`. It is not (there is an entry that goes from index `0` to `(0, 0, 0, 0, 0, 0)`, which is the other way around), so we add this entry to `table` and `cache` a second time.

=== Fix ===

In line `(2)` we should have: `cache = {dummy: 0}`. Obviously, after doing so we have to re-run `makeunicodedata.py`, which is why this simple change modifies a lot of lines.

I will submit a PR on GitHub in just a sec…
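To make the effect of the initialization easier to see, here is a minimal standalone sketch of the deduplication loop described above. It is not the code from `makeunicodedata.py`: the function name `build_table`, the `fixed` flag, and the sample `records` list are hypothetical and exist only for illustration.

```python
def build_table(records, fixed):
    """Deduplicate property tuples into a table.

    Hypothetical helper: mimics the table/cache logic from the issue,
    with `fixed` selecting the proposed cache initialization.
    Returns (table, index), where index[n] is the table slot of records[n].
    """
    dummy = (0, 0, 0, 0, 0, 0)
    table = [dummy]
    # Original: index -> tuple (wrong direction). Proposed: tuple -> index.
    cache = {dummy: 0} if fixed else {0: dummy}

    index = []
    for item in records:
        i = cache.get(item)
        if i is None:
            # Tuple not seen before: append it and remember its slot.
            cache[item] = i = len(table)
            table.append(item)
        index.append(i)
    return table, index


# Made-up sample data; the first record has the all-zero properties.
records = [
    (0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0, 32),
    (0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0, 48),
]

buggy_table, buggy_index = build_table(records, fixed=False)
fixed_table, fixed_index = build_table(records, fixed=True)

print(buggy_table)  # the all-zero tuple appears twice
print(fixed_table)  # the all-zero tuple appears once
```

With the original initialization the all-zero tuple is appended a second time and slot `0` is never referenced by `index`; with `{dummy: 0}` the first lookup hits the cache and slot `0` is reused.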
msg416892 - (view)	Author: LiarPrincess (LiarPrincess) *	Date: 2022-04-06 17:48
CLA is signed, but there is this 'it might take a few days before your tracker profile is updated' notice. Added version 3.11 (the issue is also present in previous versions, but there is no point in back-porting it). GitHub: https://github.com/python/cpython/pull/32376

History
Date	User	Action	Args
2022-04-11 14:59:58	admin	set	github: 91399
2022-04-06 17:48:43	LiarPrincess	set	messages: + msg416892 versions: + Python 3.11
2022-04-06 17:28:51	LiarPrincess	set	keywords: + patch stage: patch review pull_requests: + pull_request30419
2022-04-06 17:23:31	LiarPrincess	create