
classification
Title: Duplicate entry in 'Objects/unicodetype_db.h'
Type: enhancement Stage: patch review
Components: Unicode Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: LiarPrincess, ezio.melotti, vstinner
Priority: normal Keywords: patch

Created on 2022-04-06 17:23 by LiarPrincess, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 32376 open LiarPrincess, 2022-04-06 17:28
Messages (2)
msg416889 - (view) Author: LiarPrincess (LiarPrincess) * Date: 2022-04-06 17:23
This one is so tiny that I'm not really sure we want to merge it…

=== Problem ===

`Objects/unicodetype_db.h` starts as follows:

```c
/* a list of unique character type descriptors */
const _PyUnicode_TypeRecord _PyUnicode_TypeRecords[] = {
    {0, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0, 32},
    {0, 0, 0, 0, 0, 48},
    …
```

The first record (`{0, 0, 0, 0, 0, 0}`) is duplicated.
This is harmless, since the first occurrence is never used, but if we want to remove it, this is the ticket for it.

=== Detailed description ===

`Objects/unicodetype_db.h` is generated by `Tools/unicode/makeunicodedata.py` (I removed irrelevant lines):

```py
def makeunicodetype(unicode, trace):
    dummy = (0, 0, 0, 0, 0, 0)
    table = [dummy] # (1)
    cache = {0: dummy} # (2)

    for char in unicode.chars:
        # Things…

        item = (upper, lower, title, decimal, digit, flags)

        i = cache.get(item) # (3)
        if i is None:
            cache[item] = i = len(table)
            table.append(item)

        index[char] = i
```

- (1) - list which contains unique character properties (as `(upper, lower, title, decimal, digit, flags)` tuples)
- (2) - mapping from character properties to index in `table` - improperly initialized as a mapping from index to character properties
- (3) - we check if the current tuple is in `cache`

=== Result ===

The first time we get to a character that has `(0, 0, 0, 0, 0, 0)` properties (which is code point 0 - `NULL`), we check whether it is in `cache`. It is not (there is an entry that maps index `0` to `(0, 0, 0, 0, 0, 0)`, which is the other way around), so we add this entry to both `table` and `cache`.
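
A toy reproduction of the above, for illustration only: `items` is a made-up stand-in for the per-character property tuples, and `index` is simplified to a plain list that is appended to.

```python
# Toy reproduction of the dedup loop from makeunicodetype().
# `items` is a hypothetical stand-in for the real per-character tuples.
dummy = (0, 0, 0, 0, 0, 0)
items = [dummy, (0, 0, 0, 0, 0, 32), dummy]

table = [dummy]
cache = {0: dummy}  # inverted mapping, as on line (2)
index = []

for item in items:
    i = cache.get(item)  # looks up a tuple, but the only key so far is the int 0
    if i is None:
        cache[item] = i = len(table)
        table.append(item)
    index.append(i)

print(table)  # dummy shows up at positions 0 and 1
print(index)  # [1, 2, 1]: nothing ever points at table[0]
```

The all-zero tuple ends up in `table` twice, and every lookup resolves to the second copy, which matches the unused first record in the generated header.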

=== Fix ===

On line `(2)` we should have `cache = {dummy: 0}`. After making this change we have to re-run `makeunicodedata.py`, which is why such a simple change modifies a lot of lines.
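
A toy check of the fix, for illustration only: `items` is a made-up stand-in for the per-character property tuples, and `index` is simplified to a plain list.

```python
# Same dedup loop as in makeunicodetype(), with the cache keyed by tuple.
# `items` is a hypothetical stand-in for the real per-character tuples.
dummy = (0, 0, 0, 0, 0, 0)
items = [dummy, (0, 0, 0, 0, 0, 32), dummy]

table = [dummy]
cache = {dummy: 0}  # properties -> index in `table`
index = []

for item in items:
    i = cache.get(item)  # the all-zero tuple is now found at index 0
    if i is None:
        cache[item] = i = len(table)
        table.append(item)
    index.append(i)

print(table)  # dummy appears only once
print(index)  # [0, 1, 0]
```

In this toy run every index is one lower than it would be with the inverted cache. A shift like that in the real tables would be consistent with the large diff in the regenerated file.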

I will submit a PR on GitHub in just a sec…
msg416892 - (view) Author: LiarPrincess (LiarPrincess) * Date: 2022-04-06 17:48
CLA is signed, but there is this 'it might take a few days before your tracker profile is updated'.

Added version 3.11 (the duplicate is also present in previous versions, but there is no point in back-porting the fix).

GitHub: https://github.com/python/cpython/pull/32376
History
Date User Action Args
2022-04-11 14:59:58 admin set github: 91399
2022-04-06 17:48:43 LiarPrincess set messages: + msg416892; versions: + Python 3.11
2022-04-06 17:28:51 LiarPrincess set keywords: + patch; stage: patch review; pull_requests: + pull_request30419
2022-04-06 17:23:31 LiarPrincess create