This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: dataclasses should use NFKC to find duplicate members
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: eric.smith Nosy List: cheryl.sabella, eric.smith, rhettinger, steve.dower, thatiparthy, valer
Priority: low Keywords: patch

Created on 2018-06-16 18:02 by eric.smith, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 7916 closed valer, 2018-06-25 21:58
PR 20837 closed thatiparthy, 2020-06-12 18:49
Messages (8)
msg319766 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2018-06-16 18:02
See issue 33880 for the same issue with namedtuple.

This shows up on dataclasses only through make_dataclass. This is an expected ValueError:

>>> make_dataclass('a', ['a', 'b', 'c', 'a'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 1123, in make_dataclass
    raise TypeError(f'Field name duplicated: {name!r}')
TypeError: Field name duplicated: 'a'

But this is a SyntaxError:

>>> make_dataclass('a', ['\u00b5', '\u03bc'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 1133, in make_dataclass
    unsafe_hash=unsafe_hash, frozen=frozen)
  File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 958, in dataclass
    return wrap(_cls)
  File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 950, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
  File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 871, in _process_class
    else 'self',
  File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 490, in _init_fn
    return_type=None)
  File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 356, in _create_fn
    exec(txt, globals, locals)
  File "<string>", line 1
SyntaxError: duplicate argument 'μ' in function definition
msg320446 - (view) Author: Valeriya Sinevich (valer) * Date: 2018-06-25 22:01
I sent a PR and measured how it affected the performance by creating a dataclass with 10000 members.
python.bat -m timeit -s "from dataclasses import make_dataclass" -s "arg_list = [chr(k) * i for k in range(97, 123) for i in range(1, 500)]" "make_dataclass('a', arg_list)" 
The performance didn't change, both before and after the changes it takes around 14.5s.
msg320454 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-06-25 23:21
The benchmark may not be triggering that much work. NFKC normalization only applies for characters outside of the basic Latin characters (0-255).

I ran the below benchmarks and saw a huge difference. Granted, it's a very degenerate case with collections this big, but it appears to be linear with len(NAMES), suggesting that the normalization is the expensive part.

>>> CHRS=[c for c in (chr(i) for i in range(65535)) if c.isidentifier()]
>>> def makename():
...  return ''.join(random.choice(CHRS) for _ in range(10))
...
>>> NAMES = [makename() for _ in range(10000)]
>>> timeit.timeit('len(set(NAMES))', globals=globals(), number=100000)
38.04007526000004
>>> timeit.timeit('len(set(unicodedata.normalize("NFKC", n) for n in NAMES))', globals=globals(), number=100000)
820.2586788580002

I wonder if it's better to catch the SyntaxError and do the check there? That way we don't really have a performance impact, since it's only going to show up in exceptional cases anyway.
msg320455 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-06-25 23:23
> outside of the basic Latin characters (0-255)

This should be (0-127). The Latin characters in 128-255 are considered extended ones, and may be decomposed by normalization.
msg371356 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2020-06-12 12:00
The pull request that was opened for this has been closed due to inactivity, so this issue is available for anyone to work on.
msg371424 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-06-12 22:19
I recommend leaving the code as-is.  AFAICT, the situation almost never arises in practice, and if it did, the existing SyntaxError is clear.

Given that the rest of that language uses a SyntaxError, there isn't a net benefit for switching to TypeError:

>>> def f(µ=1, μ=2):
    	    pass
    	
SyntaxError: duplicate argument 'μ' in function definition
>>> def f(**kwds):
	return kwds

>>> f(µ=1, μ=2)
SyntaxError: keyword argument repeated
msg371435 - (view) Author: Srinivas Reddy Thatiparthy(శ్రీనివాస్ రెడ్డి తాటిపర్తి) (thatiparthy) * Date: 2020-06-13 04:28
I am convinced by Raymond’s argument. Hence closing the PR.
msg371463 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-06-13 16:42
I agree with Raymond. There's no real harm being caused here.
History
Date User Action Args
2022-04-11 14:59:01adminsetgithub: 78062
2020-06-13 16:42:00eric.smithsetstatus: open -> closed
resolution: wont fix
messages: + msg371463

stage: patch review -> resolved
2020-06-13 04:28:47thatiparthysetmessages: + msg371435
2020-06-12 22:19:45rhettingersetnosy: + rhettinger
messages: + msg371424
2020-06-12 18:49:29thatiparthysetnosy: + thatiparthy
pull_requests: + pull_request20031
2020-06-12 12:00:51cheryl.sabellasetnosy: + cheryl.sabella

messages: + msg371356
versions: + Python 3.8, Python 3.9, Python 3.10, - Python 3.7
2018-06-25 23:23:09steve.dowersetmessages: + msg320455
2018-06-25 23:21:33steve.dowersetmessages: + msg320454
2018-06-25 22:01:48valersetnosy: + valer, steve.dower
messages: + msg320446
2018-06-25 21:58:52valersetkeywords: + patch
stage: patch review
pull_requests: + pull_request7519
2018-06-16 19:08:50eric.smithsettitle: dataclasses should use NFKD to find duplicate members -> dataclasses should use NFKC to find duplicate members
2018-06-16 18:02:08eric.smithcreate