Issue 33317: `repr()` of string in NFC and NFD forms does not differ

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/77498

classification

Title:	`repr()` of string in NFC and NFD forms does not differ
Type:	enhancement	Stage:	resolved
Components:	Interpreter Core, Unicode	Versions:	Python 3.8

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	benjamin.peterson, ezio.melotti, lemburg, pekka.klarck, serhiy.storchaka, vstinner
Priority:	normal	Keywords:

Created on 2018-04-20 07:24 by pekka.klarck, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (7)
msg315504 - (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2018-04-20 07:24
If I have two strings that look the same but have different Unicode forms, it's very hard to see where the problem actually is:

>>> a = 'hyv\xe4'
>>> b = 'hyva\u0308'
>>> print(a)
hyvä
>>> print(b)
hyvä
>>> a == b
False
>>> print(repr(a))
'hyvä'
>>> print(repr(b))
'hyvä'

This affects, for example, test automation frameworks that use `repr()` in error reporting. Both unittest and pytest report `self.assertEqual('hyv\xe4', 'hyva\u0308')` like this:

AssertionError: 'hyvä' != 'hyvä'
- hyvä
+ hyvä

Because the NFC form is used by strings by default, I would propose that `repr()` show the decomposed form if the string is in NFD. In practice I'd like `repr('hyva\u0308')` to yield `'hyva\\u0308'`.
msg315505 - (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2018-04-20 07:32
Forgot to mention that this doesn't affect Python 2:

>>> a = u'hyv\xe4'
>>> b = u'hyva\u0308'
>>> print(repr(a))
u'hyv\xe4'
>>> print(repr(b))
u'hyva\u0308'

In addition to hoping `repr()` will be enhanced in future Python 3 versions, I'm also looking for a way to show differences between strings that look the same but are different. Currently the best I've found is this:

>>> print('hyva\u0308'.encode('unicode_escape').decode('ASCII'))
hyva\u0308
msg315506 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-04-20 07:54
Use ascii() in Python 3 if you want the behavior of repr() in Python 2. It escapes all non-ASCII characters. But escaping only combining characters, in addition to non-printable characters, in repr() looks like an interesting idea.
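As a minimal illustration of the suggestion above, the following sketch applies `ascii()` to the two strings from the original report and compares it with the `unicode_escape` workaround mentioned earlier. The variable names `a`, `b`, and `escaped` are only for illustration.

```python
# The two strings from the report: precomposed (NFC) and decomposed (NFD).
a = 'hyv\xe4'
b = 'hyva\u0308'

# ascii() escapes every non-ASCII character, so the two forms become
# visibly different even though they render identically.
print(ascii(a))  # 'hyv\xe4'
print(ascii(b))  # 'hyva\u0308'

# The unicode_escape codec gives the same escapes, without the quotes.
escaped = b.encode('unicode_escape').decode('ascii')
print(escaped)   # hyva\u0308
```

The trade-off is that `ascii()` also escapes ordinary non-ASCII letters such as `ä`, which makes genuinely non-ASCII text harder to read.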
msg315507 - (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2018-04-20 08:53
Thanks for pointing out `ascii()`. It seems to do exactly what I want. `repr()` showing combining characters would, in my opinion, still be useful to avoid problems like the one I demonstrated with unittest and pytest. I doubt it's a good idea for them to use `ascii()` instead of `repr()` by default, because on Python 3 the latter generally works much better with non-ASCII text.
msg315685 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2018-04-24 03:49
As stated, the bug report is invalid: the repr _does_ differ, it's just not presented that way by however you're viewing the two reprs. Distinct codepoint sequences that look identical under certain circumstances can arise in many different ways with Unicode. repr's humble mission is to produce a Python literal equivalent to its argument, not to produce unambiguous representations of codepoint sequences after font rendering.

Possibly, this could be converted to a unittest RFE, but I'm not sure if there's a good way to detect whether two unicode strings are going to display confusingly similarly.
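Detecting visual similarity in general is an open problem, as the message above notes, but the narrower NFC/NFD case from this report can be checked directly. The sketch below is one possible approach, not anything unittest or pytest actually does; `differ_only_by_normalization` and `describe_mismatch` are hypothetical helper names.

```python
import unicodedata


def differ_only_by_normalization(first, second, form='NFC'):
    """Return True if the strings are unequal but equal once normalized."""
    if first == second:
        return False
    return (unicodedata.normalize(form, first)
            == unicodedata.normalize(form, second))


def describe_mismatch(first, second):
    """Build a failure message; fall back to ascii() for the confusing case."""
    if differ_only_by_normalization(first, second):
        return '%s != %s (same text, different Unicode normalization)' % (
            ascii(first), ascii(second))
    return '%r != %r' % (first, second)


print(describe_mismatch('hyv\xe4', 'hyva\u0308'))
```

This only covers differences that normalization removes. Look-alike characters from different scripts, or the "l", "I", "1" case in some fonts, would not be caught.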
msg315695 - (view)	Author: Pekka Klärck (pekka.klarck)	Date: 2018-04-24 11:33
I didn't submit this as a bug report but as an enhancement request. From a usability point of view, saying that the results differ but you just cannot see the difference is not very helpful.

The exact reason I didn't submit this as an enhancement request for unittest, pytest, and all other affected modules and tools is that "I'm not sure if there's a good way to detect whether two unicode strings are going to display confusingly similarly". Enhancing `repr()` would be a logical solution to this problem.

Finally, would any harm be done if `repr('hyva\u0308')` were changed to `'hyva\\u0308'`? I don't see it being any different from `repr('foo\x00')` being `'foo\\x00'`; in both cases you can `eval()` the result to get the original value back, as `repr()` is supposed to allow when possible. Most importantly, the result would show what the value actually contains, as you generally expect `repr()` to do.
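The proposed behavior can be approximated in user code. The following is a rough sketch only, not how CPython's `repr()` works: `repr_with_escaped_marks` is a hypothetical name, and treating every character whose Unicode general category starts with `M` as a "combining character" is an assumption made for this example.

```python
import ast
import unicodedata


def repr_with_escaped_marks(text):
    """Like repr(), but with combining marks written as \\uXXXX escapes."""
    def escape(ch):
        cp = ord(ch)
        return '\\u%04x' % cp if cp <= 0xFFFF else '\\U%08x' % cp

    # Post-process the normal repr(): backslashes in it are already paired,
    # so replacing a mark with an escape sequence keeps the literal valid.
    return ''.join(
        escape(ch) if unicodedata.category(ch).startswith('M') else ch
        for ch in repr(text)
    )


shown = repr_with_escaped_marks('hyva\u0308')
print(shown)  # 'hyva\u0308'

# The result is still a valid literal that evaluates to the original string.
print(ast.literal_eval(shown) == 'hyva\u0308')  # True
```

The precomposed form `'hyv\xe4'` contains no combining mark, so this function returns the same text as plain `repr()` for it.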
msg315720 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2018-04-25 04:32
On Tue, Apr 24, 2018, at 04:33, Pekka Klärck wrote:
>
> Pekka Klärck <pekka.klarck@gmail.com> added the comment:
>
> I didn't submit this as a bug report but as an enhancement request. From
> usability point of view, saying that results differ but you just cannot
> see the difference is not very helpful.
>
> The exact reason I didn't submit this as an enhancement request for
> unittest, pytest, and all other modules/tools being affected is that
> "I'm not sure if there's a good way to detect whether two unicode
> strings are going to display confusingly similarly". Enhancing `repr()`
> would be a logical solution to this problem.

I should have said "there's no way to unambiguously represent a particular unicode string except as a sequence of integers, which isn't normally what anyone wants to see". This decomposition problem is only one of many. Even in ASCII land, fonts often have very similar glyphs for "l", "I", and "1".
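For completeness, the "sequence of integers" view mentioned above is easy to produce when it is needed for debugging. This is a small sketch; `code_points` is a hypothetical helper name, and the `U+XXXX` formatting is just one common convention.

```python
def code_points(text):
    """Render a string as its code points in U+XXXX notation."""
    return ' '.join('U+%04X' % ord(ch) for ch in text)


print(code_points('hyv\xe4'))     # U+0068 U+0079 U+0076 U+00E4
print(code_points('hyva\u0308'))  # U+0068 U+0079 U+0076 U+0061 U+0308
```

This representation is unambiguous regardless of font or terminal, at the cost of readability.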

History
Date	User	Action	Args
2022-04-11 14:58:59	admin	set	github: 77498
2018-04-25 04:32:03	benjamin.peterson	set	messages: + msg315720
2018-04-24 11:33:16	pekka.klarck	set	messages: + msg315695
2018-04-24 03:49:42	benjamin.peterson	set	status: open -> closed resolution: not a bug messages: + msg315685 stage: resolved
2018-04-20 08:53:49	pekka.klarck	set	messages: + msg315507
2018-04-20 07:54:47	serhiy.storchaka	set	versions: + Python 3.8, - Python 3.4, Python 3.5, Python 3.6 nosy: + serhiy.storchaka, ezio.melotti, lemburg, benjamin.peterson, vstinner messages: + msg315506 components: + Interpreter Core, Unicode type: enhancement
2018-04-20 07:32:27	pekka.klarck	set	messages: + msg315505 versions: + Python 3.4, Python 3.5, Python 3.6
2018-04-20 07:24:42	pekka.klarck	create