classification
Title: `repr()` of string in NFC and NFD forms does not differ
Type: enhancement
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.8
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, ezio.melotti, lemburg, pekka.klarck, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2018-04-20 07:24 by pekka.klarck, last changed 2018-04-25 04:32 by benjamin.peterson. This issue is now closed.

Messages (7)
msg315504 - Author: Pekka Klärck (pekka.klarck) Date: 2018-04-20 07:24
If I have two strings that look the same but are in different Unicode normalization forms, it's very hard to see where the problem actually is:

>>> a = 'hyv\xe4'
>>> b = 'hyva\u0308'
>>> print(a)
hyvä
>>> print(b)
hyvä
>>> a == b
False
>>> print(repr(a))
'hyvä'
>>> print(repr(b))
'hyvä'

This affects, for example, test automation frameworks using `repr()` in error reporting. For example, both unittest and pytest report `self.assertEqual('hyv\xe4', 'hyva\u0308')` like this:

AssertionError: 'hyvä' != 'hyvä'
- hyvä
+ hyvä

Because the NFC form is used by strings by default, I propose that `repr()` show the decomposed form if the string is in NFD. In practice I'd like `repr('hyva\u0308')` to yield `'hyva\\u0308'`.
msg315505 - Author: Pekka Klärck (pekka.klarck) Date: 2018-04-20 07:32
Forgot to mention that this doesn't affect Python 2:

>>> a = u'hyv\xe4'
>>> b = u'hyva\u0308'
>>> print(repr(a))
u'hyv\xe4'
>>> print(repr(b))
u'hyva\u0308'


In addition to hoping `repr()` will be enhanced in future Python 3 versions, I'm also looking for a way to show differences between strings that look the same but are different. Currently the best I've found is this:

>>> print('hyva\u0308'.encode('unicode_escape').decode('ASCII'))
hyva\u0308
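
The workaround above can be wrapped in a small helper. This is only a sketch: `escape_non_ascii` is a hypothetical name, not a standard library function, and the `unicode_escape` codec also escapes backslashes and control characters such as newlines.

```python
def escape_non_ascii(text):
    """Return text with every non-ASCII character written as a backslash escape.

    Hypothetical helper wrapping the encode/decode workaround from this message.
    """
    # 'unicode_escape' produces ASCII-only bytes, so decoding as ASCII is safe.
    return text.encode('unicode_escape').decode('ascii')


nfc = 'hyv\xe4'      # precomposed U+00E4
nfd = 'hyva\u0308'   # 'a' followed by combining diaeresis U+0308

print(escape_non_ascii(nfc))  # hyv\xe4
print(escape_non_ascii(nfd))  # hyva\u0308
```

With this, the two strings that print identically become visibly different in a report.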
msg315506 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-04-20 07:54
Use ascii() in Python 3 if you want the behavior of repr() in Python 2. It escapes all non-ascii characters.

But escaping only combining characters, in addition to non-printable characters, in repr() looks like an interesting idea.
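
To illustrate the difference between the two approaches, here is a rough sketch. `ascii()` is the real built-in; `repr_escaping_combining` is a hypothetical helper written for this illustration, not an existing or proposed CPython API. It assumes that "combining character" means a nonzero canonical combining class as reported by `unicodedata.combining()`, which misses some marks whose class is 0.

```python
import unicodedata


def repr_escaping_combining(text):
    """Like repr(), but also escape characters with a nonzero combining class.

    Hypothetical helper; only an approximation of the idea discussed here.
    """
    pieces = []
    # Post-process the normal repr: its existing escapes are pure ASCII,
    # so only combining characters left unescaped by repr() are rewritten.
    for ch in repr(text):
        if unicodedata.combining(ch):
            code = ord(ch)
            pieces.append('\\u%04x' % code if code <= 0xFFFF else '\\U%08x' % code)
        else:
            pieces.append(ch)
    return ''.join(pieces)


print(ascii('hyv\xe4'))                        # 'hyv\xe4'
print(ascii('hyva\u0308'))                     # 'hyva\u0308'
print(repr_escaping_combining('hyva\u0308'))   # 'hyva\u0308'
```

For the precomposed string, the helper returns the same text as plain `repr()`, since U+00E4 has combining class 0.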
msg315507 - Author: Pekka Klärck (pekka.klarck) Date: 2018-04-20 08:53
Thanks for pointing out `ascii()`. Seems to do exactly what I want.

`repr()` showing combining characters would, in my opinion, still be useful to avoid problems like the one I demonstrated with unittest and pytest. I doubt it's a good idea for them to use `ascii()` instead of `repr()` by default, because on Python 3 the latter generally works much better with non-ASCII text.
msg315685 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-04-24 03:49
As stated, the bug report is invalid: the repr _does_ differ; it's just not presented that way by however you're viewing the two reprs. Distinct codepoint sequences that look identical under certain circumstances can arise in many different ways with Unicode. repr's humble mission is to produce a Python literal equivalent to its argument, not to produce unambiguous representations of codepoint sequences after font rendering.

Possibly, this could be converted to a unittest RFE, but I'm not sure if there's a good way to detect whether two unicode strings are going to display confusingly similarly.
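
One narrow case that a test framework could detect is two strings that are unequal but canonically equivalent. The sketch below assumes that case only; `differ_only_by_normalization` is a hypothetical name, and the check does nothing for other lookalikes (similar glyphs, homoglyphs from different scripts).

```python
import unicodedata


def differ_only_by_normalization(first, second, form='NFC'):
    """Return True if the strings are unequal but equal after normalization.

    Hypothetical helper; covers normalization differences only.
    """
    if first == second:
        return False
    return unicodedata.normalize(form, first) == unicodedata.normalize(form, second)


print(differ_only_by_normalization('hyv\xe4', 'hyva\u0308'))  # True
print(differ_only_by_normalization('hyv\xe4', 'hyva'))        # False
```

A reporter could fall back to `ascii()` output for both values whenever this returns True.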
msg315695 - Author: Pekka Klärck (pekka.klarck) Date: 2018-04-24 11:33
I didn't submit this as a bug report but as an enhancement request. From a usability point of view, saying that the results differ but you just cannot see the difference is not very helpful.

The exact reason I didn't submit this as an enhancement request for unittest, pytest, and all other modules/tools being affected is that "I'm not sure if there's a good way to detect whether two unicode strings are going to display confusingly similarly". Enhancing `repr()` would be a logical solution to this problem.

Finally, would any harm be done if `repr('hyva\u0308')` were changed to `'hyva\\u0308'`? I don't see it being any different from `repr('foo\x00')` being `'foo\\x00'`; in both cases you can `eval()` the result to get the original value back, as `repr()` is supposed to allow when possible. Most importantly, the result would show what the value actually contains, as you generally expect `repr()` to do.
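
The round-trip claim can be checked with a short sketch. It uses `ast.literal_eval()` in place of `eval()` purely as a safer stand-in for evaluating string literals; the variable names are illustrative only.

```python
import ast

decomposed = 'hyva\u0308'
with_nul = 'foo\x00'

# Today's repr() already round-trips both values.
assert ast.literal_eval(repr(decomposed)) == decomposed
assert ast.literal_eval(repr(with_nul)) == with_nul

# The escaped spelling proposed here evaluates to the same string.
proposed = "'hyva\\u0308'"
assert ast.literal_eval(proposed) == decomposed

print(repr(with_nul))  # 'foo\x00'
print(proposed)        # 'hyva\u0308'
```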
msg315720 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-04-25 04:32
On Tue, Apr 24, 2018, at 04:33, Pekka Klärck wrote:
> 
> Pekka Klärck <pekka.klarck@gmail.com> added the comment:
> 
> I didn't submit this as a bug report but as an enhancement request. From 
> usability point of view, saying that results differ but you just cannot 
> see the difference is not very helpful.
> 
> The exact reason I didn't submit this as an enhancement request for 
> unittest, pytest, and all other modules/tools being affected is that 
> "I'm not sure if there's a good way to detect whether two unicode 
> strings are going to display confusingly similarly". Enhancing `repr()` 
> would be a logical solution to this problem.

I should have said "there's no way to unambiguously represent a particular Unicode string except as a sequence of integers, which isn't normally what anyone wants to see". This decomposition problem is only one of many. Even in ASCII land, fonts often have very similar glyphs for "l", "I", and "1".
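
For reference, the "sequence of integers" view looks like the sketch below. `code_points` and `describe` are hypothetical helper names; `unicodedata.name()` is used only to make the integers easier to read.

```python
import unicodedata


def code_points(text):
    """Return the string as a list of integer code points."""
    return [ord(ch) for ch in text]


def describe(text):
    """Return one 'U+XXXX NAME' entry per character (hypothetical helper)."""
    return ['U+%04X %s' % (ord(ch), unicodedata.name(ch, '<unnamed>')) for ch in text]


print(code_points('hyv\xe4'))     # [104, 121, 118, 228]
print(code_points('hyva\u0308'))  # [104, 121, 118, 97, 776]
print(describe('a\u0308'))
```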
History
Date                 User               Action  Args
2018-04-25 04:32:03  benjamin.peterson  set     messages: + msg315720
2018-04-24 11:33:16  pekka.klarck       set     messages: + msg315695
2018-04-24 03:49:42  benjamin.peterson  set     status: open -> closed; resolution: not a bug; messages: + msg315685; stage: resolved
2018-04-20 08:53:49  pekka.klarck       set     messages: + msg315507
2018-04-20 07:54:47  serhiy.storchaka   set     versions: + Python 3.8, - Python 3.4, Python 3.5, Python 3.6; nosy: + serhiy.storchaka, ezio.melotti, lemburg, benjamin.peterson, vstinner; messages: + msg315506; components: + Interpreter Core, Unicode; type: enhancement
2018-04-20 07:32:27  pekka.klarck       set     messages: + msg315505; versions: + Python 3.4, Python 3.5, Python 3.6
2018-04-20 07:24:42  pekka.klarck       create