Message 401078 - Python tracker


Author	maxbachmann
Recipients	ezio.melotti, maxbachmann, vstinner
Date	2021-09-05.11:12:09
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1630840329.53.0.683865786934.issue45105@roundup.psfhosted.org>
In-reply-to

Content

I noticed unexpected indexing behavior when a string contains the Unicode character \U00010900 and the character is inserted as a literal character rather than as an escape sequence.
Here is the result on the Python console for both 3.6 and 3.9:
```
>>> s = '0𐤀00'
>>> s
'0𐤀00'
>>> ls = list(s)
>>> ls
['0', '𐤀', '0', '0']
>>> s[0]
'0'
>>> s[1]
'𐤀'
>>> s[2]
'0'
>>> s[3]
'0'
>>> ls[0]
'0'
>>> ls[1]
'𐤀'
>>> ls[2]
'0'
>>> ls[3]
'0'
```

It appears that, for some reason, in this specific case the character is actually stored at a different position than the one shown when printing the complete string. Note that the string already behaves strangely when selecting it in the console: selecting the special character immediately highlights the last 3 characters (probably because the console already treats this character as being in the second position).
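One way to check where the character really sits, independent of how the console draws the string, is to look at the code points themselves. The sketch below is only a diagnostic aid and not part of the original report; the helper name `describe` is hypothetical, and the string is built with an escape sequence so its logical order is unambiguous. It also prints the character's bidirectional class, since a possible explanation (not confirmed in this message) is that the terminal reorders right-to-left characters for display.

```python
import unicodedata


def describe(s):
    """Return (index, code point, Unicode name, bidi class) for each character of s.

    Hypothetical helper for illustration; it reports the logical (stored)
    order, which does not depend on how a terminal renders the string.
    """
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"),
         unicodedata.bidirectional(ch))
        for i, ch in enumerate(s)
    ]


# Same logical content as the first session: the special character at index 1.
s = "0" + "\U00010900" + "00"

# ascii() escapes every non-ASCII character, so the output cannot be
# visually reordered by the console.
print(ascii(s))

for row in describe(s):
    print(row)

# Logical position of the character, regardless of how it is displayed.
print(s.index("\U00010900"))
```

If the bidirectional class reported for U+10900 is `R` (right-to-left), a console that implements bidirectional text layout may draw the character in a different visual position from its index in the string, which would match the selection behavior described above.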

The same behavior does not occur when the Unicode code point is written directly as an escape sequence:
```
>>> s='000\U00010900'
>>> s
'000𐤀'
>>> s[0]
'0'
>>> s[1]
'0'
>>> s[2]
'0'
>>> s[3]
'𐤀'
```

This was tested using the following Python versions:
```
Python 3.6.0 (default, Dec 29 2020, 02:18:14) 
[GCC 10.2.1 20201125 (Red Hat 10.2.1-9)] on linux

Python 3.9.6 (default, Jul 16 2021, 00:00:00) 
[GCC 11.1.1 20210531 (Red Hat 11.1.1-3)] on linux
```
Both were run on Fedora 34.

History
Date	User	Action	Args
2021-09-05 11:12:09	maxbachmann	set	recipients: + maxbachmann, vstinner, ezio.melotti
2021-09-05 11:12:09	maxbachmann	set	messageid: <1630840329.53.0.683865786934.issue45105@roundup.psfhosted.org>
2021-09-05 11:12:09	maxbachmann	link	issue45105 messages
2021-09-05 11:12:09	maxbachmann	create