classification
Title: Incorrect handling of unicode character \U00010900
Type: behavior Stage:
Components: Unicode Versions: Python 3.9, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, ezio.melotti, maxbachmann, ronaldoussoren, serhiy.storchaka, steven.daprano
Priority: normal Keywords:

Created on 2021-09-05 11:12 by maxbachmann, last changed 2021-09-12 08:57 by serhiy.storchaka.

Files
File name Uploaded Description Edit
selection.png steven.daprano, 2021-09-05 13:08
Messages (12)
msg401078 - (view) Author: Max Bachmann (maxbachmann) * Date: 2021-09-05 11:12
I noticed that when using the Unicode character \U00010900 when inserting the character as character:
Here is the result on the Python console both for 3.6 and 3.9:
```
>>> s = '0𐤀00'
>>> s
'0𐤀00'
>>> ls = list(s)
>>> ls
['0', '𐤀', '0', '0']
>>> s[0]
'0'
>>> s[1]
'𐤀'
>>> s[2]
'0'
>>> s[3]
'0'
>>> ls[0]
'0'
>>> ls[1]
'𐤀'
>>> ls[2]
'0'
>>> ls[3]
'0'
```

It appears that, for some reason, in this specific case the character is actually stored in a different position than shown when printing the complete string. Note that the string already behaves strangely when selecting it in the console: when selecting the special character, it directly highlights the last 3 characters (probably because it already thinks this character is in the second position).

The same behavior does not occur when directly using the unicode point
```
>>> s='000\U00010900'
>>> s
'000𐤀'
>>> s[0]
'0'
>>> s[1]
'0'
>>> s[2]
'0'
>>> s[3]
'𐤀'
```

This was tested using the following Python versions:
```
Python 3.6.0 (default, Dec 29 2020, 02:18:14) 
[GCC 10.2.1 20201125 (Red Hat 10.2.1-9)] on linux

Python 3.9.6 (default, Jul 16 2021, 00:00:00) 
[GCC 11.1.1 20210531 (Red Hat 11.1.1-3)] on linux
```
on Fedora 34
msg401079 - (view) Author: Max Bachmann (maxbachmann) * Date: 2021-09-05 11:18
This is the result of copy-pasting the example posted above on Windows using
```
Python 3.7.8 (tags/v3.7.8:4b47a5b6ba, Jun 28 2020, 08:53:46) [MSC v.1916 64 bit (AMD64)] on win32
```
which appears to run into similar problems:
```
>>> s = '0��00'
>>> s
'0𐤀00'
>>> ls = list(s)
>>> ls
['0', '𐤀', '0', '0']
>>> s[0]
'0'
>>> s[1]
'𐤀'
```
msg401082 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-09-05 12:53
AFAICT, there is no bug here. It's just confusing how Unicode right-to-left characters in the repr() can modify how it's displayed in the console/terminal. Use the ascii() representation to avoid the problem.
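
A minimal sketch of the ascii() suggestion, assuming the reported string has the Phoenician character at index 1 (the variable name `s` is only illustrative):

```python
# Phoenician ALF (U+10900) at index 1, followed by two "0" digits.
s = '0\U0001090000'

# repr() keeps the printable right-to-left character as-is, so the
# terminal is free to reorder what it displays.
print(repr(s))

# ascii() escapes every non-ASCII character, so the output contains only
# left-to-right ASCII and shows the logical (stored) order.
print(ascii(s))
```

The ascii() output is pure ASCII, so no bidirectional reordering can take place when the terminal renders it.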

> The same behavior does not occur when directly using the unicode point
> ```
> >>> s='000\U00010900'

The original string has the Phoenician right-to-left character at index 1, not at index 3. The "0" number characters in the original have weak directionality when displayed. You can see the reversal with a numeric sequence that's separated by spaces. For example:

>>> s = '123\U00010900456'
>>> print(*s, sep='\n')
1
2
3
𐤀
4
5
6
>>> print(*s)
1 2 3 𐤀 4 5 6

Latin letters have left-to-right directionality. For example:

>>> s = '123\U00010900abc'
>>> print(*s)
1 2 3 𐤀 a b c

You can check the bidirectional property [1] using the unicodedata module:

>>> import unicodedata as ud
>>> ud.bidirectional('\U00010900')
'R'
>>> ud.bidirectional('0')
'EN'
>>> ud.bidirectional('a')
'L'
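
To inspect a whole string rather than single characters, something along these lines may help. It is only a sketch: `describe` is a hypothetical helper name, and it reports each character by index in logical order, so the terminal's rendering does not matter.

```python
import unicodedata


def describe(s):
    """Return (index, code point, bidi class, name) for each character of s."""
    return [
        (
            i,
            f"U+{ord(ch):04X}",
            unicodedata.bidirectional(ch),
            unicodedata.name(ch, "<unnamed>"),
        )
        for i, ch in enumerate(s)
    ]


# Assumed input: the string from the report, with ALF at index 1.
for row in describe('0\U0001090000'):
    print(row)
```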

---

[1] https://en.wikipedia.org/wiki/Unicode_character_property#Bidirectional_writing
msg401083 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-09-05 13:08
I'm afraid I cannot reproduce the problem.

>>> s = '000𐤀'  # \U00010900
>>> s
'000𐤀'
>>> s[0]
'0'
>>> s[1]
'0'
>>> s[2]
'0'
>>> s[3]
'𐤀'
>>> list(s)
['0', '0', '0', '𐤀']


That is using Python 3.9 in the xfce4-terminal. Which xterm are you using?

I am very confident that it is a bug in some external software, possibly the xterm, possibly the browser or other application where you copied the PHOENICIAN LETTER ALF character from in the first place. It looks like it is related to mishandling of the Right-To-Left character:

>>> import unicodedata
>>> unicodedata.bidirectional(s[3])
'R'


Using Firefox, when I attempt to select the text s = '000...' in Max's initial message with the mouse, the selection highlighting jumps around. See the attached screenshot (selection.png). Depending on how I copy the text, sometimes I get '000 ALF' and sometimes '0 ALF 00', which hints that something is getting confused by the RTL character, possibly the browser, possibly the copy/paste clipboard, possibly the terminal. But regardless, I cannot replicate the behaviour you show where list(s) is different from indexing the characters one by one.

It is very common for applications to mishandle mixed RTL and LTR characters, and that can have all sorts of odd display and copy/paste issues.
msg401084 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-09-05 13:15
Eryk Sun said:

> The original string has the Phoenician right-to-left character at index 1, not at index 3.


I think you may be mistaken. In Max's original post, he has

    s = '000X'

where the X is actually the Phoenician ALF character. At least that is how it is displayed in my browser.

(But note that in the Windows terminal, Max has '0X00' instead.)

Max's demonstration code shows a discrepancy between extracting the chars one by one using indexing, and with list. Simulating his error:

    s = '000X'  # X is actually ALF
    list(s)
    # --> returns [0 0 0 X]
    [s[i] for i in range(4)]  # indexing each char one at a time
    # --> returns [0 X 0 0]

I have not yet been able to replicate that reported behaviour.

I agree totally with Eryk Sun that this is probably not a Python bug. He thinks it is displaying the correct behaviour. I think it is probably a browser or xterm bug.

But unless someone can replicate the mismatch between list and indexing, I doubt it is something we can do anything about.
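
The comparison between indexing and list() can be checked without looking at any rendered output. This is a sketch only; `index_matches_list` is a hypothetical helper, and both candidate strings are assumptions about what the original string might contain.

```python
def index_matches_list(s):
    """Compare one-by-one indexing against list(s), independent of display."""
    return [s[i] for i in range(len(s))] == list(s)


# Both orderings discussed in this thread, written with escapes so the
# source itself is not affected by bidirectional rendering.
for candidate in ('000\U00010900', '0\U0001090000'):
    print(ascii(candidate), index_matches_list(candidate))
```

If this reports a match for a given string, any apparent mismatch comes from how the output is displayed, not from how the string is stored.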
msg401086 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-09-05 13:56
> I think you may be mistaken. In Max's original post, he has
>   s = '000X'

It displays that way for me under Firefox in Linux, but what's really there when I copy it from Firefox is '0\U0001090000', which matches the result Max gets for individual index operations such as s[1]. 

The "0" characters following the R-T-L character have weak directionality. So the string displays the same as "000\U00010900". If you print with spaces and use a number sequence, the substring starting with the R-T-L character should display reversed, i.e. print(*'123\U00010900456') should display the same as print(*'123654\U00010900'). But "abc" in print(*'123\U00010900abc') should not display reversed since it has L-T-R directionality.
msg401088 - (view) Author: Max Bachmann (maxbachmann) * Date: 2021-09-05 14:51
> That is using Python 3.9 in the xfce4-terminal. Which xterm are you using?

This was in the default GNOME terminal that is pre-installed on Fedora 34; on Windows I directly opened the Python terminal. I just installed xfce4-terminal on my Fedora 34 machine, and it shows exactly the same behavior that I had in the GNOME terminal.

> But regardless, I cannot replicate the behavior you show where list(s) is different from indexing the characters one by one.

That is what surprised me the most. I just ran into this because this was somehow generated when fuzz testing my code using hypothesis (which uncovered an unrelated bug in my application). However I was quite confused by the character order when debugging it.

My original case was:
```
s1='00ĀĀĀĀ'
s2='9010𐤀000\x8dÀĀĀĀ222Ā'
parts = [s2[max(0, i) : min(len(s2), i+len(s1))] for i in range(-len(s1), len(s2))]
for part in parts:
    print(list(part))
```
which produced
```
[]
['9']
['9', '0']
['9', '0', '1']
['9', '0', '1', '0']
['9', '0', '1', '0', '𐤀']
['9', '0', '1', '0', '𐤀', '0']
['0', '1', '0', '𐤀', '0', '0']
['1', '0', '𐤀', '0', '0', '0']
['0', '𐤀', '0', '0', '0', '\x8d']
['𐤀', '0', '0', '0', '\x8d', 'À']
['0', '0', '0', '\x8d', 'À', 'Ā']
['0', '0', '\x8d', 'À', 'Ā', 'Ā']
['0', '\x8d', 'À', 'Ā', 'Ā', 'Ā']
['\x8d', 'À', 'Ā', 'Ā', 'Ā', '2']
['À', 'Ā', 'Ā', 'Ā', '2', '2']
['Ā', 'Ā', 'Ā', '2', '2', '2']
['Ā', 'Ā', '2', '2', '2', 'Ā']
['Ā', '2', '2', '2', 'Ā']
['2', '2', '2', 'Ā']
['2', '2', 'Ā']
['2', 'Ā']
['ĀÀ]
```
which has a missing single quote:
  - ['ĀÀ]
changing direction of characters including commas:
  - ['1', '0', '𐤀', '0', '0', '0']
changing direction back:
  - ['𐤀', '0', '0', '0', '\x8d', 'À']

> AFAICT, there is no bug here. It's just confusing how Unicode right-to-left characters in the repr() can modify how it's displayed in the console/terminal.

Yes, it appears the same confusion occurs in other applications like Firefox and VS Code.
Thanks to @eryksun and @steven.daprano for testing and telling me about bidirectional writing in Unicode. (The more I know about Unicode, the more it scares me.)
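
For debugging output like the lists above, one possible approach is to escape only the strong right-to-left characters and leave everything else readable. This is a sketch, not an established recipe: `escape_rtl` and `RTL_CLASSES` are hypothetical names, and treating only the 'R' and 'AL' bidirectional classes as right-to-left is an assumption.

```python
import unicodedata

# Assumption: only strong right-to-left classes need escaping for display.
RTL_CLASSES = {'R', 'AL'}


def escape_rtl(text):
    """Replace strong RTL characters with backslash escapes; keep the rest."""
    out = []
    for ch in text:
        if unicodedata.bidirectional(ch) in RTL_CLASSES:
            cp = ord(ch)
            out.append(f'\\U{cp:08x}' if cp > 0xFFFF else f'\\u{cp:04x}')
        else:
            out.append(ch)
    return ''.join(out)


part = ['1', '0', '\U00010900', '0', '0', '0']
# Escape the repr of the list so commas and quotes stay in logical order.
print(escape_rtl(repr(part)))
```

Unlike ascii(), this keeps left-to-right characters such as 'À' and 'Ā' unescaped.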
msg401090 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-09-05 15:53
> what's really there when I copy it from Firefox is '0\U0001090000', 
> which matches the result Max gets for individual index operations such as s[1]. 

But *not* the result that Max got from calling list().

Can you reproduce that difference between indexing and list?

Also you say "what's really there", but what is your reasoning for that? 
How do you know that Firefox is displaying the string wrongly, rather 
than displaying it correctly and copying it to the clipboard wrongly?

When I look at the page source of the b.p.o page, I see:

    <pre>I noticed that when using the Unicode character \U00010900 when 
    inserting the character as character:
    Here is the result on the Python console both for 3.6 and 3.9:
    ```
    &gt;&gt;&gt; s = '000X'

again, with X standing in for the Phoenician ALF character. But when I 
copy and paste it into my terminal, I see

    &gt;&gt;&gt; s = '0X00'
msg401092 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-09-05 16:01
Hmmm, digging deeper, I saved the page source code and opened it with 
hexdump. The relevant lines are:

00007780  60 60 0d 0a 26 67 74 3b  26 67 74 3b 26 67 74 3b  |``..&gt;&gt;&gt;|
00007790  20 73 20 3d 20 27 30 f0  90 a4 80 30 30 27 0d 0a  | s = '0....00'..|

which looks like Eryk Sun is correct, what is really there is '0X00' and 
Firefox just displays it in RTL order '000X'.

Mystery solved :-)
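
As a cross-check, the bytes between the quotes in the hexdump can be decoded directly. This is a small sketch that assumes the page source is UTF-8 encoded:

```python
# Bytes between the single quotes in the hexdump: 30 f0 90 a4 80 30 30
raw = bytes.fromhex('30 f0 90 a4 80 30 30')

# Assumption: the saved page is UTF-8 encoded.
s = raw.decode('utf-8')

print(ascii(s))
print([f'U+{ord(ch):04X}' for ch in s])
```

The four-byte sequence f0 90 a4 80 is the UTF-8 encoding of U+10900, so the character sits at index 1 of the decoded string.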

So now that only leaves the (unreproduced?) bug report of the difference 
in order between indexing and list(). Max, are you still certain that 
this difference exists? Can you replicate it with other strings, 
preferably with distinct characters?
msg401095 - (view) Author: Max Bachmann (maxbachmann) * Date: 2021-09-05 16:32
As far as I understood, this is caused by the same reason:

```
>>> s = '123\U00010900456'
>>> s
'123𐤀456'
>>> list(s)
['1', '2', '3', '𐤀', '4', '5', '6']
# note that everything including the commas is mirrored until ] is reached
>>> s[3]
'𐤀'
>>> list(s)[3]
'𐤀'
>>> ls = list(s)
>>> ls[3] += 'a'
>>> ls
['1', '2', '3', '𐤀a', '4', '5', '6']
```

Which as far as I understood is the expected behavior when a right-to-left character is encountered.
msg401656 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2021-09-12 08:38
@Steven: the difference between indexing and the repr of list() is also explained by Eryk's explanation.

s = ... # (value from msg401078)
for x in repr(list(s)):
   print(x)

The output shows characters in the expected order.
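
A variation on the loop above that prints an index and an ASCII-only form of each character, so that nothing on the line can be reordered. This is a sketch: the value of `s` is an assumption based on the hexdump in msg401092.

```python
# Assumed value of s (ALF at index 1), per the hexdump in msg401092.
s = '0\U0001090000'

text = repr(list(s))

# One character per line, in logical order, with an index and ASCII escapes.
for i, ch in enumerate(text):
    print(i, ascii(ch))
```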
msg401658 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-09-12 08:57
We recently discussed the RTLO attack on Python sources (sorry, I don't remember on what resource) and decided that we should do something about this. I think this is a related issue.
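
For illustration only, a sketch of how source text could be scanned for explicit bidirectional control characters. The helper name `find_bidi_controls` and the choice of bidirectional classes are assumptions, not anything decided in that discussion.

```python
import unicodedata

# Assumption: flag explicit embedding, override and isolate controls only.
BIDI_CONTROL_CLASSES = {
    'LRE', 'RLE', 'PDF', 'LRO', 'RLO', 'LRI', 'RLI', 'FSI', 'PDI',
}


def find_bidi_controls(source):
    """Return (line number, column, code point) for each bidi control found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line):
            if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
                hits.append((lineno, col, f'U+{ord(ch):04X}'))
    return hits


# Hypothetical sample containing U+202E (RIGHT-TO-LEFT OVERRIDE).
sample = 'x = 1\ny = "a\u202eb"\n'
print(find_bidi_controls(sample))
```

Note that this would not flag the Phoenician character from this issue, which is an ordinary right-to-left letter rather than a control character.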
History
Date User Action Args
2021-09-12 08:57:14 serhiy.storchaka set messages: + msg401658
2021-09-12 08:38:45 ronaldoussoren set nosy: + ronaldoussoren; messages: + msg401656
2021-09-10 22:27:49 terry.reedy set nosy: + serhiy.storchaka
2021-09-06 16:40:21 vstinner set nosy: - vstinner
2021-09-05 16:32:08 maxbachmann set messages: + msg401095
2021-09-05 16:01:40 steven.daprano set messages: + msg401092
2021-09-05 15:53:23 steven.daprano set messages: + msg401090
2021-09-05 14:51:42 maxbachmann set status: pending -> open; messages: + msg401088
2021-09-05 13:57:48 serhiy.storchaka set status: open -> pending
2021-09-05 13:56:38 eryksun set messages: + msg401086
2021-09-05 13:15:48 steven.daprano set messages: + msg401084
2021-09-05 13:08:09 steven.daprano set files: + selection.png; nosy: + steven.daprano; messages: + msg401083
2021-09-05 12:53:10 eryksun set nosy: + eryksun; messages: + msg401082
2021-09-05 11:18:39 maxbachmann set messages: + msg401079
2021-09-05 11:12:09 maxbachmann create