This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: Display of Unicode strings with bidi characters
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 3.6
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, steven.daprano, terry.reedy, xxm
Priority: normal
Keywords:

Created on 2020-11-08 05:51 by xxm, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (3)
msg380534 - Author: Xinmeng Xia (xxm) Date: 2020-11-08 05:51
When printing an assignment-like string containing the Unicode character ܯ (U+072F) on the command line, we get an unexpected result.
Example A:
>>> print(chr(1839)+" = 1")
ܯ = 1

Similar problems occur with many other Unicode characters.
msg380536 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2020-11-08 07:28
Works for me:

>>> chr(1839)+'1'
'ܯ1'

You are mixing a right-to-left code point (DHALATH) with a left-to-right code point (digit 1). The result depends on the quality of your console or terminal. Try using a different terminal.

On my system, the terminal displays the DHALATH on the left, and the digit 1 on the right; when pasted into my browser, it displays them in the reverse order. I don't know which is correct: bidirectional text is complex and I don't know the rules for mixing characters with different bidirectional classes.

But whichever display is correct, this has nothing to do with Python. It depends on the quality of the bidirectional text rendering of the browser and the terminal.

If your terminal displays the wrong results, that's a bug in the terminal. What terminal are you using, in what OS? Try using a different terminal.

You can check that Python is doing the right thing:


>>> s = chr(1839)+'1'
>>> s == '\N{SYRIAC LETTER PERSIAN DHALATH}1'
True

If your system reports True, then Python has made the string you asked for, and the result of printing depends on the capabilities of the terminal, and the available glyphs in the typeface used by the terminal. There's nothing Python can do about that.
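A further check along the same lines, as a minimal sketch: list the code points in the order Python stores them, using only ASCII output so the terminal's bidi handling cannot affect what you see. The helper name `describe` is hypothetical and not part of any library.

```python
import unicodedata

def describe(s):
    """Return (code point, name) pairs in logical (storage) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in s]

# The string from the original report.
s = chr(1839) + " = 1"
for code_point, name in describe(s):
    print(code_point, name)
```

The first entry should be U+072F SYRIAC LETTER PERSIAN DHALATH and the last U+0031 DIGIT ONE, whatever order the terminal draws the glyphs in.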
msg380943 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-11-13 23:42
Xia, when saying 'unexpected', one usually needs to also say what was expected.  When discussing mixed direction chars, we need to be especially careful in describing what we see with different terminals, different browsers, and different OSes.

Steven: On Windows, I see the same thing: "Dhalath 1" prints in that order in both IDLE's Shell and Python's REPL in Command Prompt (with the Dhalath shown as a replacement box in the latter), but is reversed here, 'ܯ1', in Firefox (and the same in Microsoft Edge). But, I just discovered, the two browsers (and Notepad, LibreOffice Writer, and likely other text editors) treat runs of Latin digits specially: "Dhalath a" pastes in that order, 'ܯa', and "Dhalath 1 2" pastes as "1 2 Dhalath", 'ܯ12'.

The block, but not the individual digits, is reversed. This allows R2L writers to use what are now the global digits. In Arabic, numbers are written and read right to left, low order to high. So Europeans, used to writing and reading left to right, high to low, kept the same order. Perhaps the bidi property of the digits in the Unicode database is different from that of other Latin chars.

It seems that '=' is also bidirectional, but properly is not treated as a digit. "Dhalath = 1" is reversed in both browsers and text editors, so that it reads 'Dhalath' 'equals' 'one' when read right to left.
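The guesses about digits and '=' can be checked against the Unicode Character Database that ships with Python. A small sketch using `unicodedata.bidirectional()`; the dictionary names `samples` and `bidi_classes` are only illustrative:

```python
import unicodedata

# Bidirectional classes from the Unicode Character Database:
# 'L'  = left-to-right letter, 'AL' = Arabic-type right-to-left letter,
# 'EN' = European number,      'ON' = other neutral.
samples = {"a": "a", "1": "1", "=": "=", "U+072F": chr(0x072F)}
bidi_classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label, cls in bidi_classes.items():
    print(label, cls)
```

This should report L for 'a', EN for '1', ON for '=' and AL for U+072F, so the digits do carry a different bidi class from Latin letters, and '=' is a neutral whose direction is resolved from its neighbours.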

The general rule is that blocks of same direction chars are written appropriately as encountered.  It seems that the classification of some characters depends on the context.  The following is as expected,
>>> 'ab'+chr(1837)+chr(1838)+chr(1839)+'cd'
'abܭܮܯcd'
with the R2L triplet reversed.
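To see the "blocks of same direction chars" without relying on any renderer, here is a rough sketch that groups consecutive characters by their bidi class. The helper name `bidi_runs` is hypothetical, and grouping by raw class is a simplification: the actual Unicode Bidirectional Algorithm also resolves weak and neutral classes from context.

```python
import unicodedata
from itertools import groupby

def bidi_runs(s):
    """Group consecutive characters that share a bidirectional class."""
    return [
        (cls, [f"U+{ord(ch):04X}" for ch in run])
        for cls, run in groupby(s, key=unicodedata.bidirectional)
    ]

s = "ab" + chr(1837) + chr(1838) + chr(1839) + "cd"
for cls, code_points in bidi_runs(s):
    print(cls, code_points)
```

For this string the sketch yields three runs (L, AL, L), and the code points inside the AL run stay in storage order; only a bidi-aware display draws that run right to left.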

In any case, Steven is correct that Python correctly stores chars in the order given and that there is no Python bug.
History
Date                 User            Action  Args
2022-04-11 14:59:37  admin           set     github: 86456
2020-11-13 23:43:26  terry.reedy     set     title: Unicode inconsistent display after concencated -> Display of Unicode strings with bidi characters
2020-11-13 23:42:38  terry.reedy     set     status: open -> closed; nosy: + terry.reedy; messages: + msg380943; resolution: not a bug; stage: resolved
2020-11-09 14:56:10  vstinner        set     nosy: - vstinner
2020-11-08 07:28:04  steven.daprano  set     nosy: + steven.daprano; messages: + msg380536
2020-11-08 05:51:50  xxm             create