classification
Title: reversing a Unicode ligature doesn't work
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.4
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, ezio.melotti, larry, vstinner
Priority: low Keywords:

Created on 2013-11-27 23:51 by larry, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (5)
msg204628 - Author: Larry Hastings (larry) (Python committer) Date: 2013-11-27 23:51
Read this today:

http://mortoray.com/2013/11/27/the-string-type-is-broken/

In it the author talks about how the 'ffl' ligature breaks some string processing.  He claimed that Python 3 doesn't uppercase it correctly--well, it does.  However I discovered that it doesn't reverse it properly.

    x = b'ba\xef\xac\x84e'.decode('utf-8') # "baffle", where "ffl" is a ligature
    print(x) # prints "baffle", with the ligature
    print(x.upper())  # prints "BAFFLE", no ligature, which is fine
    print("".join(reversed(x))) # prints "efflab"

Shouldn't that last line print "elffab"?

If this gets marked as "wontfix" I wouldn't complain.  Just wondering what the Right Thing is to do here.
msg204629 - Author: Christian Heimes (christian.heimes) (Python committer) Date: 2013-11-28 00:07
There is no ligature for "lff", just "ffl". A ligature is a single code point, so reversed() moves it as one unit. I guess Python would have to grow a str.reverse() method to handle ligatures and combining chars correctly.
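To illustrate the point about the ligature being one character, here is a minimal sketch (not part of the original message) that inspects the string from the report. It only uses the standard `unicodedata` module; the variable names are chosen for the example.

```python
import unicodedata

# The string from the report: "b", "a", U+FB04 (the "ffl" ligature), "e".
x = b'ba\xef\xac\x84e'.decode('utf-8')

# Four code points, not six: the ligature counts as one.
print(len(x))  # 4

# List each code point with its Unicode name.
for ch in x:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# reversed() reorders code points, so the ligature is moved, not split.
reversed_x = "".join(reversed(x))
print([f"U+{ord(ch):04X}" for ch in reversed_x])
# ['U+0065', 'U+FB04', 'U+0061', 'U+0062']
```

Seen this way, the reversed string is the exact code point reversal of the input; it just renders as "efflab" because the ligature glyph still reads "ffl".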

At work I ran into this issue with ligatures and combining chars multiple times in medieval and early modern scripts. Eventually I started to normalize all incoming data to NFKC. That solves most of the issues.

>>> import unicodedata
>>> s = b'ba\xef\xac\x84e'.decode('utf-8')
>>> print("".join(reversed(s)))
efflab
>>> print("".join(reversed(unicodedata.normalize("NFKC", s))))
elffab
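As a rough sketch of what a reverse that handles both ligatures and combining chars might look like, the hypothetical helper below (`reverse_text` is a name made up for this example, not a Python API) normalizes first and then keeps each combining mark attached to its base character. It is an approximation under those assumptions: it does not implement full grapheme cluster segmentation, and NFKC/NFKD normalization is lossy because the ligature is replaced by plain letters.

```python
import unicodedata

def reverse_text(s, form="NFKC"):
    """Hypothetical helper: reverse s, keeping combining marks on their base."""
    normalized = unicodedata.normalize(form, s)
    clusters = []
    for ch in normalized:
        # combining() returns a non-zero class for combining marks,
        # so such a character is appended to the preceding cluster.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

s = b'ba\xef\xac\x84e'.decode('utf-8')
print(reverse_text(s))  # elffab

# With a decomposed accent (NFKD keeps "e" + U+0301 as two code points),
# the accent stays on the "e" instead of drifting onto the next letter.
accented = reverse_text('cafe\u0301', form="NFKD")
print(accented == 'e\u0301fac')  # True
```

A plain `reversed()` on the decomposed string would put U+0301 first, leaving the accent with no base character in front of it.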
msg204630 - Author: Christian Heimes (christian.heimes) (Python committer) Date: 2013-11-28 00:16
A proper str.reverse function would have to deal with many more special cases. For example, there are special rules for the Old German long s (ſ) and the round s (s). A round s may only occur at the end of a syllable. Hebrew has a special variant of several characters if the character is placed at the end of a word (HEBREW LETTER PE / HEBREW LETTER FINAL PE).

A simple reversed(s) can never deal with all the complicated rules.
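The Hebrew case can be shown with a small sketch (added for illustration; the sample word is my own choice, not from the original message). Reversing code points leaves the final form at the front of the string, where a word-final variant no longer makes sense.

```python
import unicodedata

# "kesef": KAF, SAMEKH, FINAL PE, stored in logical (reading) order.
word = "\u05DB\u05E1\u05E3"

flipped = "".join(reversed(word))

# The word-final variant is still U+05E3 after reversal; reversed() has no
# way to know it should become the regular HEBREW LETTER PE (U+05E4).
print(unicodedata.name(word[-1]))    # HEBREW LETTER FINAL PE
print(unicodedata.name(flipped[0]))  # HEBREW LETTER FINAL PE
```

Fixing this would need language-specific rules, which is the point being made above.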
msg204631 - Author: STINNER Victor (vstinner) (Python committer) Date: 2013-11-28 00:22
Python implements the Unicode standards. Unless Python failed to implement the standard correctly, the author should complain to the Unicode Consortium directly!
http://www.unicode.org/contacts.html

Example of data for the "ffl" character, U+FB04:

FB04;LATIN SMALL LIGATURE FFL;Ll;0;L;<compat> 0066 0066 006C;;;;N;;;;;

http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt

(I'm unable to decode these raw data :-))
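For readers who want to decode that record, here is a minimal sketch (added for illustration, not part of the original message). It assumes the standard semicolon-separated UnicodeData.txt field layout and cross-checks a few fields against Python's `unicodedata` module; the variable names are only for the example.

```python
import unicodedata

line = "FB04;LATIN SMALL LIGATURE FFL;Ll;0;L;<compat> 0066 0066 006C;;;;N;;;;;"
fields = line.split(";")

code_point = int(fields[0], 16)    # U+FB04
name = fields[1]                   # character name
category = fields[2]               # general category: Ll = lowercase letter
combining_class = int(fields[3])   # 0 = not a combining mark
bidi_class = fields[4]             # L = left-to-right
decomposition = fields[5]          # compatibility mapping to f, f, l

# Turn "<compat> 0066 0066 006C" into the tag and the expanded string.
tag, *parts = decomposition.split()
expansion = "".join(chr(int(p, 16)) for p in parts)
print(tag, expansion)  # <compat> ffl

# The same data as exposed by the unicodedata module.
ch = chr(code_point)
print(unicodedata.name(ch) == name)                    # True
print(unicodedata.category(ch) == category)            # True
print(unicodedata.decomposition(ch) == decomposition)  # True
```

The `<compat>` tag is why NFKC normalization replaces the ligature with the three letters "ffl".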
msg204632 - Author: STINNER Victor (vstinner) (Python committer) Date: 2013-11-28 00:33
I don't understand the purpose of using reversed(). Don't use it to display text backward. Displaying bidirectional text requires more complex tools. See for example the pango library:
https://developer.gnome.org/pango/stable/pango-Bidirectional-Text.html

I don't see anything wrong with Python here; it just implements the Unicode standards, so I'm closing the issue as invalid.
History
Date                 User              Action  Args
2022-04-11 14:57:54  admin             set     github: 64018
2013-11-28 14:26:49  ezio.melotti      set     nosy: + ezio.melotti; components: + Unicode; stage: needs patch -> resolved
2013-11-28 00:33:12  vstinner          set     status: open -> closed; resolution: not a bug; messages: + msg204632
2013-11-28 00:22:18  vstinner          set     nosy: + vstinner; messages: + msg204631
2013-11-28 00:16:31  christian.heimes  set     messages: + msg204630
2013-11-28 00:07:37  christian.heimes  set     nosy: + christian.heimes; messages: + msg204629
2013-11-27 23:51:14  larry             create