
Author jberg
Recipients Neui, SilentGhost, eryksun, jberg, ncoghlan
Date 2020-05-24.19:28:05
Message-id <1590348485.34.0.0647952861096.issue35883@roundup.psfhosted.org>
Content
As I said above, it could be argued that the bug is in glibc, in which case

https://p.sipsolutions.net/6a4e9fce82dbbfa0.txt

could be used as a simple LD_PRELOAD wrapper to work around this, just to illustrate the problem from that side.


Arguably, that puts glibc in violation of RFC 3629, which says:


3.  UTF-8 definition

[...]

   In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
   accessible range) are encoded using sequences of 1 to 4 octets.

[...]

      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

[...]

   Implementations of the decoding algorithm above MUST protect against
   decoding invalid sequences.

[...]
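
To make the quoted rule concrete, here is a minimal Python sketch of what "protect against decoding invalid sequences" means for the 4-octet row of the table. The function name decode_four_octets is made up for this illustration and is not part of Python or glibc. The 11110xxx pattern leaves 21 payload bits, so the bit layout alone can express values up to 0x1FFFFF, and a conforming decoder has to reject anything above U+10FFFF explicitly. This assumes the problematic input is a sequence of that kind.

```python
def decode_four_octets(seq: bytes) -> int:
    """Decode one 4-octet UTF-8 sequence per the RFC 3629 table.

    Hypothetical helper for illustration only; raises ValueError on
    anything the RFC treats as invalid for this row of the table.
    """
    if len(seq) != 4:
        raise ValueError("expected exactly 4 octets")
    lead, c1, c2, c3 = seq
    # Lead octet must match 11110xxx.
    if (lead & 0b11111000) != 0b11110000:
        raise ValueError("lead octet is not 11110xxx")
    # Each continuation octet must match 10xxxxxx.
    for octet in (c1, c2, c3):
        if (octet & 0b11000000) != 0b10000000:
            raise ValueError("continuation octet is not 10xxxxxx")
    # Assemble the 21 payload bits (3 + 6 + 6 + 6).
    cp = ((lead & 0x07) << 18) | ((c1 & 0x3F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F)
    if cp < 0x10000:
        raise ValueError("overlong encoding")
    if cp > 0x10FFFF:
        raise ValueError("code point beyond U+10FFFF")
    return cp


if __name__ == "__main__":
    print(hex(decode_four_octets(b"\xf4\x8f\xbf\xbf")))  # last valid code point
    try:
        decode_four_octets(b"\xf4\x90\x80\x80")  # bit pattern fits, value does not
    except ValueError as exc:
        print("rejected:", exc)
```

For comparison, Python's own strict decoder also refuses the second input: b"\xf4\x90\x80\x80".decode("utf-8") raises UnicodeDecodeError.
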

Here's a simple test program:

https://p.sipsolutions.net/ac091b4ea4b7f742.txt