
Author vstinner
Recipients Neui, SilentGhost, eryksun, ezio.melotti, jberg, ncoghlan, vstinner
Date 2021-03-13.13:10:42
I wrote PR 24843 to fix this issue. With this fix, os.fsencode(sys.argv[1]) returns the original byte sequence as expected.


I dislike the replace error handler since it loses information. The PEP 383 surrogateescape error handler exists to prevent losing information.
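The difference between the two error handlers can be illustrated with a minimal sketch. The byte sequence below is an example chosen for illustration (it is not taken from the original report): it encodes a value above U+10ffff, which Python's own UTF-8 codec treats as invalid.

```python
# Example bytes chosen for illustration: a 4-byte sequence that would
# encode a value above U+10FFFF, which Python's UTF-8 codec rejects.
raw = b"\xf4\x90\x80\x80"

# "replace" substitutes U+FFFD for undecodable bytes, so the original
# bytes cannot be recovered afterwards.
replaced = raw.decode("utf-8", errors="replace")
lossy = replaced.encode("utf-8", errors="replace")

# "surrogateescape" (PEP 383) maps each undecodable byte 0xNN to the lone
# surrogate U+DCNN and restores the byte when encoding.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

print(lossy == raw)     # False: information was lost
print(restored == raw)  # True: the round trip is lossless
```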

The root issue is that Py_DecodeLocale() creates wide characters outside the valid Python Unicode range, [U+0000; U+10ffff].
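As a rough illustration of that range at the Python level, the sketch below checks whether an integer can be a Python code point. The helper name is_valid_code_point is hypothetical and exists only for this example.

```python
import sys

def is_valid_code_point(value: int) -> bool:
    """Return True if chr() accepts value (hypothetical helper)."""
    try:
        chr(value)
    except (ValueError, OverflowError):
        return False
    return True

print(hex(sys.maxunicode))            # upper bound of the valid range
print(is_valid_code_point(0x10FFFF))  # last valid code point
print(is_valid_code_point(0x110000))  # first value past the range
```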

On Linux, Py_DecodeLocale() usually calls the C library's mbstowcs(). The problem is that the glibc UTF-8 decoder doesn't respect RFC 3629: it doesn't reject characters outside the [U+0000; U+10ffff] range. The following issue requests changing the glibc UTF-8 codec to respect RFC 3629, but it has been open since 2006:

Even if glibc changes, Python should behave the same on old glibc versions.

My PR modifies Py_DecodeLocale() to check whether there are characters outside the [U+0000; U+10ffff] range and to use the surrogateescape error handler in that case.
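The idea of that check can be sketched in Python. This is not the C code from the PR: the function name decode_locale_sketch is hypothetical, the wide_chars argument merely stands in for what the C library decoder would return, and the sketch assumes a UTF-8 locale for the fallback.

```python
from typing import Sequence

MAX_UNICODE = 0x10FFFF

def decode_locale_sketch(raw: bytes, wide_chars: Sequence[int]) -> str:
    """Hypothetical model of the range check described above.

    wide_chars simulates the libc decoder output for raw.
    """
    if all(0 <= ch <= MAX_UNICODE for ch in wide_chars):
        # Every wide character is a valid code point: use them directly.
        return "".join(map(chr, wide_chars))
    # Assumption for this sketch: on an out-of-range character, decode
    # the raw bytes again, escaping undecodable bytes as U+DCxx.
    return raw.decode("utf-8", errors="surrogateescape")

# Simulated case: the libc decoder accepted a sequence above U+10FFFF.
raw = b"\xf4\x90\x80\x80"
text = decode_locale_sketch(raw, [0x110000])
print(text.encode("utf-8", errors="surrogateescape") == raw)  # True
```

With the fallback, encoding the result with surrogateescape gives back the original bytes, which is the property the message expects from os.fsencode(sys.argv[1]).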
Date                 User      Action  Args
2021-03-13 13:10:44  vstinner  set     recipients: + vstinner, ncoghlan, ezio.melotti, SilentGhost, eryksun, Neui, jberg
2021-03-13 13:10:43  vstinner  set     messageid: <>
2021-03-13 13:10:43  vstinner  link    issue35883 messages
2021-03-13 13:10:42  vstinner  create