
classification
Title: python3.6.4 os.environ error when write chinese to file
Type: behavior
Stage: resolved
Components: Interpreter Core, IO, Library (Lib), Unicode
Versions: Python 3.6

process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: eryksun, ezio.melotti, rushant, terry.reedy, vstinner
Priority: normal
Keywords:

Created on 2021-03-21 04:46 by rushant, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (6)
msg389215 - Author: rushant (rushant) Date: 2021-03-21 04:46
# -*- coding: utf-8 -*-
import os
job_name = os.environ['a']
print(job_name)
print(isinstance(job_name, str))
print(type(job_name))
with open('name.txt', 'w', encoding='utf-8')as fw:
    fw.write(job_name)


I set the environment variable with:
export a="中文"
The script prints the following and then fails with an error:
中文
True
<class 'str'>
Traceback (most recent call last):
  File "aa.py", line 8, in <module>
    fw.write(job_name)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-5: surrogates not allowed
msg389560 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2021-03-26 19:09
3.6 only gets security patches.  You or someone needs to show an unfixed bug in master.  Your code runs for me on Windows, whereas you appear to be using *nix.  Replacing the file write with job_name.encode() should show the same behavior.  Do you see the same error with job_name = "中文" assigned at the top instead?
msg389568 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-03-26 20:45
I think this is a locale configuration problem, in which the locale encoding doesn't match the terminal encoding. If so, it can be closed as not a bug.

> export a="中文"

In POSIX, the shell reads "中文" from the terminal as bytes encoded in the terminal encoding, which could be UTF-8 or some legacy encoding. The value of `a` is set directly as this encoded text. There is no intermediate decode/encode stage in the shell. For a child process that decodes the value of the environment variable, as Python does, the locale's LC_CTYPE encoding should be the same or compatible with the terminal encoding.
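
As a rough illustration of the mismatch described above, the same two characters become different byte sequences depending on the terminal encoding, and a process that decodes them with a non-matching codec does not get the text back. GBK is only an assumed example of a legacy encoding here; nothing in the report says which encoding the terminal actually used.

```python
text = "中文"

# Bytes the shell would store in `a` for two possible terminal encodings.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")  # assumed legacy encoding, for illustration only

print(utf8_bytes)  # b'\xe4\xb8\xad\xe6\x96\x87'
print(gbk_bytes)   # b'\xd6\xd0\xce\xc4'

# Decoding with a codec that does not match the bytes fails under "strict".
try:
    gbk_bytes.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

# Decoding with the matching codec recovers the text.
assert mismatch_detected
assert gbk_bytes.decode("gbk") == text
assert utf8_bytes.decode("utf-8") == text
```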

> job_name = os.environ['a']
> print(job_name)

In POSIX, sys.stdout.errors, as used by print(), will be "surrogateescape" if the default LC_CTYPE locale is a legacy locale -- which in 3.6 is the case for the "C" locale, since it's usually limited to 7-bit ASCII. "surrogateescape" is also the errors handler used to decode the bytes in os.environb (POSIX) into the text in os.environ. When decoding, "surrogateescape" handles each non-ASCII byte that can't be decoded by translating it into a code point in the reserved surrogate range U+DC80 - U+DCFF. When encoding, it translates each surrogate code point back to the original byte value in the range 0x80 - 0xFF.

Given the above setup, byte sequences in os.environb that can't be decoded with the default LC_CTYPE locale encoding will be surrogate escaped in the decoded text. The surrogate-escaped values roundtrip back to bytes when printed, presumably as the terminal encoding.
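
A minimal sketch of that round trip, assuming a UTF-8 terminal and an ASCII locale encoding (the "C" locale case in 3.6). The decode call below only stands in for what Python does when it builds os.environ; it is not the interpreter's actual code path.

```python
# UTF-8 bytes for "中文", as a UTF-8 terminal would hand them to the shell.
raw = "中文".encode("utf-8")

# Decode the way an ASCII locale with "surrogateescape" would.
escaped = raw.decode("ascii", errors="surrogateescape")

print(len(escaped))    # 6, one code point per undecodable byte
print(ascii(escaped))  # '\udce4\udcb8\udcad\udce6\udc96\udc87'

# Encoding with the same handler restores the original bytes unchanged.
restored = escaped.encode("ascii", errors="surrogateescape")
assert restored == raw
```

The six escaped code points line up with "position 0-5" in the traceback from the original report.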

> with open('name.txt', 'w', encoding='utf-8')as fw:
>    fw.write(job_name)

The default errors handler for open() is "strict" instead of "surrogateescape", so the surrogate-escaped values in job_name cause the encoding to fail.
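
The difference between the two handlers can be sketched as follows. The temporary file path is hypothetical, and the surrogate-escaped string is hard-coded to stand in for the value read from os.environ. Passing errors="surrogateescape" to open() writes the original bytes back out, but it does not fix the underlying locale mismatch, and the file is only valid UTF-8 if the original bytes were.

```python
import os
import tempfile

escaped = "\udce4\udcb8\udcad\udce6\udc96\udc87"  # surrogate-escaped "中文"
path = os.path.join(tempfile.mkdtemp(), "name.txt")  # hypothetical location

# Default errors="strict": the UTF-8 codec refuses lone surrogates.
strict_reason = None
try:
    with open(path, "w", encoding="utf-8") as fw:
        fw.write(escaped)
except UnicodeEncodeError as exc:
    strict_reason = exc.reason
print("strict:", strict_reason)

# errors="surrogateescape": each surrogate is written back as its original byte.
with open(path, "w", encoding="utf-8", errors="surrogateescape") as fw:
    fw.write(escaped)

with open(path, "rb") as fr:
    data = fr.read()
print(data)
```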

> Your code runs for me on Windows

In Windows, Python uses the wide-character (16-bit wchar_t) environment of the process for os.environ, and, in 3.6+, it uses the console session's wide-character API for console files such as sys.std* when they aren't redirected to a pipe or disk file. Conventionally, wide-character strings should be valid UTF-16LE text. So getting "中文" from os.environ and printing it should 'just work'. The output will even be displayed correctly if the console session uses a font that supports "中文", or if it's a pseudoconsole (conpty) session that's attached to a terminal that supports automatic font fallback, such as Windows Terminal.
msg389571 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-26 23:12
Python works as expected: the UTF-8 codec doesn't allow encoding surrogate characters.

The surrogate characters come from os.environ['a'] because this environment variable contains bytes which cannot be decoded using the encoding returned by sys.getfilesystemencoding().

You should fix your system setup, especially the locale encoding. The string stored in the "a" environment variable was not encoded to the Python filesystem encoding:
https://docs.python.org/dev/glossary.html#term-filesystem-encoding-and-error-handler

If you are lost with locale encodings, you can attempt to encode everything in UTF-8 and enable the Python UTF-8 Mode:
https://docs.python.org/dev/library/os.html#python-utf-8-mode

Good luck with your setup ;-)

Hint: use print(ascii(job_name)) to dump the string content.
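
One way to apply that hint is sketched below. The helper name inspect_env_value is hypothetical, and the hard-coded string stands in for os.environ['a'] from the report. It assumes an ASCII-compatible filesystem encoding and a value that is otherwise encodable in it.

```python
import sys

def inspect_env_value(value):
    """Summarize a str taken from os.environ (hypothetical helper)."""
    # Code points produced by "surrogateescape" fall in U+DC80 - U+DCFF.
    escaped = [ch for ch in value if "\udc80" <= ch <= "\udcff"]
    return {
        "ascii": ascii(value),
        "escaped_count": len(escaped),
        # Recover the bytes that were originally in the environment.
        "raw_bytes": value.encode(sys.getfilesystemencoding(), "surrogateescape"),
    }

info = inspect_env_value("\udce4\udcb8\udcad\udce6\udc96\udc87")
print(info["ascii"])
print(info["escaped_count"])
print(info["raw_bytes"])
```

A non-zero escaped_count means the variable held bytes that the filesystem encoding could not decode, which points back at the locale setup rather than at the script.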
msg389572 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-26 23:13
Oh, I forgot to note that Windows is not affected by this issue, since Windows provides environment variables directly as Unicode, and so Python doesn't need to decode byte strings to read os.environ['a'] ;-)
msg389579 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-03-27 00:06
> If you are lost with locale encodings, you can attempt to encode
> everything in UTF-8 and enable the Python UTF-8 Mode:

rushant is using Python 3.6. UTF-8 mode was added in 3.7, so it's not an option without first upgrading to 3.7. Also, it's important to note that the suggestion to "attempt to encode everything in UTF-8" includes whatever terminal encoding or shell-script file encoding is used for `export a="中文"`. If it's not using UTF-8, then setting the preferred encoding in Python to UTF-8 isn't going to help.
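
To see which of these settings a given interpreter is running with, something like the following can be used. The function name encoding_report is hypothetical; the getattr() fallback is there because sys.flags.utf8_mode does not exist before 3.7.

```python
import locale
import sys

def encoding_report():
    """Collect settings relevant to how os.environ is decoded and printed."""
    return {
        "python": sys.version_info[:2],
        # sys.flags.utf8_mode only exists on 3.7+; treat it as off elsewhere.
        "utf8_mode": bool(getattr(sys.flags, "utf8_mode", 0)),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdout_errors": getattr(sys.stdout, "errors", None),
    }

report = encoding_report()
for key, value in report.items():
    print(key, "=", value)
```

On 3.7 and later, UTF-8 mode can be turned on with `python3 -X utf8` or by setting PYTHONUTF8=1 in the environment. As noted above, that only helps if the bytes stored in the variable are themselves UTF-8.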
History
Date                 User         Action  Args
2022-04-11 14:59:43  admin        set     github: 87742
2021-03-27 00:06:43  eryksun      set     messages: + msg389579
2021-03-26 23:13:42  vstinner     set     messages: + msg389572
2021-03-26 23:12:04  vstinner     set     status: open -> closed; resolution: not a bug; messages: + msg389571; stage: resolved
2021-03-26 20:45:35  eryksun      set     nosy: + ezio.melotti, vstinner, eryksun; messages: + msg389568; components: + Interpreter Core, Library (Lib), Unicode, IO, - C API
2021-03-26 19:09:18  terry.reedy  set     nosy: + terry.reedy; messages: + msg389560
2021-03-21 04:46:52  rushant      create