classification
Title: ascii codec is used by default when LANG is not set
Type: behavior
Stage: resolved
Components: Interpreter Core
Versions: Python 3.6
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: eryksun, lemburg, od-cea
Priority: normal
Keywords:

Created on 2021-09-17 12:30 by od-cea, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (6)
msg402046 - (view) Author: Olivier Delhomme (od-cea) Date: 2021-09-17 12:30
$ python3 --version
Python 3.6.4

Setting LANG to en_US.UTF8 works like a charm

$ export LANG=en_US.UTF8   
$ python3
Python 3.6.4 (default, Jan 11 2018, 16:45:55) 
[GCC 4.8.5] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> machaine='ééééhelp me if you can'
>>> print('{}'.format(machaine))
ééééhelp me if you can


Unsetting the LANG shell variable makes the program fail:

$ unset LANG
$ python3
Python 3.6.4 (default, Jan 11 2018, 16:45:55) 
[GCC 4.8.5] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> machaine='ééééhelp me if you can'
  File "<stdin>", line 0
    
    ^
SyntaxError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)


Setting LANG inside the program does not change this behavior:

$ unset LANG
$ python3
Python 3.6.4 (default, Jan 11 2018, 16:45:55) 
[GCC 4.8.5] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['LANG'] = 'en_US.UTF8'
>>> machaine='ééééhelp me if you can'
  File "<stdin>", line 0
    
    ^
SyntaxError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)


Is this expected behavior? How can I force a UTF-8 codec?
msg402051 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2021-09-17 13:19
Yes, this is intended. ASCII is used as a fallback in case Python
cannot determine the I/O encoding to use during startup. This is
also the reason why later changes to the environment have no
effect on this: the encoding has already been determined by
then.

You can force UTF-8 by enabling the UTF-8 mode:

export PYTHONUTF8=1

This will then have Python use UTF-8 regardless of the LANG
env var setting.
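
The startup-time behaviour described above can be checked with a small sketch like the following. It assumes Python 3.7 or later (sys.flags.utf8_mode does not exist before that), and the helper names utf8_mode_after_env_change and utf8_mode_in_child are made up for this illustration. It sets PYTHONUTF8 inside the running process, which leaves the flag untouched, and then starts fresh interpreters that do pick the variable up.

```python
import os
import subprocess
import sys


def utf8_mode_after_env_change():
    """Set PYTHONUTF8 in the running process and return the (unchanged) flag."""
    saved = os.environ.get("PYTHONUTF8")
    os.environ["PYTHONUTF8"] = "1"
    try:
        # sys.flags was filled in at startup; editing os.environ now is too late.
        return sys.flags.utf8_mode
    finally:
        # Restore the environment to its previous state.
        if saved is None:
            del os.environ["PYTHONUTF8"]
        else:
            os.environ["PYTHONUTF8"] = saved


def utf8_mode_in_child(value):
    """Start a fresh interpreter with PYTHONUTF8=value and return its flag."""
    env = dict(os.environ)
    env["PYTHONUTF8"] = value
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("flag at startup:          ", sys.flags.utf8_mode)
    print("flag after os.environ set:", utf8_mode_after_env_change())
    print("child with PYTHONUTF8=1:  ", utf8_mode_in_child("1"))
    print("child with PYTHONUTF8=0:  ", utf8_mode_in_child("0"))
```

Only the child interpreters see a different setting, because they read the variable during their own startup.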
msg402054 - (view) Author: Olivier Delhomme (od-cea) Date: 2021-09-17 13:45
Hi Marc-Andre,

Please note that setting PYTHONUTF8 with "export PYTHONUTF8=1":

* is external to the program and user-dependent
* does not seem to work for my use case:

  $ unset LANG
  $ export PYTHONUTF8=1
  $ python3 
  Python 3.6.4 (default, Jan 11 2018, 16:45:55) 
  [GCC 4.8.5] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> machaine='ééééhelp me if you can'
     File "<stdin>", line 0
    
       ^
   SyntaxError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)


Regards,

Olivier.
msg402058 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2021-09-17 14:10
UTF-8 mode is only supported in Python 3.7 and later:

   https://docs.python.org/3/whatsnew/3.7.html#whatsnew37-pep540
-- 
Marc-Andre Lemburg
eGenix.com
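
For interpreters without UTF-8 mode, one possible in-program workaround (not something proposed in this thread) is to wrap the binary standard streams with an explicit encoding. The sketch below uses io.TextIOWrapper, which is also available in Python 3.6; the helper name as_utf8_text is hypothetical, and in-memory buffers stand in for sys.stdin.buffer and sys.stdout.buffer so that the example is self-contained. It applies to scripts, and is not expected to change how the interactive prompt decodes typed input.

```python
import io


def as_utf8_text(binary_stream):
    """Wrap a binary stream so text is always encoded/decoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", errors="strict")


# In a real script the binary streams would be sys.stdout.buffer and
# sys.stdin.buffer; BytesIO objects are used here as stand-ins.
raw_out = io.BytesIO()
out = as_utf8_text(raw_out)
out.write("ééééhelp me if you can")
out.flush()  # push the encoded bytes down into raw_out

raw_in = io.BytesIO("ééééhelp me if you can".encode("utf-8"))
text_in = as_utf8_text(raw_in).read()
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another option for the standard streams, and the PYTHONIOENCODING environment variable can set their encoding from outside the program.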
msg402059 - (view) Author: Olivier Delhomme (od-cea) Date: 2021-09-17 14:54
Oh. Thanks.

$ unset LANG
$ export PYTHONUTF8=1
$ python3
Python 3.7.5 (default, Dec 24 2019, 08:52:13)
[GCC 4.8.5] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> machaine='ééééhelp me if you can'
>>>

Setting PYTHONUTF8 from within the program:

$ unset LANG
$ unset PYTHONUTF8
$ python3
Python 3.7.5 (default, Dec 24 2019, 08:52:13)
[GCC 4.8.5] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['PYTHONUTF8'] = '1'
>>> machaine='ééééhelp me if you can'
>>>

Even better:

$ unset LANG
$ unset PYTHONUTF8
$ python3
Python 3.7.5 (default, Dec 24 2019, 08:52:13)
[GCC 4.8.5] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> machaine='ééééhelp me if you can'
>>>

Works as expected. Thank you very much. You can close this bug report.

Regards,

Olivier.
msg402062 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-09-17 15:32
On POSIX systems, Python 3.7+ does not need UTF-8 mode to be enabled explicitly in this case. If the LC_CTYPE locale is "POSIX" or "C", and "C" locale coercion is not disabled via LC_ALL or PYTHONCOERCECLOCALE=0, the interpreter tries to coerce the LC_CTYPE locale to "C.UTF-8", "C.utf8", or "UTF-8". If these attempts fail, or if coercion is disabled, the interpreter automatically enables UTF-8 mode, unless that is also explicitly disabled. For example:

    $ unset LANG
    $ unset LC_ALL
    $ unset PYTHONCOERCECLOCALE
    $ unset PYTHONUTF8         
    $ python -c 'import locale; print(locale.getpreferredencoding())'
    UTF-8

    $ PYTHONCOERCECLOCALE=0 python -c 'import locale; print(locale.getpreferredencoding())'
    UTF-8

    $ PYTHONUTF8=0 python -c 'import locale; print(locale.getpreferredencoding())'
    UTF-8

    $ PYTHONCOERCECLOCALE=0 PYTHONUTF8=0 python -c 'import locale; print(locale.getpreferredencoding())'
    ANSI_X3.4-1968
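
To see which of these mechanisms is in effect for a given interpreter, a small report like the one below may help. This is only a sketch: the function name encoding_report and the choice of fields are assumptions made for illustration, and getattr is used for utf8_mode so that the snippet also runs on Python 3.6, where that flag does not exist.

```python
import locale
import os
import sys

# Environment variables mentioned in this thread that influence the result.
ENV_NAMES = ("LANG", "LC_ALL", "LC_CTYPE", "PYTHONUTF8",
             "PYTHONCOERCECLOCALE", "PYTHONIOENCODING")


def encoding_report():
    """Collect the encoding-related settings of the running interpreter."""
    return {
        "env": {name: os.environ.get(name) for name in ENV_NAMES},
        # None on Python 3.6, which has no UTF-8 mode.
        "utf8_mode": getattr(sys.flags, "utf8_mode", None),
        # False: report the current setting without calling setlocale().
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in encoding_report().items():
        print("{}: {}".format(key, value))
```

Running it under the different variable combinations shown above makes it easier to tell locale coercion (LC_CTYPE changes in the environment) apart from UTF-8 mode (utf8_mode is 1).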
History
Date                 User     Action  Args
2022-04-11 14:59:50  admin    set     github: 89395
2021-09-17 15:32:18  eryksun  set     nosy: + eryksun; messages: + msg402062
2021-09-17 15:09:12  lemburg  set     status: open -> closed; resolution: not a bug; stage: resolved
2021-09-17 14:54:29  od-cea   set     messages: + msg402059
2021-09-17 14:10:40  lemburg  set     messages: + msg402058
2021-09-17 13:45:45  od-cea   set     messages: + msg402054
2021-09-17 13:19:59  lemburg  set     nosy: + lemburg; messages: + msg402051
2021-09-17 12:30:29  od-cea   create