
classification
Title: [EASY] Missing code page aliases: "unknown encoding: 874"
Type: crash Stage: resolved
Components: Unicode Versions: Python 3.6
process
Status: closed Resolution: third party
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, methane, ronaldoussoren, serhiy.storchaka, steven.daprano, vstinner, winvinc, xtreak
Priority: normal Keywords: easy, patch

Created on 2018-06-15 00:39 by winvinc, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
Capture.PNG winvinc, 2018-06-15 00:39
33865.patch xtreak, 2018-06-16 12:01
Pull Requests
URL Status Linked Edit
PR 7705 closed xtreak, 2018-06-15 12:28
Messages (26)
msg319569 - (view) Author: Prawin Phichitnitikorn (winvinc) Date: 2018-06-15 00:39
This Error "
Current thread 0x0000238c (most recent call first): Fatal Python error: Py_Initialize: can’t initialize sys standard streams LookupError: unknown encoding: 874"

is caused by a missing mapping for encoding 874 in encodings\aliases.py
msg319570 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-06-15 01:09
Please don't post screenshots of text, they make it difficult for the blind and visually impaired to contribute. Instead, please copy and paste the error message into the body of your bug report. (Which I see you have done, which makes the screenshot unnecessary.)

Just reporting the error message alone is not very useful, we also should see the context of what you were doing when the error occurred.
msg319589 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-06-15 07:12
@Steven: Lib/encodings/aliases.py contains aliases for a (largish) number of encoding names, including both "cpXXXX" and "XXXX" for most Windows code pages. For code page 874 only the name "cp874" can be used and not "874", which apparently causes problems.

@Prawin: have you added an alias to aliases.py to check if adding an alias would fix the problem you're having?
msg319590 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-15 07:53
It seems like the following code pages have a Python codec (Lib/encodings/cpXXX.py) but lack an alias in Lib/encodings/aliases.py:

[720, 737, 856, 874, 875, 1006, 65001]

Would someone volunteer to write a pull request for that? It should be easy.

Example of a correct alias in Lib/encodings/aliases.py:

    # cp1252 codec
    '1252'               : 'cp1252',
    'windows_1252'       : 'cp1252',
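A list like the one above can be recomputed with a short script. This is only a minimal sketch, not the method used in the thread: the helper name `codepages_without_numeric_alias` is hypothetical, and the result depends on which codec modules and aliases ship with the Python version it runs on.

```python
import pkgutil
import re

import encodings
from encodings.aliases import aliases


def codepages_without_numeric_alias():
    """Return code page numbers that have a cpXXX codec module but no 'XXX' alias."""
    missing = []
    # Walk the modules that live in the encodings package directory.
    for module in pkgutil.iter_modules(encodings.__path__):
        match = re.fullmatch(r"cp(\d+)", module.name)
        # Keep only cpNNN modules whose bare number is not an alias key.
        if match and match.group(1) not in aliases:
            missing.append(int(match.group(1)))
    return sorted(missing)


if __name__ == "__main__":
    print(codepages_without_numeric_alias())
```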
msg319611 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2018-06-15 12:33
I have added the aliases as per comment by @vstinner https://bugs.python.org/msg319590 . I have used https://docs.python.org/3.8/library/codecs.html#standard-encodings as a reference to see if there are any additional aliases to add with respect to the second column. I am a beginner in contributing to cpython and hence please let me know if I have missed something or any way to test this.

PR : https://github.com/python/cpython/pull/7705

Thanks
msg319612 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-06-15 12:44
Could you also add a documentation update and a news entry? 

The section on standard encodings mentions aliases for standard encodings, and IMHO the new aliases should be added to that page. 

Creating a new entry is described here: https://devguide.python.org/committing/?highlight=blurb#what-s-new-and-news-entries
msg319613 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2018-06-15 13:12
Thanks @ronaldoussoren for the links. I have added an entry using blurb tool and updated the docs at Doc/library/codecs.rst with relevant aliases.

Thanks
msg319617 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-15 14:30
Why only these code pages? There are other cpXXXX encodings that don't have the XXXX alias.

Maybe add a logic in encodings.search_function() that will map XXXX to cpXXXX if it is all digits? Maybe even map ibmXXXX and windows_XXXX to cpXXXX, but this will create false aliases like ibm1252 and windows_437.
msg319716 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2018-06-16 05:14
As I went through Lib/encodings/aliases.py, I found certain all-digit entries that don't correspond to a cpXXXX codec. I think the search function is used not only for encodings that start with 'cp', so adding this logic might require checks for extra cases.

Sample cases : 

'936'                : 'gbk'
'8859'               : 'latin_1'
'646'                : 'ascii'

I also have limited knowledge on working through encodings/__init__.py so correct me if I am wrong on the above.

Thanks.
msg319717 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-16 05:20
Of course entries in the alias table should have a precedence.
msg319725 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2018-06-16 07:37
Thanks @serhiy.storchaka . I looked into the code and it seems the resolution is done in `search_function` at Lib/encodings/__init__.py . The encoding name is first normalized, and the normalized name is then looked up in `aliases`, which is the dictionary where I have added the alias. If it's not found, '.' is replaced with '_' and the lookup is tried again. I think this is the place to check whether aliased_encoding is None after both attempts and norm_encoding is all digits, and if so prepend "cp" to norm_encoding and check again against the `aliases` dictionary. Unfortunately, print and pdb don't work inside the function, and I don't know how to test this change or write test cases for it.

Any pointers will be highly helpful.

Thanks
msg319728 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-16 07:51
It is easy to test it. Encoding/decoding with '874' should give the same result as with 'cp874'.
msg319733 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2018-06-16 12:01
I am able to verify the newly added aliases using the below assert statement

assert codecs.encode('a', '874') == codecs.encode('a', 'cp874')

I am stuck on the part where this could be patched in search_function, and I hope this is the approach @serhiy.storchaka was suggesting. After the usual logic, I check whether aliased_encoding is None and the normalized encoding is all digits; if so, I prepend 'cp' and call search_function again. That way cases like '936' first hit the alias table, which has higher precedence, while other all-digit names resolve to the 'cpXXXX' codec even though no alias entry is present.

I have tested it by removing the newly added '874' from aliases.py, so that 'cp874' is returned instead of an error. Because the name no longer consists only of digits once 'cp' is prepended, the second call cannot recurse again, so invalid names do not cause infinite recursion.

Thanks
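The fallback discussed above (alias table first, then "cp" plus the digits) can be sketched outside of `search_function`. This is a standalone illustration under stated assumptions, not the contents of 33865.patch: the function name `resolve_numeric_codepage` is hypothetical, and the name normalization is deliberately simpler than what Lib/encodings/__init__.py does.

```python
import codecs
from encodings.aliases import aliases


def resolve_numeric_codepage(name):
    """Resolve an encoding name, falling back from 'XXXX' to 'cpXXXX'.

    Returns the resolved codec name, or None if nothing matches.
    """
    # Simplified normalization (assumption: the real search_function does more).
    normalized = name.strip().lower().replace("-", "_").replace(" ", "_")

    # Entries in the alias table take precedence, so '936' stays 'gbk'.
    if normalized in aliases:
        return aliases[normalized]

    # Only all-digit names get the 'cp' prefix; the prefixed name is no longer
    # all digits, so this step cannot repeat.
    if normalized.isascii() and normalized.isdigit():
        candidate = "cp" + normalized
        try:
            codecs.lookup(candidate)
        except LookupError:
            return None
        return candidate

    return None


if __name__ == "__main__":
    for name in ("936", "646", "874", "99999"):
        print(name, "->", resolve_numeric_codepage(name))
```

Checking the alias table before prepending "cp" is what keeps entries such as '936' and '646' from being redirected to code page codecs.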
msg319877 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-06-18 11:19
I'm not convinced that adding code to search_function is the right solution for this. 

BTW, I'm also not sure yet why this error happens. Does Windows return a code page number as the preferred encoding when the io module looks for one? If so, wouldn't it be better to correct the encoding name there (from the code page number to a string with a "cp" prefix)?
msg319881 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2018-06-18 13:16
I think if we can get a confirmation from @Prawin that adding an alias fixed the issue or a minimal test case then it will be helpful. The minimal I can come up with is as below :  

import codecs

# Fails without the alias being added; other cases like 1252 pass because of their alias

assert codecs.encode('a', '874') == codecs.encode('a', 'cp874')

# The same assertion passes after the search_function patch even though the alias is not added, since I prepend cp in search_function

assert codecs.encode('a', '874') == codecs.encode('a', 'cp874')


Thanks
msg319883 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-06-18 13:37
Confirmation that the patch actually fixes the problem would be nice, but I'd still like to understand why Python tries to use an encoding with the name "874" as this might lead to a nicer solution to the problem.

BTW. There is some discussion on this issue on the python-ideas mailinglist.
msg319926 - (view) Author: Prawin Phichitnitikorn (winvinc) Date: 2018-06-19 03:57
Sorry for the late reply,

But for me I'm resolve by adding 

# cp874 codec
'874'                : 'cp874',

to the aliases.py file
msg319927 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2018-06-19 05:00
Thanks @prawin for the confirmation. There is a mailing list discussion at https://groups.google.com/forum/#!topic/python-ideas/Ny1RN9wY0cI and it seems this is related to the Thai language locale. Feel free to add any further input, for example whether this is reproducible on other machines with a Thai locale. There is a PR that adds the alias along with other missing items, but I will wait for others to chime in to see if there is a better solution.

Thanks.
msg319949 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-06-19 10:21
In particular, we're interested in the following information:

* What OS is installed on your machine?
* What locale (country/language) is configured?
* What does "import locale; print(locale._getdefaultlocale())" print?
msg319950 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-19 10:25
* Do you use a regular Python interpreter, or one embedded in another program?
msg319976 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-06-19 15:01
@Serhiy: The screenshot suggests that this is a regular Python install.
msg319978 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-19 15:17
Prawin Phichitnitikorn: "But for me I'm resolve by adding (...)"

Ok, so can you please give the value of:

* sys.stdin.encoding
* sys.stdout.encoding
* sys.stderr.encoding
* os.device_encoding(0)
* os.device_encoding(1)
* os.device_encoding(2)
* locale.getpreferredencoding(False)

Maybe also the .errors attribute of sys.stdin, sys.stdout and sys.stderr.
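The values requested above can be gathered in one go. The following is a small sketch; the helper name `collect_stream_encodings` is hypothetical, while every attribute and function it reads is one named in the list above.

```python
import locale
import os
import sys


def collect_stream_encodings():
    """Collect the encoding-related values asked for in this message."""
    report = {}
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name)
        # A stream can be None (for example under pythonw.exe), hence getattr.
        report[f"sys.{name}.encoding"] = getattr(stream, "encoding", None)
        report[f"sys.{name}.errors"] = getattr(stream, "errors", None)
    for fd in (0, 1, 2):
        try:
            report[f"os.device_encoding({fd})"] = os.device_encoding(fd)
        except OSError as exc:
            report[f"os.device_encoding({fd})"] = f"error: {exc}"
    report["locale.getpreferredencoding(False)"] = locale.getpreferredencoding(False)
    return report


if __name__ == "__main__":
    for key, value in collect_stream_encodings().items():
        print(f"{key}: {value!r}")
```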
msg320424 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-06-25 16:06
When I searched for "Unknown encoding 874", I saw that some people ran into this trouble with Anaconda installations.

I don't know what the Anaconda setup does, but this will not happen on normal CPython.
Since Python 3.6, we use UTF-8 by default on Windows for fsencoding and the console encoding.
msg320425 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-06-25 16:19
I grepped PYTHONIOENCODING and found this line.
https://github.com/conda/conda/blob/082fe8fd7458ecd9dd7547749039f4b1f06d76db/conda/activate.py#L726
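To see how PYTHONIOENCODING can affect interpreter startup, a child interpreter can be launched with the variable set. This is a sketch under an assumption the thread suggests but does not state outright: that the variable ended up holding the bare code page number. The helper name `run_with_io_encoding` is hypothetical.

```python
import os
import subprocess
import sys


def run_with_io_encoding(value):
    """Start a child interpreter with PYTHONIOENCODING set to `value`."""
    env = dict(os.environ, PYTHONIOENCODING=value)
    return subprocess.run(
        [sys.executable, "-c", "print('ok')"],
        env=env,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    for value in ("cp874", "874"):
        result = run_with_io_encoding(value)
        print(value, "-> return code", result.returncode)
        if result.stderr:
            print(result.stderr.strip())
```

With "cp874" the child should start normally. With "874", an interpreter that has no such alias would be expected to fail while initializing its standard streams, which matches the error in the original report.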
msg320426 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-06-25 16:37
I found the original pull request and issue report:

https://github.com/conda/conda/pull/4558
https://github.com/ContinuumIO/anaconda-issues/issues/1410
msg320429 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-25 16:50
Thank you Inada-san! It seems this issue can be closed as a third-party issue.
History
Date User Action Args
2022-04-11 14:59:01 admin set github: 78046
2018-07-12 13:50:50 vstinner set status: open -> closed; resolution: third party; stage: patch review -> resolved
2018-06-25 16:50:54 serhiy.storchaka set messages: + msg320429
2018-06-25 16:37:33 methane set messages: + msg320426
2018-06-25 16:19:21 methane set messages: + msg320425
2018-06-25 16:06:23 methane set messages: + msg320424
2018-06-25 15:59:09 methane set nosy: + methane
2018-06-19 15:17:06 vstinner set messages: + msg319978
2018-06-19 15:01:25 ronaldoussoren set messages: + msg319976
2018-06-19 10:25:28 serhiy.storchaka set messages: + msg319950
2018-06-19 10:21:29 ronaldoussoren set messages: + msg319949
2018-06-19 05:00:47 xtreak set messages: + msg319927
2018-06-19 03:57:25 winvinc set messages: + msg319926
2018-06-18 13:37:56 ronaldoussoren set messages: + msg319883
2018-06-18 13:16:22 xtreak set messages: + msg319881
2018-06-18 11:19:40 ronaldoussoren set messages: + msg319877
2018-06-16 12:01:24 xtreak set files: + 33865.patch; messages: + msg319733
2018-06-16 07:51:58 serhiy.storchaka set messages: + msg319728
2018-06-16 07:37:51 xtreak set messages: + msg319725
2018-06-16 05:20:34 serhiy.storchaka set messages: + msg319717
2018-06-16 05:14:55 xtreak set messages: + msg319716
2018-06-15 14:30:05 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg319617
2018-06-15 13:12:20 xtreak set messages: + msg319613
2018-06-15 12:44:55 ronaldoussoren set messages: + msg319612
2018-06-15 12:33:09 xtreak set nosy: + xtreak; messages: + msg319611
2018-06-15 12:28:43 xtreak set keywords: + patch; stage: patch review; pull_requests: + pull_request7321
2018-06-15 07:53:42 vstinner set keywords: + easy; messages: + msg319590; title: unknown encoding: 874 -> [EASY] Missing code page aliases: "unknown encoding: 874"
2018-06-15 07:12:37 ronaldoussoren set nosy: + ronaldoussoren; messages: + msg319589
2018-06-15 01:09:00 steven.daprano set nosy: + steven.daprano; messages: + msg319570
2018-06-15 00:39:11 winvinc create