classification
Title: Windows: socket.gethostbyaddr(name) fails for non-ASCII hostname
Type: behavior Stage: resolved
Components: Library (Lib), Unicode, Windows Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: steve.dower Nosy List: abarry, eryksun, ezio.melotti, miss-islington, paul.moore, python-dev, serhiy.storchaka, steve.dower, tim.golden, williamdias, zach.ware, Владимир Мартьянов
Priority: high Keywords: 3.10regression, 3.6regression, 3.7regression, 3.8regression, 3.9regression, patch

Created on 2016-01-28 00:58 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
gethostbyaddr_encoding.patch vstinner, 2016-01-28 00:59
gethostbyaddr_encoding-2.patch vstinner, 2016-01-28 08:49 review
gethostbyaddr_encoding-3.patch vstinner, 2016-01-28 09:41 review
Pull Requests
URL Status Linked Edit
PR 25510 merged steve.dower, 2021-04-21 22:40
PR 25512 merged miss-islington, 2021-04-21 23:18
PR 25513 merged miss-islington, 2021-04-21 23:18
Messages (27)
msg259078 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 00:58
On Windows, socket.gethostbyaddr() must decode the hostname from the ANSI code page, not from UTF-8. See for example this issue:
https://bugs.python.org/issue26226#msg259077

Attached patch changes the socket module to decode hostnames from the ANSI code page on Windows.

See also issues #9377, #16652 and #5004.
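The mismatch described above can be reproduced on any platform with plain byte strings. This is only a sketch: the hostname is the example one used later in this thread, and code page 1252 is an assumption standing in for whatever ANSI code page a given Windows machine uses.

```python
# Hypothetical hostname, used for illustration only.
hostname = "héça"

# Bytes as an ANSI Windows API might return them, ASSUMING code page 1252.
ansi_bytes = hostname.encode("cp1252")


def decode_hostname(raw: bytes, encoding: str) -> str:
    """Decode raw hostname bytes with an explicit encoding (illustrative helper)."""
    return raw.decode(encoding)


# Decoding with the matching code page round-trips.
decoded = decode_hostname(ansi_bytes, "cp1252")

# Decoding the same bytes as UTF-8 raises UnicodeDecodeError.
try:
    decode_hostname(ansi_bytes, "utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(ansi_bytes)
print(utf8_error)
```

The real decoding happens in C inside the socket module; the helper here only mirrors the choice of codec.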
msg259085 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-01-28 02:34
Might be nice to switch the socket APIs to the Unicode ones universally. That would also clear up a range of deprecation warnings on build.
msg259092 - (view) Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-01-28 05:23
FWIW this patch doesn't fix the test_httpservers failure (or any other) in #26226
msg259106 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-01-28 07:55
The patch is missing the "errors" parameter of PyUnicode_DecodeLocale. But it should call PyUnicode_DecodeMBCS instead. In the "C" locale, PyUnicode_DecodeLocale is Latin-1 because the CRT mbstowcs just casts the values to wchar_t.

socket_getnameinfo also decodes as UTF-8:

    >>> socket.getnameinfo(('127.0.0.1', 20), 0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte

Steve, does your suggestion include reimplementing socket.gethostbyaddr and socket.gethostbyname_ex using GetNameInfoW and GetAddrInfoW? gethostbyaddr and gethostbyname are deprecated and lack a Unicode implementation.
msg259108 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 08:48
> The patch is missing the "errors" parameter of PyUnicode_DecodeLocale.

Woops, I shouldn't write patches in the middle of the night :-) Fortunately, I didn't push it :-) PyUnicode_DecodeLocale() should only be used when the encoding depends on the *current* value of LC_CTYPE.

Here, the ANSI code page is fine, and so PyUnicode_DecodeFSDefault() should be used instead.

> socket_getnameinfo also decodes as UTF-8

Hum, let me try a new patch. It decodes the hostname from the ANSI code page on Windows for:

* socket.getnameinfo()
* socket.gethostbyaddr()
* socket.gethostbyname_ex()

The behaviour on other platforms is unchanged.
msg259111 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-28 09:29
Added comments on Rietveld.
msg259112 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 09:41
> Added comments on Rietveld.

Crap. It's easy to miss a compilation error on extensions :-/

I used "make && ./python -m test -v test_socket" to validate gethostbyaddr_encoding-2.patch and it succeeded.

Maybe we should change setup.py to *fail* if an extension fails to compile?

The new patch should have fewer typos :-) I also checked for reference leaks using ./python -m test -R 3:3 test_socket => no leak.


> Why not use PyUnicode_DecodeFSDefault on all platforms? It is used in
gethostname() on Unix.

I don't know which encoding is the best choice on UNIX. I prefer to move step by step and fix an obvious bug on Windows that is blocking Émanuel (see his issue #26226). (Émanuel uses Émanuel-PC as his hostname, a non-ASCII hostname ;-))

I guess that UTF-8 works in most cases on UNIX, whereas using the locale encoding can introduce regressions if the hostname is non-ASCII. For example, decoding a non-ASCII hostname would fail with LANG=C, which forces an ASCII locale encoding.

Issue #9377 proposes more advanced code to choose the encoding used to decode hostnames. Sorry, I haven't followed that issue recently, so I don't know if it proposes to use surrogateescape and/or IDNA.

I prefer to discuss the encoding used on UNIX in a new issue (or, better, continue the existing discussion in issue #9377?).
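The LANG=C concern can be illustrated without touching the locale at all. This is a sketch that assumes the hostname bytes are UTF-8 and that an ASCII locale encoding behaves like a strict "ascii" decode. The surrogateescape part only shows what that error handler does; it is not a statement of what issue #9377 proposes.

```python
# Hypothetical hostname bytes, ASSUMED to be UTF-8 encoded.
raw = "héça".encode("utf-8")

# UTF-8 decoding does not depend on the locale.
as_utf8 = raw.decode("utf-8")

# A strict ASCII decode, standing in for an ASCII locale encoding, rejects the bytes.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# surrogateescape maps each undecodable byte to a lone surrogate,
# so the original bytes can be recovered by encoding with the same handler.
escaped = raw.decode("ascii", errors="surrogateescape")
round_tripped = escaped.encode("ascii", errors="surrogateescape")

print(ascii_failed)
print(round_tripped == raw)
```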
msg259113 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 09:42
By the way, thanks for your reviews. Code review rocks ;-)
msg259129 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-01-28 14:05
I couldn't remember the names of the alternate functions Windows provides to do the encoding for you, but yes. There are socket APIs there that do encoding and handle memory allocation more safely.

Apart from bugs like this, it's not really urgent and it requires someone motivated to do it. Might be a good project for someone at the PyCon sprints.
msg259131 - (view) Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-01-28 14:12
Yes, it's not all that urgent. And Victor's latest patch doesn't work, either :(

I wonder if there's a way to (temporarily) modify the output of ``socket.gethostname()`` to be able to test such weird corner cases.
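One possible way to do this at the Python level is unittest.mock.patch. The sketch below assumes the code under test looks up socket.gethostname() at call time; describe_host is a hypothetical stand-in for such code. Note that this replaces the Python-visible function only, so it does not exercise the C decoding path in the socket module that this issue is about.

```python
import socket
from unittest import mock

FAKE_NAME = "héça"  # hypothetical non-ASCII hostname


def describe_host() -> str:
    """Stand-in for application code that reads the hostname."""
    return "host=" + socket.gethostname()


# Inside the with block, socket.gethostname() returns the fake name.
with mock.patch("socket.gethostname", return_value=FAKE_NAME):
    patched = describe_host()

# After the block, the real function is restored automatically.
restored = socket.gethostname()
```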
msg259133 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 14:23
> Yes, it's not all that urgent. And Victor's latest patch doesn't work, either :(

Could you please elaborate? Does the patch apply cleanly? Did you rebuild the socket module? Which error message do you get?
msg259134 - (view) Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-01-28 14:41
Oh, sorry. The patch applies without any problem, then I re-compile everything and run, and the same error happens. I re-compiled just now to make double sure.
msg259135 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-01-28 14:45
New changeset 0681f0a1fe6e by Victor Stinner in branch '3.5':
Windows: Decode hostname from ANSI code page
https://hg.python.org/cpython/rev/0681f0a1fe6e

New changeset 26f6d8cc2749 by Victor Stinner in branch 'default':
Merge 3.5: Issue #26227
https://hg.python.org/cpython/rev/26f6d8cc2749
msg259136 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 14:48
> Oh, sorry. The patch applies without any problem, then I re-compile everything and run, and the same error happens. I re-compiled just now to make double sure.

I tested my patch on Windows. I called my computer héça (3 non-ASCII letters!). Without the patch, I get the UTF-8 decoding error, as expected. With the patch, I get the nice "héça" Unicode string, correctly decoded. I tested socket.getfqdn().

My patch will not fix all your issues at once :-) In the issue #26226, I saw at least 3 different bugs. But I'm now sure that my patch fixes a real bug, so I pushed it to Python 3.5 and default (3.6).

Thanks for the bug report Emanuel ;-)
msg259137 - (view) Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-01-28 14:51
If it worked for you, I assume it's fine and I probably did something wrong on my side. Thanks for the fix!
msg259138 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 14:51
Steve:
"""
I couldn't remember the names of the alternate functions Windows provides to do the encoding for you, but yes. There are socket APIs there that do encoding and handle memory allocation more safely.

Apart from bugs like this, it's not really urgent and it requires someone motivated to do it. Might be a good project for someone at the PyCon sprints.
"""

Yeah, using the native Windows API is better: it gives access to the full Unicode character set. But it requires spending time on the C code, and *I* am not interested in working on such a project.

If you are motivated, please open a new issue for that. If you are not motivated, I'm not sure that it's worth opening a bug report. Using a hostname not encodable to the ANSI code page would probably cause serious issues (not in Python, but in other applications).

When I played with filenames not encodable to the ANSI code page, I also got errors from multiple applications, whereas Python now uses the native Windows API to access the filesystem. So sometimes Python is better than some other applications, sometimes it's as good :-)
msg259208 - (view) Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-01-29 13:25
For future reference, Victor's patch does fix it, I was checking the wrong thing when testing.
msg259212 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-29 16:07
He he, no problem. Thanks again for the bug report. I'm surprised that nobody reported it before.
msg358409 - (view) Author: William Dias (williamdias) Date: 2019-12-15 00:28
Shouldn't this issue be solved for Python 3.7.5? Or do I have to manually apply the patch?

I have a Windows 8.1 x64 PC whose hostname contains special characters. When creating a socket, the gethostbyaddr() method raises a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 1.

Let me know if you need more information.

Thanks
msg363709 - (view) Author: Владимир Мартьянов (Владимир Мартьянов) Date: 2020-03-09 10:05
I have Python 3.7.4 and get the same exception:

    print (socket.gethostbyaddr("127.0.0.1"))
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 6: invalid continuation byte

My OS is Win7 and my computer name contains Cyrillic characters.
msg363710 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-03-09 10:15
Looks like what was actually applied was changed to PyUnicode_DecodeFSDefault, which was later changed on Windows to be always UTF-8.

They'll need to be normalised throughout Modules/socketmodule.c (I can't tell at a quick glance which need updating). We should also figure out some kind of test that can catch this sort of issue.

Sorry for guiding you wrong a few years ago, Victor!
msg363712 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-03-09 10:17
If someone wants to do the extra work to use the native Windows socket functions on Windows instead of the POSIX-ey wrappers, that should happen under a new issue. This bug is for a regression that we ought to fix back as far as we can.
msg364114 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-13 17:31
sock_decode_hostname() of socketmodule.c currently uses PyUnicode_DecodeFSDefault() on Windows. PyUnicode_DecodeFSDefault() uses UTF-8 by default (PEP 529).

I understand that the ANSI code page should be used instead of UTF-8.

Would it work to use PyUnicode_DecodeLocale(name, "surrogatepass")? It's implemented with mbstowcs(), but I don't recall which encoding it uses on Windows.

Or can we use PyUnicode_DecodeMBCS(name, strlen(name), "surrogatepass")?

--

I understand that setting PYTHONLEGACYWINDOWSFSENCODING environment variable to 1 should work around the issue.
msg364116 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-03-13 18:28
I think PyUnicode_DecodeMBCS(name, strlen(name), "surrogatepass") captures the intention better, and is less likely to break in the future (apart from all the ways it's currently broken :) )

You should be right about the workaround too.
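At the Python level, the idea discussed above might look roughly like the sketch below. It is not the C change itself: decode_ansi_hostname is a hypothetical helper, "mbcs" is the Windows-only Python codec for the active ANSI code page, and the fallback to UTF-8 elsewhere is an assumption that matches the "other platforms unchanged" intent stated earlier. The default strict error handler is used here rather than "surrogatepass".

```python
import sys


def decode_ansi_hostname(raw: bytes) -> str:
    """Decode hostname bytes using the ANSI code page on Windows (sketch)."""
    if sys.platform == "win32":
        # "mbcs" exists only on Windows and follows the active ANSI code page.
        return raw.decode("mbcs")
    # Assumption for this sketch: other platforms keep decoding as UTF-8.
    return raw.decode("utf-8")


# On Windows this reports "utf-8" since PEP 529, or "mbcs" when
# PYTHONLEGACYWINDOWSFSENCODING is set; elsewhere it depends on the platform.
print(sys.getfilesystemencoding())
print(decode_ansi_hostname(b"localhost"))
```

Checking sys.getfilesystemencoding() is a quick way to see whether the legacy workaround is in effect for a given process.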
msg391559 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-04-21 23:18
New changeset dc516ef8395d15da0ab225eb0dceb2e0581f51ca by Steve Dower in branch 'master':
bpo-26227: Fixes decoding of host names on Windows from ANSI instead of UTF-8 (GH-25510)
https://github.com/python/cpython/commit/dc516ef8395d15da0ab225eb0dceb2e0581f51ca
msg391563 - (view) Author: miss-islington (miss-islington) Date: 2021-04-21 23:36
New changeset f7bc44170b4bdd16c46b4b6acf7673ffc24dfb19 by Miss Islington (bot) in branch '3.8':
bpo-26227: Fixes decoding of host names on Windows from ANSI instead of UTF-8 (GH-25510)
https://github.com/python/cpython/commit/f7bc44170b4bdd16c46b4b6acf7673ffc24dfb19
msg391564 - (view) Author: miss-islington (miss-islington) Date: 2021-04-21 23:43
New changeset d8576b1d15155688a67baac24c15254700bdd3b7 by Miss Islington (bot) in branch '3.9':
bpo-26227: Fixes decoding of host names on Windows from ANSI instead of UTF-8 (GH-25510)
https://github.com/python/cpython/commit/d8576b1d15155688a67baac24c15254700bdd3b7
History
Date User Action Args
2022-04-11 14:58:26 admin set github: 70415
2021-04-21 23:43:43 miss-islington set messages: + msg391564
2021-04-21 23:36:43 miss-islington set messages: + msg391563
2021-04-21 23:23:53 steve.dower set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2021-04-21 23:18:40 miss-islington set pull_requests: + pull_request24231
2021-04-21 23:18:31 miss-islington set nosy: + miss-islington
pull_requests: + pull_request24230
2021-04-21 23:18:23 steve.dower set messages: + msg391559
2021-04-21 22:41:49 steve.dower set assignee: steve.dower
2021-04-21 22:40:35 steve.dower set keywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request24228
2021-03-29 12:33:11 vstinner set nosy: - vstinner
2021-03-28 03:26:49 eryksun set keywords: + 3.6regression, 3.7regression, 3.8regression, 3.9regression, 3.10regression, - patch
stage: test needed -> needs patch
type: crash -> behavior
components: + Library (Lib)
versions: + Python 3.10, - Python 3.7
2020-11-30 19:52:03 steve.dower link issue42495 superseder
2020-03-13 18:28:19 steve.dower set messages: + msg364116
2020-03-13 17:31:04 vstinner set messages: + msg364114
2020-03-09 10:17:33 steve.dower set priority: normal -> high
messages: + msg363712
2020-03-09 10:15:36 steve.dower set status: closed -> open
versions: + Python 3.8, Python 3.9
messages: + msg363710
resolution: fixed -> (no value)
stage: test needed
2020-03-09 10:05:34 Владимир Мартьянов set nosy: + Владимир Мартьянов
messages: + msg363709
2019-12-15 00:28:14 williamdias set versions: + Python 3.7, - Python 3.5, Python 3.6
nosy: + williamdias
messages: + msg358409
type: crash
2016-01-29 16:07:49 vstinner set messages: + msg259212
2016-01-29 13:25:45 abarry set messages: + msg259208
2016-01-28 14:51:42 vstinner set messages: + msg259138
2016-01-28 14:51:39 abarry set messages: + msg259137
2016-01-28 14:48:17 vstinner set status: open -> closed
resolution: fixed
messages: + msg259136
2016-01-28 14:45:53 python-dev set nosy: + python-dev
messages: + msg259135
2016-01-28 14:41:39 abarry set messages: + msg259134
2016-01-28 14:23:18 vstinner set messages: + msg259133
2016-01-28 14:12:10 abarry set messages: + msg259131
2016-01-28 14:05:35 steve.dower set messages: + msg259129
2016-01-28 09:42:57 vstinner set messages: + msg259113
2016-01-28 09:41:03 vstinner set files: + gethostbyaddr_encoding-3.patch
messages: + msg259112
2016-01-28 09:29:11 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg259111
2016-01-28 08:49:24 vstinner set files: + gethostbyaddr_encoding-2.patch
2016-01-28 08:48:27 vstinner set messages: + msg259108
2016-01-28 07:55:38 eryksun set nosy: + eryksun
messages: + msg259106
2016-01-28 05:23:51 abarry set messages: + msg259092
2016-01-28 02:34:48 steve.dower set messages: + msg259085
2016-01-28 00:59:13 vstinner set files: + gethostbyaddr_encoding.patch
keywords: + patch
2016-01-28 00:58:50 vstinner create