Windows: socket.gethostbyaddr(name) fails for non-ASCII hostname #70415

Closed
vstinner opened this issue Jan 28, 2016 · 27 comments
Labels
3.8 only security fixes 3.9 only security fixes 3.10 only security fixes OS-windows stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Member

BPO 26227
Nosy @pfmoore, @tjguk, @ezio-melotti, @zware, @serhiy-storchaka, @eryksun, @zooba, @Vgr255, @miss-islington
PRs
  • bpo-26227: Fixes decoding of host names on Windows from ANSI instead of UTF-8 #25510
  • [3.9] bpo-26227: Fixes decoding of host names on Windows from ANSI instead of UTF-8 (GH-25510) #25512
  • [3.8] bpo-26227: Fixes decoding of host names on Windows from ANSI instead of UTF-8 (GH-25510) #25513
Files
  • gethostbyaddr_encoding.patch
  • gethostbyaddr_encoding-2.patch
  • gethostbyaddr_encoding-3.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/zooba'
    closed_at = <Date 2021-04-21.23:23:53.015>
    created_at = <Date 2016-01-28.00:58:50.855>
    labels = ['type-bug', '3.8', 'OS-windows', '3.10', 'library', 'expert-unicode', '3.9']
    title = 'Windows: socket.gethostbyaddr(name) fails for non-ASCII hostname'
    updated_at = <Date 2021-04-21.23:43:43.773>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2021-04-21.23:43:43.773>
    actor = 'miss-islington'
    assignee = 'steve.dower'
    closed = True
    closed_date = <Date 2021-04-21.23:23:53.015>
    closer = 'steve.dower'
    components = ['Library (Lib)', 'Unicode', 'Windows']
    creation = <Date 2016-01-28.00:58:50.855>
    creator = 'vstinner'
    dependencies = []
    files = ['41734', '41740', '41741']
    hgrepos = []
    issue_num = 26227
    keywords = ['patch', '3.6regression', '3.7regression', '3.8regression', '3.9regression', '3.10regression']
    message_count = 27.0
    messages = ['259078', '259085', '259092', '259106', '259108', '259111', '259112', '259113', '259129', '259131', '259133', '259134', '259135', '259136', '259137', '259138', '259208', '259212', '358409', '363709', '363710', '363712', '364114', '364116', '391559', '391563', '391564']
    nosy_count = 12.0
    nosy_names = ['paul.moore', 'tim.golden', 'ezio.melotti', 'python-dev', 'zach.ware', 'serhiy.storchaka', 'eryksun', 'steve.dower', 'abarry', 'miss-islington', 'williamdias', 'Владимир Мартьянов']
    pr_nums = ['25510', '25512', '25513']
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue26227'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    @vstinner
    Member Author

    On Windows, socket.gethostbyaddr() must decode the hostname from the ANSI code page, not from UTF-8. See for example this issue:
    https://bugs.python.org/issue26226#msg259077

    Attached patch changes the socket module to decode hostnames from the ANSI code page on Windows.
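
To make the failure concrete, here is a small sketch of what happens when bytes produced under an ANSI code page are decoded as UTF-8. The host name and the choice of cp1252 are assumptions for illustration only; the real code page depends on the Windows system locale.

```python
# Assumption: the machine uses code page 1252 and is named "Émanuel-PC".
hostname = "Émanuel-PC"
raw = hostname.encode("cp1252")  # bytes as an ANSI API would hand them back

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xC9 ("É" in cp1252) opens a UTF-8 sequence that is never completed.
    utf8_ok = False

# Decoding with the matching code page recovers the original name.
ansi_decoded = raw.decode("cp1252")
```

Decoding with the code page that produced the bytes is the behaviour the patch aims for on Windows.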

    See also issues bpo-9377, bpo-16652 and bpo-5004.

    @zooba
    Member

    zooba commented Jan 28, 2016

    Might be nice to switch the socket APIs to the Unicode ones universally. That would also clear up a range of deprecation warnings on build.

    @Vgr255
    Mannequin

    Vgr255 mannequin commented Jan 28, 2016

    FWIW this patch doesn't fix the test_httpservers failure (or any other) in bpo-26226

    @eryksun
    Contributor

    eryksun commented Jan 28, 2016

    The patch is missing the "errors" parameter of PyUnicode_DecodeLocale. But it should call PyUnicode_DecodeMBCS instead. In the "C" locale, PyUnicode_DecodeLocale is Latin-1 because the CRT mbstowcs just casts the values to wchar_t.
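
The Latin-1 remark can be illustrated from Python: casting each byte to a wide character is equivalent to decoding as Latin-1, which disagrees with a typical ANSI code page for bytes in the 0x80-0x9F range. This is only a sketch, and cp1252 is an assumed ANSI code page.

```python
raw = b"\x80uro"

as_latin1 = raw.decode("latin-1")  # the byte value becomes the code point U+0080
as_cp1252 = raw.decode("cp1252")   # 0x80 is the euro sign in cp1252

# Latin-1 decoding is a plain byte-to-code-point cast.
cast = "".join(chr(b) for b in raw)
```

The two decodings differ on the first character, which is why a cast-based decode is not a substitute for the ANSI code page.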

    socket_getnameinfo also decodes as UTF-8:

        >>> socket.getnameinfo(('127.0.0.1', 20), 0)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte

    Steve, does your suggestion include reimplementing socket.gethostbyaddr and socket.gethostbyname_ex using GetNameInfoW and GetAddrInfoW? gethostbyaddr and gethostbyname are deprecated and lack a Unicode implementation.

    @vstinner
    Member Author

    The patch is missing the "errors" parameter of PyUnicode_DecodeLocale.

    Whoops, I shouldn't write patches in the middle of the night :-) Fortunately, I didn't push it :-) PyUnicode_DecodeLocale() should only be used when the encoding depends on the *current* value of LC_CTYPE.

    Here, the ANSI code page is fine, and so PyUnicode_DecodeFSDefault() should be used instead.

    socket_getnameinfo also decodes as UTF-8

    Hum, let me try a new patch. It decodes the hostname from the ANSI code page on Windows for:

    • socket.getnameinfo()
    • socket.gethostbyaddr()
    • socket.gethostbyname_ex()

    The behaviour on other platforms is unchanged.

    @serhiy-storchaka
    Member

    Added comments on Rietveld.

    @vstinner
    Member Author

    Added comments on Rietveld.

    Crap. It's easy to miss a compilation error on extensions :-/

    I used "make && ./python -m test -v test_socket" to validate gethostbyaddr_encoding-2.patch and it succeeded.

    Maybe we should change setup.py to *fail* if an extension fails to compile?

    The new patch should have fewer typos :-) I also checked for reference leaks using ./python -m test -R 3:3 test_socket => no leak.

    Why not use PyUnicode_DecodeFSDefault on all platforms? It is used in
    gethostname() on Unix.

    I don't know which encoding is the best choice on UNIX. I prefer to move step by step and fix an obvious bug on Windows that is blocking Émanuel (see his issue bpo-26226). (Émanuel uses Émanuel-PC as his hostname, a non-ASCII hostname ;-))

    I guess that UTF-8 works in most cases on UNIX, whereas using the locale encoding can introduce regressions if the hostname is non-ASCII. For example, decoding a non-ASCII hostname would fail with LANG=C, which forces an ASCII locale encoding.

    The issue bpo-9377 proposes a more advanced code to choose the encoding to decode hostnames. Sorry, I didn't follow this issue recently, so I don't know if it proposes to use surrogateescape and/or IDNA.
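
For readers unfamiliar with the options mentioned here, a rough Python-level sketch of strict ASCII decoding, surrogateescape and IDNA follows. The host name is hypothetical, and this does not claim to reflect what bpo-9377 actually proposes.

```python
hostname = "héça"  # hypothetical non-ASCII host name
raw = hostname.encode("utf-8")

# Strict ASCII decoding (the LANG=C case) rejects the name outright.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# surrogateescape carries undecodable bytes through as lone surrogates,
# so the original bytes can be recovered later.
escaped = raw.decode("ascii", "surrogateescape")
roundtrip = escaped.encode("ascii", "surrogateescape")

# IDNA maps the name to an ASCII-only form made of "xn--" labels.
idna_bytes = hostname.encode("idna")
```

surrogateescape preserves the raw bytes without interpreting them, while IDNA produces a different, ASCII-only spelling of the name.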

    I prefer to discuss the encoding used on UNIX in a new issue (or better continue the existing discussion on issue bpo-9377?).

    @vstinner
    Member Author

    By the way, thanks for your reviews. Code review rocks ;-)

    @zooba
    Member

    zooba commented Jan 28, 2016

    I couldn't remember the names of the alternate functions Windows provides to do the encoding for you, but yes. There are socket APIs there that do encoding and handle memory allocation more safely.

    Apart from bugs like this, it's not really urgent and it requires someone motivated to do it. Might be a good project for someone at the PyCon sprints.

    @Vgr255
    Mannequin

    Vgr255 mannequin commented Jan 28, 2016

    Yes, it's not all that urgent. And Victor's latest patch doesn't work, either :(

    I wonder if there's a way to (temporarily) modify the output of socket.gethostname() to be able to test such weird corner cases.
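
One possible approach, sketched here as an assumption rather than something proposed in this thread, is to patch the Python-level function with unittest.mock. The helper name hostname_under_test is hypothetical.

```python
import socket
from unittest import mock


def hostname_under_test(fake_name):
    """Return what socket.gethostname() reports while it is patched."""
    # The replacement lasts only for the duration of the with block.
    with mock.patch("socket.gethostname", return_value=fake_name):
        return socket.gethostname()


reported = hostname_under_test("héça")
```

Note that this only affects Python code calling socket.gethostname(). C-level lookups such as socket.gethostbyaddr() still see the real host name, so the decoding path discussed in this issue is not exercised by such a patch.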

    @vstinner
    Member Author

    Yes, it's not all that urgent. And Victor's latest patch doesn't work, either :(

    Could you please elaborate? The patch applies cleanly? You rebuild the socket module? Which error message do you get?

    @Vgr255
    Mannequin

    Vgr255 mannequin commented Jan 28, 2016

    Oh, sorry. The patch applies without any problem, then I re-compile everything and run, and the same error happens. I re-compiled just now to make double sure.

    @python-dev
    Mannequin

    python-dev mannequin commented Jan 28, 2016

    New changeset 0681f0a1fe6e by Victor Stinner in branch '3.5':
    Windows: Decode hostname from ANSI code page
    https://hg.python.org/cpython/rev/0681f0a1fe6e

    New changeset 26f6d8cc2749 by Victor Stinner in branch 'default':
    Merge 3.5: Issue bpo-26227
    https://hg.python.org/cpython/rev/26f6d8cc2749

    @vstinner
    Member Author

    Oh, sorry. The patch applies without any problem, then I re-compile everything and run, and the same error happens. I re-compiled just now to make double sure.

    I tested my patch on Windows. I called my computer héça (3 non-ASCII letters!). Without the patch, I get the UTF-8 decoding error, as expected. With the patch, it gets the nice "héça" Unicode string, correctly decoded. I tested socket.getfqdn().
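
A check along these lines could look like the sketch below. The helper name check_fqdn and its lookup parameter are hypothetical, added so the function can be exercised without a specially named machine.

```python
import socket


def check_fqdn(lookup=socket.getfqdn):
    """Return (ok, detail) for a host name lookup that may hit the decoding bug."""
    try:
        return True, lookup()
    except UnicodeDecodeError as exc:
        # The unpatched build surfaces the failure as a decoding error.
        return False, f"{exc.encoding}: {exc.reason}"
```

Calling check_fqdn() with no arguments on an affected machine would be expected to return a failure tuple before the fix and the decoded name after it.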

    My patch will not fix all your issues at once :-) In the issue bpo-26226, I saw at least 3 different bugs. But I'm now sure that my patch fixes a real bug, so I pushed it to Python 3.5 and default (3.6).

    Thanks for the bug report Emanuel ;-)

    @Vgr255
    Mannequin

    Vgr255 mannequin commented Jan 28, 2016

    If it worked for you, I assume it's fine and I probably did something wrong on my side. Thanks for the fix!

    @vstinner
    Member Author

    Steve:
    """
    I couldn't remember the names of the alternate functions Windows provides to do the encoding for you, but yes. There are socket APIs there that do encoding and handle memory allocation more safely.

    Apart from bugs like this, it's not really urgent and it requires someone motivated to do it. Might be a good project for someone at the PyCon sprints.
    """

    Yeah, using the native Windows API is better: it gives access to the full Unicode character set. But it requires spending time on the C code, and *I* am not interested in working on such a project.

    If you are motivated, please open a new issue for that. If you are not motivated, I'm not sure that it's worth opening a bug report. Using a hostname not encodable to the ANSI code page would probably cause serious issues (not in Python, but in other applications).

    When I played with filenames not encodable to the ANSI code page, I also got errors from multiple applications, whereas Python now uses the native Windows API to access the filesystem. So sometimes Python is better than some other applications, sometimes it's as good :-)

    @Vgr255
    Mannequin

    Vgr255 mannequin commented Jan 29, 2016

    For future reference, Victor's patch does fix it, I was checking the wrong thing when testing.

    @vstinner
    Member Author

    He he, no problem. Thanks again for the bug report. I'm surprised that
    nobody reported it before.

    @williamdias
    Mannequin

    williamdias mannequin commented Dec 15, 2019

    Shouldn't this issue be solved for Python 3.7.5? Or do I have to manually apply the patch?

    I have a Windows 8.1 x64 PC whose hostname contains special characters. When creating a socket, the gethostbyaddr() method raises a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 1.

    Let me know if you need more information.

    Thanks

    @williamdias williamdias mannequin added 3.7 (EOL) end of life type-crash A hard crash of the interpreter, possibly with a core dump labels Dec 15, 2019
    @ghost

    ghost commented Mar 9, 2020

    I have Python 3.7.4 and have the same exception:
    print (socket.gethostbyaddr("127.0.0.1"))
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 6: invalid continuation byte

    My OS is Win7 and my computer name contains Cyrillic characters.

    @zooba
    Member

    zooba commented Mar 9, 2020

    Looks like what was actually applied was changed to PyUnicode_DecodeFSDefault, which was later changed on Windows to be always UTF-8.

    They'll need to be normalised throughout Modules/socketmodule.c (I can't tell at a quick glance which need updating). We should also figure out some kind of test that can catch this sort of issue.

    Sorry for guiding you wrong a few years ago, Victor!

    @zooba zooba added 3.8 only security fixes 3.9 only security fixes labels Mar 9, 2020
    @zooba zooba reopened this Mar 9, 2020
    @zooba
    Member

    zooba commented Mar 9, 2020

    If someone wants to do the extra work to use the native Windows socket functions on Windows instead of the POSIX-ey wrappers, that should happen under a new issue. This bug is for a regression that we ought to fix back as far as we can.

    @vstinner
    Member Author

    sock_decode_hostname() of socketmodule.c currently uses PyUnicode_DecodeFSDefault() on Windows. PyUnicode_DecodeFSDefault() uses UTF-8 by default (PEP-529).

    I understand that the ANSI code page should be used instead of UTF-8.

    Would it work to use PyUnicode_DecodeLocale(name, "surrogatepass")? It's implemented with mbstowcs(), but I don't recall which encoding it uses on Windows.

    Or can we use PyUnicode_DecodeMBCS(name, strlen(name), "surrogatepass")?

    --

    I understand that setting PYTHONLEGACYWINDOWSFSENCODING environment variable to 1 should work around the issue.
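
To see which encodings are in play on a given machine, something like the following sketch may help. The function name encoding_report is hypothetical.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings relevant to this discussion."""
    return {
        # What PyUnicode_DecodeFSDefault() uses.
        "filesystem": sys.getfilesystemencoding(),
        "errors": sys.getfilesystemencodeerrors(),
        # The locale's preferred encoding, without changing the locale.
        "locale": locale.getpreferredencoding(False),
    }


report = encoding_report()
```

On Windows with PEP 529 in effect, the filesystem entry is expected to be utf-8, and mbcs when PYTHONLEGACYWINDOWSFSENCODING is set. The values on other platforms depend on the environment.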

    @zooba
    Member

    zooba commented Mar 13, 2020

    I think PyUnicode_DecodeMBCS(name, strlen(name), "surrogatepass") captures the intention better, and is less likely to break in the future (apart from all the ways it's currently broken :) )
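
At the Python level, the idea corresponds roughly to the sketch below. It is not the C code in socketmodule.c: decode_hostname is a hypothetical helper, and it uses the default strict error handling rather than "surrogatepass".

```python
import sys


def decode_hostname(raw: bytes) -> str:
    """Decode a host name roughly the way the discussion above suggests."""
    if sys.platform == "win32":
        # The "mbcs" codec exists only on Windows and maps to the ANSI code page.
        return raw.decode("mbcs")
    # Assumption: other platforms keep decoding as UTF-8, as before.
    return raw.decode("utf-8")


plain = decode_hostname(b"host-1")
```

Pure ASCII names decode identically on both branches; the difference only shows up for bytes at 0x80 and above.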

    You should be right about the workaround too.

    @eryksun eryksun added 3.10 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error and removed 3.7 (EOL) end of life type-crash A hard crash of the interpreter, possibly with a core dump labels Mar 28, 2021
    @zooba zooba self-assigned this Apr 21, 2021
    @zooba
    Member

    zooba commented Apr 21, 2021

    New changeset dc516ef by Steve Dower in branch 'master':
    bpo-26227: Fixes decoding of host names on Windows from ANSI instead of UTF-8 (GH-25510)
    dc516ef

    @zooba zooba closed this as completed Apr 21, 2021
    @miss-islington
    Contributor

    New changeset f7bc441 by Miss Islington (bot) in branch '3.8':
    bpo-26227: Fixes decoding of host names on Windows from ANSI instead of UTF-8 (GH-25510)
    f7bc441

    @miss-islington
    Contributor

    New changeset d8576b1 by Miss Islington (bot) in branch '3.9':
    bpo-26227: Fixes decoding of host names on Windows from ANSI instead of UTF-8 (GH-25510)
    d8576b1

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022