socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names #53623
Comments
The functions in the socket module which return host/domain
Some DNS resolvers do discard hostnames not matching the
Currently, names are converted to str objects using
127.0.0.2 € in /etc/hosts, I get:
Python 3.1.2 (r312:79147, Mar 23 2010, 19:02:21)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more
information.
>>> from socket import *
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]
Here, getaddrinfo() has encoded "€" to its corresponding ACE
PEP-383 can't be applied as-is here, since if the name happened
Surrogate characters are not allowed in IDNs, since they are
The returned strings could then be made to round-trip as
The patch try-surrogateescape-first.diff implements the above for
The patch still allows hostnames to be passed as bytes objects,
One problem with the surrogateescape scheme would be with
On the other hand, such code is probably broken as things stand,
An alternative approach might be to return all hostnames as bytes
[1] http://tools.ietf.org/html/rfc3491#section-5 |
I like the idea of using the PEP-383 for hostnames, but I don't understand the relation with IDNA (maybe because I don't know this encoding). +this leaves IDNA ASCII-compatible encodings in ASCII What is an "IDNA ASCII-compatible encoding"? -- ascii-surrogateescape.diff:
try-surrogateescape-first.diff:
|
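For readers unfamiliar with the term, an ASCII-compatible encoding (ACE) can be inspected with Python's standard `idna` codec. The following is only a minimal sketch; the variable names are illustrative and it uses the same "€" / "xn--lzg" pair that appears in this thread.

```python
# The built-in "idna" codec converts a Unicode label to its
# ASCII-compatible encoding (ACE) and back.
label = "€"
ace = label.encode("idna")
print(ace)  # b'xn--lzg'

# Decoding the ACE form recovers the original label.
decoded = ace.decode("idna")
print(decoded == label)
```

A name that is already in ACE form is plain ASCII, so it passes through any ASCII-based decoding unchanged.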
"Leaving IDNA ASCII-compatible encodings in ASCII form" is just preserving the existing behaviour (not doing IDNA decoding). See http://tools.ietf.org/html/rfc3490 and the docs for codecs -> encodings.idna ("xn--lzg" in the example is the ASCII-compatible encoding of "€", so if you look up that IP address, "xn--lzg" is returned with or without the patch). I'll look into your other comments. In the meantime, I've got one more patch, as the decoding of the nodename field in os.uname() also needs to be changed to match the other hostname-returning functions. This patch changes it to ASCII/surrogateescape, with the usual PEP-383 decoding for the other fields. |
OK, here are new versions of the original patches.
I've tweaked the docs to make clear that ASCII-compatible
You're right that PyUnicode_AsEncodedString() is the preferable
However, the PyBytes_Check isn't to check up on the codec, but to
I've also made the converter check for UnicodeEncodeError in the
I think the example I gave in the previous comment was also
In /etc/hosts (in UTF-8 encoding):
127.0.0.2 €
Without patches:
>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]
>>> '€'.encode("idna")
b'xn--lzg'
With patches:
>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('\udce2\udc82\udcac', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.2', 0)), (2, 2, 17, '', ('127.0.0.2', 0)), (2, 3, 0, '', ('127.0.0.2', 0))]
>>> '\udce2\udc82\udcac'.encode("idna")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 167, in encode
    result.extend(ToASCII(label))
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 38, in nameprep
    raise UnicodeError("Invalid character %r" % c)
UnicodeError: Invalid character '\udce2'
The exception at the end demonstrates why surrogateescape strings |
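The behaviour in the patched session can be reproduced without touching the socket module. This is a minimal sketch using only the standard codecs; the variable names are illustrative and it does not reflect how the patch itself is implemented in C.

```python
raw = b"\xe2\x82\xac"  # UTF-8 bytes for "€", as stored in /etc/hosts

# ASCII with the surrogateescape handler maps each non-ASCII byte
# to a lone surrogate in the range U+DC80..U+DCFF.
name = raw.decode("ascii", "surrogateescape")
print(ascii(name))  # '\udce2\udc82\udcac'

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("ascii", "surrogateescape")

# The idna codec refuses strings containing surrogates.
try:
    name.encode("idna")
    idna_ok = True
except UnicodeError:
    idna_ok = False
print(restored == raw, idna_ok)
```

Because the escaped string cannot be IDNA-encoded, it can only be turned back into bytes by the same ASCII/surrogateescape step, which is what lets the original name round-trip.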
I noticed that try-surrogateescape-first.diff missed out one of |
Is this patch in response to an actual problem, or a theoretical problem? If theoretical, I recommend to close it as "won't fix". I find it perfectly reasonable if Python's socket module gives an error if the hostname can't be clearly decoded. Applications that run into it as a result of gethostbyaddr should treat that as "no reverse name available". |
It's about environments, not applications - the local network may
There are two points here. One is that the decoding can fail; I The other is that the encoding and decoding are not symmetric - Attaching a refreshed version of try-surrogateescape-first.diff. |
Still, my question remains. Is it a theoretical problem (i.e. one
True. However, I think this is an acceptable regression,
Again, I fail to see the problem in this. It won't happen in |
Yes, I did reproduce the problem on my own system (Ubuntu 8.04). I reported this bug to save anyone who *is* in such an
That would be an improvement. The idea of the patches I posted |
The surrogateescape mechanism is a very hackish approach, and |
I don't see how a name resolution API returning non-ASCII bytes What is hackish is representing char * data as a Unicode string
But to be more explicit, that's like saying "if it hurts, get |
It's in violation of RFC 952 (slightly relaxed by RFC 1123).
Which I consider perfectly reasonable. The sysadmin should have |
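For reference, the letters-digits-hyphen rule from RFC 952 as relaxed by RFC 1123 can be sketched as a small check. The function name `is_ldh_hostname` is hypothetical, and the 253-character overall limit is an assumption based on the commonly cited figure rather than something stated in this thread.

```python
import re

# One label: letters, digits and hyphens only, no leading or trailing
# hyphen, between 1 and 63 characters.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_ldh_hostname(name: str) -> bool:
    """Return True if every dot-separated label follows the LDH rule."""
    if not name or len(name) > 253:  # assumed overall length limit
        return False
    if name.endswith("."):  # tolerate a single trailing root dot
        name = name[:-1]
    return all(_LABEL.fullmatch(label) for label in name.split("."))


print(is_ldh_hostname("xn--lzg"))   # an ACE name is plain LDH
print(is_ldh_hostname("Nötkötti"))  # non-ASCII letters are rejected
```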
That's bad if it's on the public Internet, but it's not an If you look at POSIX, you'll see that what getaddrinfo() and
It's not reasonable when addressed to a customer who might go |
I remain -1 on this change, until such a customer actually shows |
OK, I still think this issue should be addressed, but here is a patch for the part we agree on: that decoding should not return any Unicode characters except ASCII. |
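In Python terms, the behaviour described here amounts to a strict ASCII decode. The helper below is a hypothetical illustration only; the actual patch works at the C level inside the socket module.

```python
def decode_hostname_strict(raw: bytes) -> str:
    """Hypothetical helper: accept ASCII names, reject everything else."""
    return raw.decode("ascii")  # raises UnicodeDecodeError on bytes >= 0x80


accepted = decode_hostname_strict(b"xn--lzg")

try:
    decode_hostname_strict(b"\xe2\x82\xac")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(accepted, rejected)
```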
The rest of the issue could also be straightforwardly addressed by adding bytes versions of the name lookup APIs. Attaching a patch which does that (applies on top of decode-strict-ascii.diff). |
Oops, forgot to refresh the last change into that patch. This should fix it. |
platform.system() fails with UnicodeEncodeError on systems that have their computer name set to a name containing non-ASCII characters. The implementation of platform.system() uses socket.gethostname() at some point (see http://www.pasteall.org/16215 for a stack trace of such usage). A lot of our Blender users are not native English speakers and they set up their machines as they please, against RFCs or not. This currently breaks some code that uses platform.system() to check the system it's run on. The paste from above is from a user who has named his computer Nötkötti. It would be more than great if this error could be fixed. If another 3.1 release is planned, preferably for that. |
This trace is from a Windows system, where the platform module Given that os.uname() is a primary source of information about
If you'd like to try the surrogateescape patches, they ought to |
The failure of platform.uname is an independent bug. IMO, it shouldn't use socket.gethostname on Windows, but instead look at the COMPUTERNAME environment variable or call the GetComputerName API function. This is closer to what uname() does on Unix (i.e. retrieve the local machine name independent of DNS). I have created bpo-10097 for this bug. |
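A rough sketch of the lookup order suggested in this comment, written in pure Python. The helper name `local_machine_name` is hypothetical, and this is not the change that was made for bpo-10097; it only illustrates reading the machine name without a DNS lookup.

```python
import os
import socket


def local_machine_name() -> str:
    """Hypothetical helper: get the local machine name without using DNS."""
    if os.name == "nt":
        # On Windows, prefer the COMPUTERNAME environment variable.
        name = os.environ.get("COMPUTERNAME")
        if name:
            return name
    if hasattr(os, "uname"):
        # On Unix, uname() reports the node name directly.
        return os.uname().nodename
    # Last resort: fall back to the socket module.
    return socket.gethostname()


print(local_machine_name())
```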
I disagree with the proposal - it should return whatever
No no no. When Microsoft calls it the DNS name, they don't actually |
Martin v. Löwis wrote:
Just to clarify: I was proposing to use the
I don't understand why Martin insists that the MS "DNS name"
http://msdn.microsoft.com/en-us/library/ms724301(v=VS.85).aspx
As I said earlier: NetBIOS is being phased out in favor of
http://msdn.microsoft.com/en-us/library/ms724931(v=VS.85).aspx
Perhaps Martin could clarify why he insists on using the
PS: WinSock provides many other Unicode APIs for socket
On other platforms, I guess we'll just have to do some trial
# hostname l\303\266wis
Using the IDNA version doesn't help either:
# hostname xn--lwis-5qa
Python2 happily returns the host name, but fails to return
'l\xc3\xb6wis'
>>> socket.getfqdn()
'l\xc3\xb6wis' and 'xn--lwis-5qa'
>>> socket.getfqdn()
'xn--lwis-5qa' Just for comparison: # hostname newton and 'newton'
>>> socket.getfqdn()
'newton.egenix.internal' So at least on Linux, using non-ASCII hostnames doesn't really |
I think what's happening here is simply that you're setting
It works for me if I add "127.0.0.9 löwis.egenix.com löwis" to |
Looks like we have our first customer (bpo-10223). |
I just did an experiment on Windows 7. I used SetComputerNameEx to set the NetBIOS name (4) to "e2718", and the DNS name (5) to "π3141"; then I rebooted. This is on a system with windows-1252 as its ANSI code page (i.e. u"π"==u"\N{GREEK SMALL LETTER PI}" is not in the ANSI charset). After the reboot, I found
So my theory of how this all fits together is this:
In summary, I (now) think it's fine to return the Unicode version of the DNS name from gethostname on Windows. Re msg119271: the name "π3141" really has nothing to do with the DNS on my system. It doesn't occur in any DNS zone, nor could it possibly. It's unclear to me why Microsoft calls it the "DNS name". |
r85934 now uses GetComputerNameExW on Windows. |
Martin v. Löwis wrote:
The MS docs mention that setting the DNS name will adjust the NetBIOS name
They don't mention anything about the NetBIOS name encoding.
The DNS name of the Windows machine is the combination of the DNS host Of course, it's not particularly useful to set the DNS name to FWIW, you can do the same on a Linux box, i.e. setup the host name |
Martin v. Löwis wrote:
Thanks, Martin. Here's a similar discussion of the Windows approach (used in bzr): https://bugs.launchpad.net/bzr/+bug/256550/comments/6 This is what Solaris uses: http://developers.sun.com/dev/gadc/faq/locale.html#get-set (they require conversion to ASCII and using IDNA for non-ASCII I found this RFC draft on the topic: ASCII, UTF-8 and IDNA are happily mixed and matched. |
The Solaris case then is already supported, with no change required: if Solaris bans non-ASCII in the network configuration (or, rather, recommends using IDNA), then this will work fine with the current code. The Josefsson AI_IDN flag is irrelevant to Python, IMO: it treats byte names as locale-encoded, and converts them with IDNA. Python 3 users really should use Unicode strings in the first place for non-ASCII data, in which case socket.getaddrinfo uses IDNA anyway. However, it can't hurt to expose this flag if the underlying C library supports it. AI_CANONIDN might be interesting to implement, but I'd rather wait to see whether this finds RFC approval. In any case, undoing IDNA is orthogonal to this issue (which is about non-ASCII data returned from the socket API). If anything needs to be done on Unix, I think that the gethostname result should be decoded using the file system encoding; I then don't mind using surrogate escape there for good measure. This won't hurt systems that restrict host names to ASCII, and may do some good for systems that don't. |
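The file-system-encoding approach mentioned at the end of this comment corresponds to what `os.fsdecode()` and `os.fsencode()` do in pure Python. The following is a sketch under that assumption, using the "löwis" host name from earlier in the thread as UTF-8 bytes; the decoded text depends on the platform's file system encoding, so no particular output is claimed.

```python
import os

raw = b"l\xc3\xb6wis"  # host name bytes as a C API might return them

# os.fsdecode() uses the file system encoding; on POSIX it applies the
# surrogateescape handler, so undecodable bytes do not raise.
name = os.fsdecode(raw)

# os.fsencode() is the inverse step and gives the original bytes back.
round_tripped = os.fsencode(name)
print(round_tripped == raw)
```

For pure-ASCII host names the result is identical to a plain ASCII decode, which is why systems that restrict names to ASCII would not be affected.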
Martin v. Löwis wrote:
Wouldn't it be better to also attempt to decode the name using IDNA This would then also cover the Solaris case. |
I don't assume that - I merely point out that it clearly has no
Yes, but Linux (rightly) calls it the "hostname", not the "DNS name". |
Perhaps better - but incompatible. I don't see a way to have the |
The code in socketmodule.c currently compiles with suspicious warnings: socketmodule.c(3108) : warning C4047: 'function' : 'LPSTR' differs in levels of indirection from 'int'. Was GetComputerName() used instead of GetComputerNameExW()? |
Just to clarify here: there isn't anything special about http://www.kernel.org/doc/man-pages/online/pages/man5/nsswitch.conf.5.html It's an extensible system, so people can write their own modules |
I faced this issue on my own PC. For a Russian version of Windows the default PC name is ИВАН-ПК (C8 C2 C0 CD 2D CF CA in hex), and gethostbyaddr (CRT) returns it in exactly this form (encoded with the system locale cp1251, not UTF-8). So when PyUnicode_FromString is called, it expects its argument to be a UTF-8 encoded string and throws an error. |
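The mismatch described in this report can be checked directly with the bytes quoted in the comment. This is only an illustration of the two decodings, not of the C code path in socketmodule.c.

```python
# The hex bytes quoted in the report.
raw = bytes.fromhex("C8 C2 C0 CD 2D CF CA")

# Decoded with the system code page from the report, the name is readable.
print(raw.decode("cp1251"))  # ИВАН-ПК

# Decoded as UTF-8, the same bytes are rejected.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
print(utf8_error)
```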
Nick, which version of Python are you using? And which function are you running exactly? |
Originally I tried 3.2.2 (32bit), but I've just checked 3.2.3 and got the same.
from socket import gethostbyaddr
a = gethostbyaddr('127.0.0.1')
leads to:
Traceback (most recent call last):
  File "C:\Users\user\test\test.py", line 13, in <module>
    a = gethostbyaddr('127.0.0.1')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 5: invalid continuation byte
Or a more complex sample:
def main():
    import http.server
    port = 80
    handlerClass = http.server.SimpleHTTPRequestHandler
    srv = http.server.HTTPServer(("", port), handlerClass)
    srv.serve_forever()

if __name__ == "__main__":
    main()
Attempting to connect to the server leads to:
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 1156)
Traceback (most recent call last):
File "C:\Python32\lib\socketserver.py", line 284, in _handle_request_noblock
self.process_request(request, client_address)
File "C:\Python32\lib\socketserver.py", line 310, in process_request
self.finish_request(request, client_address)
File "C:\Python32\lib\socketserver.py", line 323, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "C:\Python32\lib\socketserver.py", line 637, in __init__
self.handle()
File "C:\Python32\lib\http\server.py", line 396, in handle
self.handle_one_request()
File "C:\Python32\lib\http\server.py", line 384, in handle_one_request
method()
File "C:\Python32\lib\http\server.py", line 657, in do_GET
f = self.send_head()
File "C:\Python32\lib\http\server.py", line 701, in send_head
self.send_response(200)
File "C:\Python32\lib\http\server.py", line 438, in send_response
self.log_request(code)
File "C:\Python32\lib\http\server.py", line 483, in log_request
self.requestline, str(code), str(size))
File "C:\Python32\lib\http\server.py", line 517, in log_message
(self.address_string(),
File "C:\Python32\lib\http\server.py", line 559, in address_string
return socket.getfqdn(host)
File "C:\Python32\lib\socket.py", line 355, in getfqdn
hostname, aliases, ipaddrs = gethostbyaddr(name)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 5: invalid continuation byte
P.S. My PC name is "USER-ПК" |
a4fd3dc74299 only fixed socket.gethostname(), not socket.gethostbyaddr(). |
For Windows versions that support it, we could use GetNameInfoW, available on XPSP2+, W2k3+ and Vista+. The questions then are: what to do about gethostbyaddr, and what to do about the general case? Since the problem appears to be specific to Windows, it might be appropriate to find a solution to just the Windows case, and ignore the general issue. For gethostbyaddr, decoding would then use CP_ACP. |
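As a rough Python-level analogue of decoding with CP_ACP, the sketch below decodes host name bytes with the ANSI code page. The helper `decode_ansi_hostname` and its `codepage` parameter are hypothetical; on Windows the `mbcs` codec stands in for the ANSI code page, and the non-Windows default is only there so the example can run elsewhere.

```python
import locale
import sys
from typing import Optional


def decode_ansi_hostname(raw: bytes, codepage: Optional[str] = None) -> str:
    """Hypothetical helper: decode host name bytes with the ANSI code page."""
    if codepage is None:
        if sys.platform == "win32":
            codepage = "mbcs"  # Python's codec for the Windows ANSI code page
        else:
            codepage = locale.getpreferredencoding(False)
    # "replace" avoids an exception on bytes the code page cannot map;
    # that is a design choice of this sketch, not something proposed above.
    return raw.decode(codepage, "replace")


# The bytes behind the "USER-ПК" report, decoded with an explicit code page.
print(decode_ansi_hostname(b"USER-\xcf\xca", "cp1251"))  # USER-ПК
```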
I'd add that this bug is very practical and can render a lot of software unusable/noisy/confusing on Windows, including Django (I discovered this bug when mentoring at Django Girls). The simple way to reproduce it is to take any Windows machine and set the regional settings to non-English (I've used Czech). You can verify that using "import locale; locale.getpreferredencoding()", which should display something else ("cp1250" in my case). Then, set the "name" (= hostname, in Windows settings) of the computer to anything containing a non-ASCII character (like "Didejo-noťas"). As Windows apparently encodes the hostname using its default encoding, it fails with
|
I've updated the ASCII/surrogateescape patches in line with return-ascii-surrogateescape-2015-06-25.diff incorporates the Python's existing code now has a fast path for ASCII-only strings ASCII/strict (existing code, fast path) rather than the previous ASCII/surrogateescape This doesn't change the behaviour of the patch, since IDNA always These patches would at least allow getfqdn() to work in Almad's (That isn't guaranteed in Unix environments of course, which is |
FYI I created the issue bpo-26227 to change the encoding used to decode hostnames on Windows. UTF-8 doesn't seem to be the right encoding: it fails on non-ASCII hostnames. I propose to use the ANSI code page. Sorry, I didn't read this issue, but it looks like IDNA isn't the right encoding to decode hostnames *on Windows*. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.