classification
Title: socket.gethostbyname incorrectly parses ip
Type: behavior Stage: resolved
Components: Extension Modules, Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, maker, neologix, pitrou, python-dev, santoso.wijaya, vstinner
Priority: normal Keywords: needs review, patch

Created on 2012-10-11 19:31 by maker, last changed 2013-09-14 12:06 by neologix. This issue is now closed.

Files
File name Uploaded Description Edit
gethostbyname_whitespace.patch santoso.wijaya, 2012-10-11 22:07 review
issue16201.patch maker, 2012-10-12 08:13 review
issue16201.1.patch maker, 2012-10-13 08:59 review
issue16201.2.patch maker, 2012-10-13 09:44 review
issue16201.3.patch maker, 2012-10-13 23:27 review
parse_inet.diff neologix, 2013-08-29 15:30 review
parse_inet-1.diff neologix, 2013-09-03 18:54 review
Messages (19)
msg172682 - (view) Author: Michele Orrù (maker) * Date: 2012-10-11 19:31
Buggy due to the use of scanf at Modueles/socketmodule.c:868
Broken from python2.7 to tip on my machine (GNU/Linux)

>>> import socket
[64481 refs]
>>> socket.gethostbyname('4294967306.4294967296.4294967296.1')
'10.0.0.1'
[67764 refs]
>>> socket.gethostbyname('192.168.1.1  ')
'192.168.1.1'
[67764 refs]
>>> socket.gethostbyname('  192.168.1.1')
'192.168.1.1'
[67764 refs]
>>> socket.gethostbyname('  192.168.1.1  ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
[67805 refs]
msg172700 - (view) Author: Santoso Wijaya (santoso.wijaya) * Date: 2012-10-11 22:07
Attaching patch to trim leading and trailing whitespaces prior to processing.

Incidentally, this also means:

>>> socket.gethostbyname('')
'0.0.0.0'
>>> socket.gethostbyname(' ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
socket.gaierror: [Errno -5] No address associated with hostname


But I'm not sure if we're okay with changing that semantics, so I left it be in this patch.
msg172723 - (view) Author: Michele Orrù (maker) * Date: 2012-10-12 07:34
> Attaching patch to trim leading and trailing whitespaces prior to
> processing.
Note that tests are incorrect: the parsing is of the form %d.%d.%d.%d%c, so the parser should accept trailing spaces. That's the same for ping iirc: 

$ ping "192.168.1.1    "
PING 192.168.1.1     (192.168.1.1) 56(84) bytes of data.

--- 192.168.1.1     ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

$ ping " 192.168.1.1"
ping: unknown host  192.168.1.1


I am trying to get out with a simple parser inside the function, probabl y strtol() is the way. 
 Since Python's C API does not provide any sscanf wrapper, I thought about adding a new one. But given that, AFAIS it appears just two times over the entire sourcecode, there is no need IMO for exporting a new one right now. What is your opinion?
msg172787 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-10-12 21:59
> Buggy due to the use of scanf at Modueles/socketmodule.c:868

I don't think so. The following test fails because sscanf() returns 5 instead of 4:

    if (sscanf(name, "%d.%d.%d.%d%c", &d1, &d2, &d3, &d4, &ch) == 4 && ...)

So '192.168.1.1  ' is passed to getaddrinfo().

If you consider that '192.168.1.1  ' is an invalid name, you should report the issue to the vendor of the C library of your OS.

--

On Linux, the "issue" is also present in inet_ntoa():

$ python
Python 2.7.3 (default, Jul 24 2012, 11:41:40) 
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux2
>>> import ctypes
>>> buffer=ctypes.create_string_buffer(128)
>>> libc=ctypes.cdll.LoadLibrary("libc.so.6")

>>> libc.inet_aton(b"127.0.0.1", ctypes.byref(buffer))
1
>>> buffer.raw[:4]
'\x7f\x00\x00\x01'

>>> libc.inet_aton(b"127.0.0.2   ", ctypes.byref(buffer))
1
>>> buffer.raw[:4]
'\x7f\x00\x00\x02'

The source code of the inet_aton() function of the GNU libc:

http://sourceware.org/git/?p=glibc.git;a=blob;f=resolv/inet_addr.c;h=144b87a74c1aef62779862e78e5f87e18e66ee9e;hb=HEAD#l108

Website of the GNU libc:
http://www.gnu.org/software/libc/
msg172801 - (view) Author: Michele Orrù (maker) * Date: 2012-10-13 08:59
>
>> Buggy due to the use of scanf at Modueles/socketmodule.c:868
>
> I don't think so. The following test fails because sscanf() returns 5
> instead of 4:
You are right, sorry for didn't notice. 


> If you consider that '192.168.1.1  ' is an invalid name, you should
> report the issue to the vendor of the C library of your OS.
Yep, I'm going to ask on #glibc if there is any specific reason for accepting trailing spaces. But this concerns the specific underlying implementation if getaddrinfo(), so IMO we should just update the int overflow issue for now, maintaining the coherence with other utilities having the same "problem". For this, I am attaching a new patch which which just do not test for trailing whitespaces. 

-- 
ù
msg172802 - (view) Author: Michele Orrù (maker) * Date: 2012-10-13 09:44
Reviewed with Ezio.
msg172825 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-10-13 19:55
I don't think uint32_t exists everywhere, you should use "unsigned int" instead.
msg172840 - (view) Author: Michele Orrù (maker) * Date: 2012-10-13 23:27
Updated && tested on GNU/Linux (gcc).
msg196304 - (view) Author: Michele Orrù (maker) * Date: 2013-08-27 18:15
Ping.
msg196452 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-08-29 10:07
Stupid question: why use scanf()/strtol() to parse an IP address, and not simply inet_pton()/inet_aton(), which are made precisely for that purpose?
inet_pton() is POSIX (it was introduced at the same time as getaddrinfo(), which is used unconditionally a couple lines below). If it's not available, we could use inet_aton() as fallback (which is amusingly not POSIX, but I doubt we'll find a platform with neither inet_pton() nor inet_aton()).

Using inet_pton() would have the added advantage of working with IPv6 addresses: right now, a numeric IPv6 address is always resolved by getaddrinfo(), which results in a performance overhead:

$ python -m timeit -s "from socket import *; s = socket(AF_INET, SOCK_DGRAM); DATA = 'hello'" "s.sendto(DATA, ('127.0.0.1', 42))"
100000 loops, best of 3: 6.86 usec per loop
$ python -m timeit -s "from socket import *; s = socket(AF_INET6, SOCK_DGRAM); DATA = 'hello'" "s.sendto(DATA, ('::1', 42))"
10000 loops, best of 3: 24.2 usec per loop
msg196454 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-08-29 10:12
Note that for IPv6 addresses, we should just extract manually the
scope ID ('%' prefix) if present.
msg196460 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-08-29 14:13
Here's a patch using inet_pton() if available, otherwise inet_aton()
if available, otherwise fallback to getaddrinfo().
This should work on every platform, but if a platform has neither
inet_pton() nor inet_aton(), calling getaddrinfo() will incur an
overhead.
All Unices have either inet_aton() or inet_pton().
For Windows >= Vista, inet_pton() is available.
I'm not sure whether XP has one of them: if not, then we might keep
the hand-parsing as a fallback.

This adds support for numeric IPv6 addresses (if inet_pton() is
available), which notably speeds up sendto() on IPv6 sockets, and
gives a tiny speedup for IPv4 sockets:
before:
$ ./python -m timeit -s "import socket; s =
socket.socket(socket.AF_INET, socket.SOCK_DGRAM); DATA = b'x'"
"s.sendto(DATA, ('127.0.0.1', 4242))"
100000 loops, best of 3: 13.2 usec per loop
$ ./python -m timeit -s "import socket; s =
socket.socket(socket.AF_INET6, socket.SOCK_DGRAM); DATA = b'x'"
"s.sendto(DATA, ('::1', 4242))"
10000 loops, best of 3: 32.1 usec per loop
after:
$ ./python -m timeit -s "import socket; s =
socket.socket(socket.AF_INET, socket.SOCK_DGRAM); DATA = b'x'"
"s.sendto(DATA, ('127.0.0.1', 4242))"
100000 loops, best of 3: 11.5 usec per loop
$ ./python -m timeit -s "import socket; s =
socket.socket(socket.AF_INET6, socket.SOCK_DGRAM); DATA = b'x'"
"s.sendto(DATA, ('::1', 4242))"
100000 loops, best of 3: 12.8 usec per loop

Note that if the IPv6 address contains a scope ID ('%' suffix), then
we fallback to getaddrinfo() which handles it properly (the scope ID
can be an interface index but also an interface name).
msg196462 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-08-29 15:30
> For Windows >= Vista, inet_pton() is available.
> I'm not sure whether XP has one of them: if not, then we might keep
> the hand-parsing as a fallback.

Apparently, XP doesn't have inet_aton()...

So I changed the code to use inet_pton() if available, otherwise use inet_addr(), which is available everywhere (the only gotcha is that it doesn't handle "255.255.255.255", so we special-case it like "<broadcast>").

We thus get rid of the hand-parsing completely, and get a more efficient  handling of numeric IPv4 addresses, and a much more efficient handling for numeric IPv6 addresses (see above benchmarks).
msg196465 - (view) Author: Michele Orrù (maker) * Date: 2013-08-29 17:13
Cool, thanks for pointing the correct library function. 

Tested on my machine - Debian 3.9 x86_64 GNU/Linux.
msg196863 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-09-03 18:54
Here's an updated patch with new tests.

It passes the regression test, and yields noticable performance improvements for IPv6:
before:
$ ./python -m timeit -s "import socket; s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM); DATA = b'hello'" "s.sendto(DATA, ('127.0.0.1', 4242)) "
10000 loops, best of 3: 28 usec per loop
$ ./python -m timeit -s "import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM); DATA = b'hello'" "s.sendto(DATA, ('::1', 4242)) "
10000 loops, best of 3: 59.1 usec per loop
after:
$ ./python -m timeit -s "import socket; s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM); DATA = b'hello'" "s.sendto(DATA, ('127.0.0.1', 4242)) "
10000 loops, best of 3: 24.8 usec per loop
$ ./python -m timeit -s "import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM); DATA = b'hello'" "s.sendto(DATA, ('::1', 4242)) "
10000 loops, best of 3: 26.7 usec per loop

Note that the tests aren't as good as I'd like them to, because apparently some systems (e.g. Solaris) have broken gethostbyaddr()...
But it's cleaner, more robust and more efficient.
msg196897 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-04 08:58
Perhaps you could add tests for the broadcast special cases?
msg197622 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-09-13 17:54
New changeset 540a9c69c2ea by Charles-François Natali in branch 'default':
Issue #16201: socket: Use inet_pton()/inet_addr() instead of ad-hoc parsing for
http://hg.python.org/cpython/rev/540a9c69c2ea
msg197693 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-09-14 07:22
Alright, I committed it to default.

I don't think it should go into 2.7 or 3.3, since it's more an
enhancement than a bug fix.
msg197705 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-09-14 12:06
Closing, thanks for the report!
History
Date User Action Args
2013-09-14 12:06:53neologixsetstatus: open -> closed
versions: - Python 2.7, Python 3.3
messages: + msg197705

resolution: fixed
stage: patch review -> resolved
2013-09-14 07:22:59neologixsetmessages: + msg197693
2013-09-13 17:54:21python-devsetnosy: + python-dev
messages: + msg197622
2013-09-04 08:58:14pitrousetversions: - Python 3.1, Python 3.2, Python 3.5
2013-09-04 08:58:04pitrousetmessages: + msg196897
2013-09-03 18:54:17neologixsetfiles: + parse_inet-1.diff

messages: + msg196863
2013-08-29 17:13:31makersetmessages: + msg196465
2013-08-29 15:30:45neologixsetfiles: - parse_inet.diff
2013-08-29 15:30:38neologixsetkeywords: + needs review
files: + parse_inet.diff
messages: + msg196462

stage: patch review
2013-08-29 14:13:59neologixsetfiles: + parse_inet.diff

messages: + msg196460
2013-08-29 10:12:56neologixsetmessages: + msg196454
2013-08-29 10:07:44neologixsetnosy: + neologix
messages: + msg196452
2013-08-29 09:34:06neologixlinkissue18806 superseder
2013-08-27 18:15:12makersetmessages: + msg196304
2012-10-13 23:27:46makersetfiles: + issue16201.3.patch

messages: + msg172840
2012-10-13 19:55:13pitrousetnosy: + pitrou
messages: + msg172825
2012-10-13 09:44:59makersetfiles: + issue16201.2.patch

messages: + msg172802
2012-10-13 08:59:24makersetfiles: + issue16201.1.patch

messages: + msg172801
2012-10-12 21:59:41vstinnersetnosy: + vstinner
messages: + msg172787
2012-10-12 08:13:52makersetfiles: + issue16201.patch
2012-10-12 07:34:16makersetnosy: + ezio.melotti
messages: + msg172723
2012-10-11 22:07:21santoso.wijayasetfiles: + gethostbyname_whitespace.patch
keywords: + patch
messages: + msg172700
2012-10-11 20:55:48santoso.wijayasetnosy: + santoso.wijaya

type: behavior
components: + Extension Modules
versions: + Python 2.7
2012-10-11 19:31:02makercreate