
classification
Title: urllib.request.urlopen sends POST data as query string
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: apetresc, henrik242
Priority: normal Keywords:

Created on 2020-03-06 11:45 by henrik242, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (10)
msg363502 - Author: (henrik242) Date: 2020-03-06 11:45
curl correctly posts data to Solr:

$ curl -v 'http://solr.example.no:12699/solr/my_coll/update?commit=true' \
--data '<add><doc><field name="key">KEY__9927.1</field><field name="value">\
{"result":0,"jobId":"9459695","jobNumber":"9927.1"}</field></doc></add>'

The solr query log says:

[20200306T111354,131] [my_coll_shard1_replica_n85]  webapp=/solr path=/update params={commit=true} status=0 QTime=96

I'm trying to do the same thing with Python:

>>> import urllib.request
>>> data='<add><doc><field name="key">KEY__9927.1</field><field name="value">{"result":0,"jobId":"9459695","jobNumber":"9927.1"}</field></doc></add>'
>>> url='http://solr.example.no:12699/solr/my_coll/update?commit=true'
>>> req = urllib.request.Request(url=url, data=data.encode('utf-8'), method='POST')
>>> res = urllib.request.urlopen(req)

But now the solr query log shows that the POST data has been added to the query param string:

[20200306T112358,780] [my_coll_shard1_replica_n87]  webapp=/solr path=/update params={commit=true&<add><doc><field+name="key">KEY__9927.1</field><field+name%3D"value">{"result":0,"jobId":"9459695","jobNumber":"9927.1"}</field></doc></add>} status=0 QTime=30

What is happening here?

$ python3 -VV
Python 3.7.6
(default, Dec 30 2019, 19:38:26) 
[Clang 11.0.0 (clang-1100.0.33.16)]
msg363503 - Author: Adrian Petrescu (apetresc) Date: 2020-03-06 13:14
This is not a bug, you've just misunderstood the urllib API. If you want to pass POST data as a payload, it's the second `data` parameter to `urlopen`: https://bugs.python.org/?@action=confrego&otk=KX9AqsI0JnOLkplIY1AGKXAmDKa38COy
msg363504 - Author: Adrian Petrescu (apetresc) Date: 2020-03-06 13:16
(Oops, that was a bad paste! I meant this link: https://docs.python.org/2/library/urllib.html#urllib.urlopen)
msg363508 - Author: (henrik242) Date: 2020-03-06 14:04
But why can't the payload be in the Request object?

From the api docs:

    class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

    data must be an object specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data. The supported object types include bytes, file-like objects, and iterables. 

https://docs.python.org/3.7/library/urllib.request.html#urllib.request.Request
msg363510 - Author: (henrik242) Date: 2020-03-06 14:05
Further: 

method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise.
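
The documented default can be checked without touching the network. The following is a minimal sketch, assuming only the standard library; the `<add/>` payload is a placeholder, not the real Solr document. It suggests that passing `data` switches the method to POST and leaves the URL, including its query string, unchanged.

```python
import urllib.request

# URL taken from the report above; nothing is sent, the Request is only built.
url = "http://solr.example.no:12699/solr/my_coll/update?commit=true"

no_body = urllib.request.Request(url)
with_body = urllib.request.Request(url, data=b"<add/>")  # placeholder payload
explicit = urllib.request.Request(url, data=b"<add/>", method="PUT")

# get_method() falls back to GET without data and POST with data,
# unless a method was passed explicitly.
print(no_body.get_method())
print(with_body.get_method())
print(explicit.get_method())

# The payload is kept in .data; it is not appended to the URL.
print(with_body.full_url == url)
print(with_body.data)
```

If this holds, the Request object itself is not moving the body into the query string, which points the investigation at what the server does with the request.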
msg363511 - Author: (henrik242) Date: 2020-03-06 14:15
Also, it seems that urllib.request.urlopen just creates a similar Request object when given a data parameter:


    def open(self, fullurl, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
        # accept a URL or a Request object
        if isinstance(fullurl, str):
            req = Request(fullurl, data)
        else:
            req = fullurl
            if data is not None:
                req.data = data


From https://github.com/python/cpython/blob/3.7/Lib/urllib/request.py#L507 via https://github.com/python/cpython/blob/3.7/Lib/urllib/request.py#L222
msg363513 - Author: (henrik242) Date: 2020-03-06 14:25
The following gives the same failing result too :(

>>> import urllib.request
>>> data = '<add><doc><field name="key">KEY__9927.1</field><field name="value">{"result":0,"jobId":"9459695","jobNumber":"9927.1"}</field></doc></add>'
>>> url = 'http://solr.example.no:12699/solr/my_coll/update?commit=true'
>>> res = urllib.request.urlopen(url, data.encode('utf-8'))

I guess I'll have to whip out Wireshark and see what's going on.
msg363515 - Author: (henrik242) Date: 2020-03-06 14:44
Here's the wireshark output.  It seems that urllib adds a "Connection: close" which curl doesn't.  Solr doesn't seem to like that.


Curl message:

POST /solr/my_coll/update?commit=true HTTP/1.1
Host: solr.example.no:12699
User-Agent: curl/7.64.1
Accept: */*
Content-Length: 138
Content-Type: application/x-www-form-urlencoded

<add><doc><field name="key">KEY__9927.1</field><field name="value">{"result":0,"jobId":"9459695","jobNumber":"9927.1"}</field></doc></add>


Python message:

POST /solr/my_coll/update?commit=true HTTP/1.1
Accept-Encoding: identity
Content-Type: application/x-www-form-urlencoded
Content-Length: 138
Host: solr.example.no:12699
User-Agent: Python-urllib/3.7
Connection: close

<add><doc><field name="key">KEY__9927.1</field><field name="value">{"result":0,"jobId":"9459695","jobNumber":"9927.1"}</field></doc></add>
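
A capture like the one above can also be reproduced without Wireshark. The sketch below is an illustration, not part of the original investigation: it starts a throwaway `http.server` on localhost (the `CaptureHandler` class, the `captured` dict, and the shortened payload are hypothetical names and data), posts to it with urllib, and prints what arrived. Proxies are disabled so the request stays local.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = {}  # filled in by the handler with what the server received


class CaptureHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for Solr: records the request and returns 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        captured["path"] = self.path
        captured["headers"] = self.headers  # case-insensitive lookups
        captured["body"] = self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the OS pick a free port; serve exactly one request.
server = HTTPServer(("127.0.0.1", 0), CaptureHandler)
thread = threading.Thread(target=server.handle_request, daemon=True)
thread.start()

url = "http://127.0.0.1:%d/solr/my_coll/update?commit=true" % server.server_port
payload = b'<add><doc><field name="key">KEY__9927.1</field></doc></add>'

# An empty ProxyHandler bypasses any proxy configured in the environment.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(urllib.request.Request(url, data=payload), timeout=5) as res:
    status = res.status

thread.join(timeout=5)
server.server_close()

print(status)
print(captured["path"])
print(captured["headers"].get("Content-Type"))
print(captured["headers"].get("Connection"))
print(captured["body"] == payload)
```

Run this way, the body arrives as the request body and the path carries only `commit=true`, matching the captured Python message. The Content-Type is the form-encoded default that urllib adds when a body is sent without an explicit Content-Type header.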
msg363546 - Author: (henrik242) Date: 2020-03-06 20:34
Root cause for this seems to be https://bugs.python.org/issue12849
msg363696 - Author: (henrik242) Date: 2020-03-09 06:58
Solved! 

The problem was in Solr, which has special handling of POSTed data when the User-Agent starts with 'curl/': https://github.com/apache/lucene-solr/blob/40661489cd590947f513e553a20707d0c82b82e5/solr/core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L782

In all other cases Solr expects the Content-Type to be text/xml.  Setting that with urllib.request makes the request work fine:

>>> req = urllib.request.Request(url, data.encode('utf-8'), headers={'Content-Type': 'text/xml'})
>>> res = urllib.request.urlopen(req)
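
For reference, the fix can be inspected offline before sending anything. This is a minimal sketch that only builds the Request and does not contact Solr. It assumes the same `url` and `data` as earlier in the thread. Note that `Request` stores header names via `str.capitalize()`, so lookups use `Content-type`.

```python
import urllib.request

url = "http://solr.example.no:12699/solr/my_coll/update?commit=true"
data = ('<add><doc><field name="key">KEY__9927.1</field><field name="value">'
        '{"result":0,"jobId":"9459695","jobNumber":"9927.1"}</field></doc></add>')

req = urllib.request.Request(
    url,
    data=data.encode("utf-8"),
    headers={"Content-Type": "text/xml"},  # overrides the form-encoded default
)

# Header names are normalized with str.capitalize(), hence "Content-type".
print(req.get_method())
print(req.has_header("Content-type"))
print(req.get_header("Content-type"))

# To actually send it (requires a reachable Solr instance):
# res = urllib.request.urlopen(req)
```

Because the header is set explicitly, urllib does not substitute its default `application/x-www-form-urlencoded` value when the request is sent.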

A big thanks to https://stackoverflow.com/a/60586102/13365 for figuring this out
History
Date User Action Args
2022-04-11 14:59:27 admin set github: 84056
2020-03-09 06:58:17 henrik242 set status: open -> closed; resolution: not a bug; messages: + msg363696; stage: resolved
2020-03-06 20:34:35 henrik242 set messages: + msg363546
2020-03-06 14:44:41 henrik242 set messages: + msg363515
2020-03-06 14:25:57 henrik242 set messages: + msg363513
2020-03-06 14:15:27 henrik242 set messages: + msg363511
2020-03-06 14:05:44 henrik242 set messages: + msg363510
2020-03-06 14:04:23 henrik242 set messages: + msg363508
2020-03-06 13:16:04 apetresc set messages: + msg363504
2020-03-06 13:14:20 apetresc set nosy: + apetresc; messages: + msg363503
2020-03-06 11:45:28 henrik242 create