classification
Title: xmlrpclib does not handle some non-printable characters properly
Type: behavior Stage: resolved
Components: XML Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: xmlrpc library returns string which contain null ( \x00 )
View: 7727
Assigned To: Nosy List: eric.araujo, gyorkop, loewis
Priority: normal Keywords:

Created on 2010-10-11 16:16 by gyorkop, last changed 2016-01-20 10:20 by serhiy.storchaka. This issue is now closed.

Messages (8)
msg118376 - (view) Author: Peter Gyorko (gyorkop) Date: 2010-10-11 16:16
If I add a string to the response, which contains non-printable characters, the output will not be parsed by the most of the XML parsers (I tried with XML-RPC for PHP).

Here is my quick and dirty fix:

--- a/Lib/xmlrpclib.py
+++ b/Lib/xmlrpclib.py
@@ -165,9 +165,18 @@ def _decode(data, encoding, is8bit=re.compile("[\x80-\xff]").search):
     return data
 
 def escape(s, replace=string.replace):
-    s = replace(s, "&", "&")
-    s = replace(s, "<", "&lt;")
-    return replace(s, ">", "&gt;",)
+    res = ''
+    for char in s:
+        char_code = ord(char)
+        if (char_code < 32 and char_code not in (9, 10, 13)) or char_code > 126:
+            res += '\\x%02x' % ord(char)
+        else:
+            res += char
+
+    res = replace(res, "&", "&amp;")
+    res = replace(res, "<", "&lt;")
+    res = replace(res, ">", "&gt;")
+    return res
 
 if unicode:
     def _stringify(string):
msg118434 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-12 16:52
Thanks for the report.  Unfortunately, 2.6 only gets security fixes, not general bug fixes; can you tell if this applies to 2.7, 3.1 and 3.2?  If you have a small script that displays the problem, please attach it.
msg118439 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-12 17:17
The patch is incorrect. Even though this may let get these characters "through", the other end will have no clue that \x is meant as an escape.

Please face the ugly truth: XML (and hence XML-RPC) just does not support these characters, see http://www.w3.org/TR/REC-xml/#NT-Char

This is fixed slightly in XML 1.1 (which allows to refer to these characters by character reference), however, XML 1.1 is not used for XML-RPC, so this is not an option.

I'm closing this as "won't fix". If there is interest in doing something about it, I could accept a patch that rejects non-Char characters with an exception, instead of sending ill-formed XML. I'd recommend to use a regular expression for that.
msg118522 - (view) Author: Peter Gyorko (gyorkop) Date: 2010-10-13 14:16
The shortest code which can trigger this error is the following:

>>> import xmlrpclib
>>> print xmlrpclib.dumps(('\x01',))
<params>
<param>
<value><string></string></value>
</param>
</params>

As you can see, the escape method does not care about non-printable characters which can cause parsing error in the other side.

My previous patch used \x to tell to the other side that the value contains some binary garbage. It you want to reject these binary bytes (which was not acceptable in my case), use this patch:

--- a/xmlrpclib.py	2010-10-13 14:45:02.000000000 +0200
+++ b/xmlrpclib.py	2010-10-13 16:03:14.000000000 +0200
@@ -165,6 +165,9 @@
     return data
 
 def escape(s, replace=string.replace):
+    if (None != re.search('[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\xff]', s)):
+        raise Fault(INVALID_ENCODING_CHAR, 'Non-printable character in string')
+
     s = replace(s, "&", "&amp;")
     s = replace(s, "<", "&lt;")
     return replace(s, ">", "&gt;",)

An other idea: we may use CDATA (http://www.w3schools.com/xml/xml_cdata.asp) to transfer binary values...
msg118558 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-13 18:32
No, CDATA is not an appropriate mechanism to encapsulate bytes in XML. The data in the CDATA section must still match the Char production, and it must still follow the encoding.
msg118561 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-13 18:41
Thanks for the new patch.  I suggest one style change before committing:

 if re.search(<snip>, s) is not None:

Agree that CDATA is not appropriate.

Martin: I’m reopening the bug in reaction to your message “I could accept a patch that rejects non-Char characters with an exception”, hope it’s fine.
msg118564 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-13 19:06
Éric, I think the patch needs some rework. First, it is incorrect/incomplete: please see the Char definition for a complete list of characters that must be excluded. This then raises a Unicode vs. bytes issue, where invalid Unicode characters must be prohibited before the string actually being encoded (since apply the regex to the encoded string is not practical).

The other side of the bytes vs. string issue is that the bytes really ought to be in self.encoding, which doesn't get checked, either.

And then, it lacks tests.
msg118716 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-14 21:03
My bad, I read too fast.

Peter: Can you rework your patch to address Martin’s review?
History
Date User Action Args
2016-01-20 10:20:00serhiy.storchakasetstatus: open -> closed
superseder: xmlrpc library returns string which contain null ( \x00 )
resolution: duplicate
stage: patch review -> resolved
2010-10-14 21:03:59eric.araujosetmessages: + msg118716
2010-10-13 19:06:03loewissetmessages: + msg118564
2010-10-13 18:41:37eric.araujosetstatus: closed -> open
versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
messages: + msg118561

resolution: wont fix -> (no value)
stage: patch review
2010-10-13 18:32:19loewissetmessages: + msg118558
2010-10-13 14:16:53gyorkopsetmessages: + msg118522
2010-10-12 17:17:03loewissetstatus: open -> closed
resolution: wont fix
messages: + msg118439
2010-10-12 16:52:46eric.araujosetnosy: + loewis, eric.araujo
messages: + msg118434
2010-10-11 16:16:39gyorkopcreate