New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ill-formed surrogates not treated as errors during encoding/decoding #47922
Comments
The Unicode FAQ makes it quite clear that any surrogates in UTF-8 or http://unicode.org/faq/utf_bom.html#30 Lone surrogate in UTF-8 (effectively CESU-8):
>>> '\xED\xA0\x81'.decode('utf-8')
u'\ud801'
Surrogate pair in UTF-8:
>>> '\xED\xA0\x81\xED\xB0\x80'.decode('utf-8')
u'\ud801\udc00'
On a UTF-32 build, encoding a surrogate pair with UTF-16, then decoding
again will produce the proper non-surrogate scalar value. This has
security implications, although rare as characters outside the BMP are rare:
>>> u'\ud801\udc00'.encode('utf-16').decode('utf-16')
u'\U00010400'
Also on a UTF-32 build, decoding of a lone surrogate in UTF-16 fails
(correctly), but encoding one does not:
>>> u'\ud801'.encode('utf-16')
'\xff\xfe\x01\xd8' I have gotten a report of a user decoding bad data using Fixing this would cause bpo-3297 to blow up loudly, rather than silently. |
We could fix it for 3.1, and perhaps leave 2.7 unchanged if some people |
While it's probably ok to fix the codecs, there's an issue which makes The marshal module uses utf-8 to write Unicode objects and these can and If the utf-8 codec were changed to raise an error for these, marshal It is likely that other existing Python code (outside the std lib) also Changing this would only be possible in 3.1. The marshal module would then also have to be changed to use a different See bpo-3297 for a discussion of UTF-8/16 vs. UCS2/4, the |
I think we could preserve the marshal format with yet another error |
On 2009-04-29 22:39, Martin v. Löwis @psf.upfronthosting.co.za wrote:
That's a good idea. We could have an error handler which then let's Still, this can only go into 3.1. |
Here is a patch that implements this proposed approach. It introduces a If this is accepted, the implementation of PEP-383 can be simplified |
rietveld: http://codereview.appspot.com/52081 |
Fixed indexing error. |
http://codereview.appspot.com/52081/diff/1/5 http://codereview.appspot.com/52081/diff/1/5#newcode326 http://codereview.appspot.com/52081/diff/1/5#newcode335 http://codereview.appspot.com/52081/diff/1/6 http://codereview.appspot.com/52081/diff/1/6#newcode544 http://codereview.appspot.com/52081/diff/1/4 http://codereview.appspot.com/52081/diff/1/4#newcode157 http://codereview.appspot.com/52081/diff/1/4#newcode2393 http://codereview.appspot.com/52081/diff/1/4#newcode4110 http://codereview.appspot.com/52081/diff/1/2 http://codereview.appspot.com/52081/diff/1/2#newcode758 http://codereview.appspot.com/52081/diff/1/2#newcode771 |
Reviewers: report_bugs.python.org, Benjamin, Message: http://codereview.appspot.com/52081/diff/1/5 http://codereview.appspot.com/52081/diff/1/5#newcode326
Done. http://codereview.appspot.com/52081/diff/1/5#newcode335
Done. http://codereview.appspot.com/52081/diff/1/6 http://codereview.appspot.com/52081/diff/1/6#newcode544
Done. http://codereview.appspot.com/52081/diff/1/4 http://codereview.appspot.com/52081/diff/1/4#newcode157
Done. http://codereview.appspot.com/52081/diff/1/4#newcode2393
Done. http://codereview.appspot.com/52081/diff/1/4#newcode4110
No. This is temporary - for PEP-383, I will have to support error http://codereview.appspot.com/52081/diff/1/2 http://codereview.appspot.com/52081/diff/1/2#newcode758
No. The interface is that an exception instance must be passed; http://codereview.appspot.com/52081/diff/1/2#newcode771
Done. Please review this at http://codereview.appspot.com/52081 Affected files: |
I think the new patch looks fine. |
Something I overlooked is that PyCodec_SurrogateErrors isn't exposed in |
Committed as r72208, blocked as r72209. As for PyCodec_SurrogateErrors: I'd rather make it static than expose it. |
2009/5/2 <"\"Martin v. Löwis\"
Why? All the other error handlers are exposed. |
Sure - but what for? IMO, they all shouldn't be exposed. |
2009/5/2 <"\"Martin v. Löwis\"
<report@bugs.python.org>"@psf.upfronthosting.co.za>:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
>>> As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.
>>
>> Why? All the other error handlers are exposed.
>
> Sure - but what for? IMO, they all shouldn't be exposed. The only reason I can think of is consistency, but I don't care that much. |
New changeset 2150eadb54c7 by Serhiy Storchaka in branch 'default': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: