Issue 43323: UnicodeEncodeError: surrogates not allowed when parsing invalid charset


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87489

classification

Title:	UnicodeEncodeError: surrogates not allowed when parsing invalid charset
Type:	behavior	Stage:	patch review
Components:	email, Library (Lib)	Versions:	Python 3.11, Python 3.10, Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	andersk, barry, glaubitz, mdengler, r.david.murray, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2021-02-25 18:18 by andersk, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 32137	open	serhiy.storchaka, 2022-03-27 09:45

Messages (9)
msg387685 - (view)	Author: Anders Kaseorg (andersk) *	Date: 2021-02-25 18:18
We ran into a UnicodeEncodeError exception using email.parser to parse this email <https://lists.cam.ac.uk/pipermail/cl-isabelle-users/2021-February/msg00135.html>, with full headers available in the raw archive <https://lists.cam.ac.uk/pipermail/cl-isabelle-users/2021-February.txt>. The offending header is hilariously invalid:

    Content-Type: text/plain; charset*=utf-8”''utf-8%E2%80%9D

but I’m filing an issue since the parser is intended to be robust against invalid input. Minimal reproduction:

    >>> import email, email.policy
    >>> email.message_from_bytes(b"Content-Type: text/plain; charset*=utf-8\xE2\x80\x9D''utf-8%E2%80%9D", policy=email.policy.default)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.10/email/__init__.py", line 46, in message_from_bytes
        return BytesParser(*args, **kws).parsebytes(s)
      File "/usr/local/lib/python3.10/email/parser.py", line 123, in parsebytes
        return self.parser.parsestr(text, headersonly)
      File "/usr/local/lib/python3.10/email/parser.py", line 67, in parsestr
        return self.parse(StringIO(text), headersonly=headersonly)
      File "/usr/local/lib/python3.10/email/parser.py", line 57, in parse
        return feedparser.close()
      File "/usr/local/lib/python3.10/email/feedparser.py", line 187, in close
        self._call_parse()
      File "/usr/local/lib/python3.10/email/feedparser.py", line 180, in _call_parse
        self._parse()
      File "/usr/local/lib/python3.10/email/feedparser.py", line 256, in _parsegen
        if self._cur.get_content_type() == 'message/delivery-status':
      File "/usr/local/lib/python3.10/email/message.py", line 578, in get_content_type
        value = self.get('content-type', missing)
      File "/usr/local/lib/python3.10/email/message.py", line 471, in get
        return self.policy.header_fetch_parse(k, v)
      File "/usr/local/lib/python3.10/email/policy.py", line 163, in header_fetch_parse
        return self.header_factory(name, value)
      File "/usr/local/lib/python3.10/email/headerregistry.py", line 608, in __call__
        return self[name](name, value)
      File "/usr/local/lib/python3.10/email/headerregistry.py", line 196, in __new__
        cls.parse(value, kwds)
      File "/usr/local/lib/python3.10/email/headerregistry.py", line 453, in parse
        kwds['decoded'] = str(parse_tree)
      File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 126, in __str__
        return ''.join(str(x) for x in self)
      File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 126, in <genexpr>
        return ''.join(str(x) for x in self)
      File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 798, in __str__
        for name, value in self.params:
      File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 783, in params
        value = value.decode(charset, 'surrogateescape')
    UnicodeEncodeError: 'utf-8' codec can't encode characters in position 5-7: surrogates not allowed
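Editor's sketch of why a decode call can raise an *encode* error here. It is an illustration outside the email package, not the package's own code. It assumes the charset name reaches `bytes.decode()` as a str holding lone surrogates, which is what decoding the raw header bytes as ASCII with `surrogateescape` would produce. The helper `try_decode` is hypothetical.

```python
# The curly quote's UTF-8 bytes are not ASCII, so decoding the header bytes
# with 'surrogateescape' turns them into lone surrogates in the charset name.
raw_charset = b"utf-8\xe2\x80\x9d"
charset = raw_charset.decode("ascii", "surrogateescape")


def try_decode(value: bytes, charset_name: str):
    """Hypothetical helper: return the decoded text, or the exception raised."""
    try:
        return value.decode(charset_name, "surrogateescape")
    except Exception as exc:  # broad on purpose: we want to inspect the type
        return exc


# The codec name itself has to be encoded before the codec lookup, and lone
# surrogates cannot be encoded, so the failure is not a LookupError.
result = try_decode(b"utf-8", charset)
print(type(result).__name__)

# An unknown but well-formed charset name fails with LookupError instead.
unknown = try_decode(b"utf-8", "no-such-charset")
print(type(unknown).__name__)
```

The surrogates sit at indices 5 to 7 of the charset name, matching "position 5-7" in the traceback above.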
msg410426 - (view)	Author: John Paul Adrian Glaubitz (glaubitz)	Date: 2022-01-12 19:56
I'm running into exactly this issue when using 'offlineimap' which is written in Python.
msg416092 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-03-26 21:29
It is interesting that you get a UnicodeEncodeError when trying to decode. Could the charset name contain non-ASCII characters?
msg416101 - (view)	Author: Anders Kaseorg (andersk) *	Date: 2022-03-27 02:37
It could and does, as quoted in my original report. Content-Type: text/plain; charset*=utf-8”''utf-8%E2%80%9D That’s a U+201D right double quotation mark. This is not a valid charset for the charset of course, but it seems like the code was intended to handle an invalid charset value without crashing, so it should also handle an invalid charset charset (despite the absurdity of the entire concept of a charset charset).
msg416107 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-03-27 07:00
Sorry, I was puzzled by the exception type and missed details in a long traceback (I have issues with reading large texts). Thank you for your detailed report. The simple fix is to add UnicodeEncodeError to "except LookupError". But there may be other places where we can get a similar error. They should be fixed too. Alternatively we can do something when we get an invalid charset from the parsed data. I am not an expert on the email package, so I do not know what would be better in that context.
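Editor's sketch of the "simple fix" described above: treat UnicodeEncodeError like LookupError when decoding with a charset name taken from the message. This is a standalone illustration under assumptions, not the patch in PR 32137. The function name `decode_param` and the us-ascii fallback are chosen for the example.

```python
def decode_param(value: bytes, charset: str) -> str:
    """Decode a parameter value, tolerating unusable charset names.

    Hypothetical helper for illustration only.
    """
    try:
        return value.decode(charset, "surrogateescape")
    except (LookupError, UnicodeEncodeError):
        # LookupError: the charset name is well formed but unknown.
        # UnicodeEncodeError: the charset name contains lone surrogates.
        # Assumed fallback: decode as ASCII, keeping non-ASCII bytes as
        # surrogate escapes so no data is lost.
        return value.decode("us-ascii", "surrogateescape")


bad_charset = b"utf-8\xe2\x80\x9d".decode("ascii", "surrogateescape")

print(decode_param(b"caf\xc3\xa9", "utf-8"))                  # valid charset
print(ascii(decode_param(b"caf\xc3\xa9", "no-such-charset")))  # unknown name
print(ascii(decode_param(b"caf\xc3\xa9", bad_charset)))        # surrogates in name
```

As the message notes, other call sites may need the same treatment, and a single helper like this would only cover the places that use it.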
msg416109 - (view)	Author: John Paul Adrian Glaubitz (glaubitz)	Date: 2022-03-27 07:28
Hi Serhiy!

> The simple fix is to add UnicodeEncodeError to "except LookupError". But there may be other places where we can get a similar error. They should be fixed too.

I would be very interested to test this as this issue currently blocks my use of offlineimap. Would you mind creating a proof-of-concept patch for me if it's not too much work? Thanks!
msg416112 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-03-27 09:51
I fixed all suspicious places for which I found reproducers in PR 32137.
msg416113 - (view)	Author: John Paul Adrian Glaubitz (glaubitz)	Date: 2022-03-27 09:52
Awesome, thanks! I'll give it a try later today or tomorrow.
msg416225 - (view)	Author: John Paul Adrian Glaubitz (glaubitz)	Date: 2022-03-28 22:29
> Awesome, thanks! I'll give it a try later today or tomorrow.

I have applied the patch and the problem seems to have been fixed. \o/

History
Date	User	Action	Args
2022-04-11 14:59:42	admin	set	github: 87489
2022-03-28 22:29:32	glaubitz	set	messages: + msg416225
2022-03-27 09:52:38	glaubitz	set	messages: + msg416113
2022-03-27 09:51:11	serhiy.storchaka	set	messages: + msg416112
2022-03-27 09:45:33	serhiy.storchaka	set	keywords: + patch stage: patch review pull_requests: + pull_request30217
2022-03-27 07:28:00	glaubitz	set	messages: + msg416109
2022-03-27 07:00:23	serhiy.storchaka	set	messages: + msg416107 components: + Library (Lib) versions: + Python 3.11, - Python 3.8
2022-03-27 02:37:49	andersk	set	messages: + msg416101
2022-03-26 21:29:09	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg416092
2022-03-26 21:06:41	mdengler	set	nosy: + mdengler
2022-01-12 19:56:33	glaubitz	set	nosy: + glaubitz messages: + msg410426
2021-02-27 03:52:23	terry.reedy	set	versions: - Python 3.6, Python 3.7
2021-02-27 03:52:12	terry.reedy	set	type: behavior
2021-02-25 18:18:08	andersk	create