UnicodeEncodeError: surrogates not allowed when parsing invalid charset #87489

andersk · 2021-02-25T18:18:08Z

BPO	43323
Nosy	@warsaw, @bitdancer, @andersk, @serhiy-storchaka, @glaubitz
PRs	bpo-43323: Fix UnicodeEncodeError in the email module #32137

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-02-25.18:18:08.224>
labels = ['type-bug', 'expert-email', '3.10', '3.11', 'library', '3.9']
title = 'UnicodeEncodeError: surrogates not allowed when parsing invalid charset'
updated_at = <Date 2022-03-28.22:29:32.019>
user = 'https://github.com/andersk'

bugs.python.org fields:

activity = <Date 2022-03-28.22:29:32.019>
actor = 'glaubitz'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'email']
creation = <Date 2021-02-25.18:18:08.224>
creator = 'andersk'
dependencies = []
files = []
hgrepos = []
issue_num = 43323
keywords = ['patch']
message_count = 9.0
messages = ['387685', '410426', '416092', '416101', '416107', '416109', '416112', '416113', '416225']
nosy_count = 6.0
nosy_names = ['barry', 'r.david.murray', 'andersk', 'serhiy.storchaka', 'mdengler', 'glaubitz']
pr_nums = ['32137']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue43323'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

andersk · 2021-02-25T18:18:08Z

We ran into a UnicodeEncodeError exception using email.parser to parse this email <https://lists.cam.ac.uk/pipermail/cl-isabelle-users/2021-February/msg00135.html>, with full headers available in the raw archive <https://lists.cam.ac.uk/pipermail/cl-isabelle-users/2021-February.txt>. The offending header is hilariously invalid:

Content-Type: text/plain; charset*=utf-8”''utf-8%E2%80%9D

but I’m filing an issue since the parser is intended to be robust against invalid input. Minimal reproduction:

>>> import email, email.policy
>>> email.message_from_bytes(b"Content-Type: text/plain; charset*=utf-8\xE2\x80\x9D''utf-8%E2%80%9D", policy=email.policy.default)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/email/__init__.py", line 46, in message_from_bytes
    return BytesParser(*args, **kws).parsebytes(s)
  File "/usr/local/lib/python3.10/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/local/lib/python3.10/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/local/lib/python3.10/email/parser.py", line 57, in parse
    return feedparser.close()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 187, in close
    self._call_parse()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/local/lib/python3.10/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/local/lib/python3.10/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/local/lib/python3.10/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 608, in __call__
    return self[name](name, value)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 453, in parse
    kwds['decoded'] = str(parse_tree)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 126, in __str__
    return ''.join(str(x) for x in self)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 126, in <genexpr>
    return ''.join(str(x) for x in self)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 798, in __str__
    for name, value in self.params:
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 783, in params
    value = value.decode(charset, 'surrogateescape')
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 5-7: surrogates not allowed

glaubitz · 2022-01-12T19:56:34Z

I'm running into exactly this issue when using 'offlineimap' which is written in Python.

serhiy-storchaka · 2022-03-26T21:29:09Z

It is interesting that you get a UnicodeEncodeError when trying to decode. Could the charset name contain non-ASCII characters?

andersk · 2022-03-27T02:37:49Z

It could and does, as quoted in my original report.

Content-Type: text/plain; charset*=utf-8”''utf-8%E2%80%9D

That’s a U+201D right double quotation mark.

This is not a valid charset for the charset of course, but it seems like the code was intended to handle an invalid charset value without crashing, so it should also handle an invalid charset charset (despite the absurdity of the entire concept of a charset charset).
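A minimal sketch of why a decode call ends in an encode error here. It assumes the parser has turned the raw header bytes into a str using ASCII with surrogateescape, so the three UTF-8 bytes of U+201D arrive as three lone surrogates in the charset name. `describe_decode` is a hypothetical helper written only for this illustration; it is not part of the email package.

```python
# Raw bytes of the charset name as they appear in the offending header.
raw_charset = b"utf-8\xe2\x80\x9d"

# Decoding as ASCII with surrogateescape maps each non-ASCII byte to a lone
# surrogate (U+DCE2, U+DC80, U+DC9D) instead of failing.
charset = raw_charset.decode("ascii", "surrogateescape")


def describe_decode(data: bytes, charset_name: str) -> str:
    """Hypothetical helper: name the exception type raised by decoding, if any."""
    try:
        data.decode(charset_name, "surrogateescape")
    except LookupError:
        # An unknown but well-formed codec name ends up here.
        return "LookupError"
    except UnicodeEncodeError:
        # A codec name containing lone surrogates cannot be encoded for the
        # codec lookup, so the failure happens before any lookup takes place.
        return "UnicodeEncodeError"
    return "ok"


for name in ("utf-8", "no-such-charset", charset):
    print(repr(name), "->", describe_decode(b"abc", name))
```

The surrogates sit at indexes 5 to 7 of the charset name, which matches the "position 5-7" in the traceback: the error is about the codec name, not about the data being decoded.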

serhiy-storchaka · 2022-03-27T07:00:24Z

Sorry, I was puzzled by the exception type and missed details in a long traceback (I have issues with reading large texts). Thank you for your detailed report.

The simple fix is to add UnicodeEncodeError to "except LookupError". But there may be other places where we can get a similar error. They should be fixed too.
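The simple fix described above could look roughly like the following. This is a sketch only: `decode_param_value` is a hypothetical stand-alone helper, and the us-ascii fallback is an assumption made for the illustration rather than a statement of what the email package or the eventual patch does.

```python
def decode_param_value(value: bytes, charset: str) -> str:
    """Hypothetical helper: decode a parameter value, tolerating a bad charset name."""
    try:
        return value.decode(charset, "surrogateescape")
    except (LookupError, UnicodeEncodeError):
        # LookupError: the charset name is well formed but unknown.
        # UnicodeEncodeError: the charset name itself contains lone surrogates.
        # Assumed fallback: keep the bytes recoverable via surrogateescape.
        return value.decode("us-ascii", "surrogateescape")


bad_charset = b"utf-8\xe2\x80\x9d".decode("ascii", "surrogateescape")
print(repr(decode_param_value(b"caf\xc3\xa9", "utf-8")))
print(repr(decode_param_value(b"caf\xc3\xa9", bad_charset)))
```

With the fallback, the non-ASCII bytes of the value survive as surrogate escapes, so a caller can still re-encode them to the original bytes.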

Alternatively we can do something when we get an invalid charset from the parsed data. I am not the email package expert, so I do not know what would be better in that context.
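One way to picture the alternative, validating the charset name as soon as it comes out of the parsed data, is sketched below. `normalize_charset` and its `default` parameter are hypothetical names introduced for this illustration; whether the email package should do anything like this is exactly the open question raised above.

```python
import codecs


def normalize_charset(name: str, default: str = "us-ascii") -> str:
    """Hypothetical helper: return a usable codec name, or `default` if invalid."""
    # Codec names that came from surrogate-escaped bytes are never ASCII,
    # so reject them before attempting a lookup.
    if not name.isascii():
        return default
    try:
        # codecs.lookup() returns a CodecInfo whose .name is the canonical name.
        return codecs.lookup(name).name
    except (LookupError, ValueError):
        # Unknown codec, or a name the lookup machinery refuses outright.
        return default


bad_charset = b"utf-8\xe2\x80\x9d".decode("ascii", "surrogateescape")
print(normalize_charset("UTF-8"))
print(normalize_charset(bad_charset))
print(normalize_charset("no-such-charset"))
```

Compared with widening the except clause at each decode call, this approach checks the name once, at the cost of silently replacing whatever the message actually declared.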

glaubitz · 2022-03-27T07:28:01Z

Hi Serhiy!

> The simple fix is to add UnicodeEncodeError to "except LookupError". But there may be other places where we can get a similar error. They should be fixed too.

I would be very interested to test this as this issue currently blocks my use of offlineimap.

Would you mind creating a proof-of-concept patch for me if it's not too much work?

Thanks!

serhiy-storchaka · 2022-03-27T09:51:11Z

I fixed all suspicious places for which I found reproducers in PR 32137.

glaubitz · 2022-03-27T09:52:39Z

Awesome, thanks! I'll give it a try later today or tomorrow.

glaubitz · 2022-03-28T22:29:32Z

> Awesome, thanks! I'll give it a try later today or tomorrow.

I have applied the patch and the problem seems to have been fixed. \o/

furkanonder · 2023-05-08T23:58:54Z

@serhiy-storchaka The issue seems to be resolved. We can close it.

andersk (mannequin) added labels 3.7 (EOL), 3.8, 3.10, 3.9, and topic-email on Feb 25, 2021

terryjreedy added label type-bug and removed label 3.7 (EOL) on Feb 27, 2021

serhiy-storchaka added labels stdlib and 3.11 and removed label 3.8 on Mar 27, 2022

ezio-melotti transferred this issue from another repository Apr 10, 2022

pochmann3 mentioned this issue Mar 13, 2024

Error in email module exception handling #116705

Open

serhiy-storchaka closed this as completed Apr 17, 2024
