classification
Title: UnicodeEncodeError: surrogates not allowed when parsing invalid charset
Type: behavior Stage:
Components: email Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: andersk, barry, glaubitz, r.david.murray
Priority: normal Keywords:

Created on 2021-02-25 18:18 by andersk, last changed 2022-01-12 19:56 by glaubitz.

Messages (2)
msg387685 - Author: Anders Kaseorg (andersk) Date: 2021-02-25 18:18
We ran into a UnicodeEncodeError exception using email.parser to parse this email <https://lists.cam.ac.uk/pipermail/cl-isabelle-users/2021-February/msg00135.html>, with full headers available in the raw archive <https://lists.cam.ac.uk/pipermail/cl-isabelle-users/2021-February.txt>.  The offending header is hilariously invalid:

Content-Type: text/plain; charset*=utf-8”''utf-8%E2%80%9D

but I’m filing an issue since the parser is intended to be robust against invalid input.  Minimal reproduction:

>>> import email, email.policy
>>> email.message_from_bytes(b"Content-Type: text/plain; charset*=utf-8\xE2\x80\x9D''utf-8%E2%80%9D", policy=email.policy.default)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/email/__init__.py", line 46, in message_from_bytes
    return BytesParser(*args, **kws).parsebytes(s)
  File "/usr/local/lib/python3.10/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/local/lib/python3.10/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/local/lib/python3.10/email/parser.py", line 57, in parse
    return feedparser.close()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 187, in close
    self._call_parse()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/local/lib/python3.10/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/local/lib/python3.10/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/local/lib/python3.10/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 608, in __call__
    return self[name](name, value)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 453, in parse
    kwds['decoded'] = str(parse_tree)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 126, in __str__
    return ''.join(str(x) for x in self)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 126, in <genexpr>
    return ''.join(str(x) for x in self)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 798, in __str__
    for name, value in self.params:
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 783, in params
    value = value.decode(charset, 'surrogateescape')
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 5-7: surrogates not allowed
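
Note on the error message: the "position 5-7" in the last line appears to refer to the charset name itself, not to the parameter value. The sketch below is a reading of the traceback, not a confirmed diagnosis. It assumes the raw header bytes were decoded as ASCII with the surrogateescape handler, as BytesParser does, so the three bytes of the stray ” in the charset field become lone surrogates. The helper name decode_with_charset is hypothetical.

```python
def decode_with_charset(payload: bytes, charset: str):
    """Return (text, None) on success, or (None, exc) if decoding raises."""
    try:
        return payload.decode(charset, "surrogateescape"), None
    except (LookupError, UnicodeError) as exc:
        return None, exc


# Charset field of the offending header, decoded the way the bytes parser
# decodes header text: non-ASCII bytes turn into lone surrogates.
charset = b"utf-8\xe2\x80\x9d".decode("ascii", "surrogateescape")

# Percent-decoded parameter value from the same header.
payload = b"utf-8\xe2\x80\x9d"

text, error = decode_with_charset(payload, charset)
print(type(error).__name__, error)

# For contrast, an unknown charset name made of ordinary characters
# fails with LookupError instead.
_, lookup_error = decode_with_charset(b"x", "no-such-charset")
print(type(lookup_error).__name__, lookup_error)
```

The charset string is 8 characters long with the surrogates at indices 5 to 7, which matches the positions reported in the traceback. An unknown charset name without surrogates raises LookupError rather than UnicodeEncodeError, so code that only handles LookupError around the decode call would not catch this case.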
msg410426 - Author: John Paul Adrian Glaubitz (glaubitz) Date: 2022-01-12 19:56
I'm running into exactly this issue when using 'offlineimap', which is written in Python.
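
For callers that hit this before a fix is available, one possible stopgap is to catch the exception and re-parse with the legacy policy. This is only a sketch: the parse_message helper is hypothetical, and it assumes that email.policy.compat32 avoids the failing path because it does not route headers through email.headerregistry, where the traceback above originates. The fallback returns a legacy Message object rather than an EmailMessage, so callers would need to tolerate both.

```python
import email
import email.policy


def parse_message(raw: bytes):
    """Parse with the modern policy; fall back to compat32 if it crashes."""
    try:
        return email.message_from_bytes(raw, policy=email.policy.default)
    except UnicodeEncodeError:
        # Assumption: the legacy policy skips the header registry, so the
        # malformed charset parameter is not decoded during parsing.
        return email.message_from_bytes(raw, policy=email.policy.compat32)


# The minimal reproduction from the original report.
raw = b"Content-Type: text/plain; charset*=utf-8\xE2\x80\x9D''utf-8%E2%80%9D"
msg = parse_message(raw)
print(type(msg).__name__)
```

This only keeps parsing from aborting. Later access to the malformed parameter may still misbehave, so it does not replace a fix in the parser.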
History
Date                 User         Action  Args
2022-01-12 19:56:33  glaubitz     set     nosy: + glaubitz; messages: + msg410426
2021-02-27 03:52:23  terry.reedy  set     versions: - Python 3.6, Python 3.7
2021-02-27 03:52:12  terry.reedy  set     type: behavior
2021-02-25 18:18:08  andersk      create