classification
Title: email.header.make_header() doesn't work if any `ascii` code is out of range(128)
Type: behavior Stage: resolved
Components: email Versions: Python 3.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: aldwinaldwin, barry, maxking, r.david.murray, yunlee
Priority: normal Keywords: patch

Created on 2019-07-09 21:20 by yunlee, last changed 2019-08-01 18:45 by r.david.murray. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 14696 closed aldwinaldwin, 2019-07-11 02:46
Messages (6)
msg347577 - (view) Author: Yun Li (yunlee) Date: 2019-07-09 21:20
email.header.make_header() doesn't work if any `ascii` code is out of range(128)

For example 

>>> from email.header import decode_header, make_header
>>> header = "Your booking at Voyager Int'l Hostel,=?UTF-8?B?IFBhbmFtw6EgQ2l0eQ==?=,   Panamá- Casco Antiguo"

>>> decode_header(header)
[(b"Your booking at Voyager Int'l Hostel,", None), (b' Panam\xc3\xa1 City', 'utf-8'), (b',   Panam\xe1- Casco Antiguo', None)]

>>> make_header(decode_header(header))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/email/header.py", line 174, in make_header
    h.append(s, charset)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/email/header.py", line 295, in append
    s = s.decode(input_charset, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 9: ordinal not in range(128)
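
The failure can be reproduced without the email package. The following is a minimal sketch, assuming (as the traceback and the decode_header output above suggest) that the unencoded chunk is stored as bytes via 'raw-unicode-escape' and later decoded as us-ascii because no charset was recorded for it. The variable names are illustrative only.

```python
# Step 1: the unencoded str chunk becomes bytes via 'raw-unicode-escape',
# which maps "á" (U+00E1) to the single byte 0xE1.
chunk = ",   Panamá- Casco Antiguo"
raw = bytes(chunk, "raw-unicode-escape")

# Step 2: with no charset attached, the bytes are decoded as us-ascii,
# which rejects any byte >= 0x80.
try:
    raw.decode("us-ascii", "strict")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(raw)
print(error)
```

The offending byte sits at index 9 of the chunk, which matches the position reported in the traceback.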
msg347598 - (view) Author: Aldwin Pollefeyt (aldwinaldwin) * Date: 2019-07-10 04:36
Maybe a solution: if no charset is defined, encode the word as utf-8 in decode_header, since that is Python 3's default encoding?


diff --git a/Lib/email/header.py b/Lib/email/header.py
index 4ab0032bc6..8dbfe58a57 100644
--- a/Lib/email/header.py
+++ b/Lib/email/header.py
@@ -135,7 +135,10 @@ def decode_header(header):
     collapsed = []
     last_word = last_charset = None
     for word, charset in decoded_words:
-        if isinstance(word, str):
+        if not charset and isinstance(word, str):
+            word = word.encode('utf-8')
+            charset = 'utf-8'
+        elif isinstance(word, str):
             word = bytes(word, 'raw-unicode-escape')
         if last_word is None:
             last_word = word



Python 3.9.0a0 (heads/master:110a47c4f4, Jul 10 2019, 11:32:53) 
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email.header
>>> header = "Your booking at Voyager Int'l Hostel,=?UTF-8?B?IFBhbmFtw6EgQ2l0eQ==?=,   Panamá- Casco Antiguo"
>>> print(email.header.make_header(email.header.decode_header(header)))
Your booking at Voyager Int'l Hostel, Panamá City,   Panamá- Casco Antiguo
>>>
msg347607 - (view) Author: Aldwin Pollefeyt (aldwinaldwin) * Date: 2019-07-10 08:00
Changing everything to utf-8 breaks a lot of tests, so here is a less invasive solution?

diff --git a/Lib/email/header.py b/Lib/email/header.py
index 4ab0032bc6..1e71eeae7f 100644
--- a/Lib/email/header.py
+++ b/Lib/email/header.py
@@ -136,7 +136,14 @@ def decode_header(header):
     last_word = last_charset = None
     for word, charset in decoded_words:
         if isinstance(word, str):
-            word = bytes(word, 'raw-unicode-escape')
+            word_tmp = bytes(word, 'raw-unicode-escape')
+            input_charset = charset or 'us-ascii'
+            try:
+                _ = word_tmp.decode(input_charset, errors='strict')
+                word = word_tmp
+            except UnicodeDecodeError:
+                word = str(word).encode('utf-8')
+                charset = 'utf-8'
         if last_word is None:
             last_word = word
             last_charset = charset
msg348851 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2019-08-01 12:42
The input header is not valid (non-ASCII is not allowed in headers), so you shouldn't expect make_header to do anything sensible.  Note that this is the legacy API, which is a toolkit and does not hold your hand when it comes to RFC compliance.  Aside from any other concerns, this is long-standing behavior (it is the same in Python 2), and it doesn't make sense to change the behavior of a legacy API.
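
To illustrate the validity point: when the non-ASCII text is first encoded as an RFC 2047 encoded-word, the header is pure ASCII and the legacy helpers round-trip it. This is a minimal sketch using the legacy `email.header.Header` class; the sample text is taken from the report, and the choice of utf-8 as the charset is an assumption.

```python
from email.header import Header, decode_header, make_header

text = "Panamá- Casco Antiguo"

# Encode the non-ASCII text as an RFC 2047 encoded-word (ASCII only).
encoded = Header(text, charset="utf-8").encode()

# A valid header decodes and rebuilds without errors.
decoded = str(make_header(decode_header(encoded)))

print(encoded)
print(decoded)
```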
msg348868 - (view) Author: Yun Li (yunlee) Date: 2019-08-01 17:51
Hi, David:

I don't think your argument stands here. The world does not consist only of
English-speaking countries; there are Spanish, Russian, Chinese, and many
other languages. Even a legacy package should support all languages, not just
English. This is definitely a bug in this package. I hope the Python
developers will either fix this issue or explicitly document the function as
"English only". Otherwise the package is very annoying to use for people in
other countries.

Thanks!
Yun

On Thu, Aug 1, 2019 at 5:42 AM R. David Murray <report@bugs.python.org>
wrote:

>
> R. David Murray <rdmurray@bitdance.com> added the comment:
>
> The input header is not valid (non-ascii is not allowed in headers), so
> you shouldn't expect make_header to do anything sensible.  Note that this
> is the legacy API, which is a toolkit and does not hold your hand when it
> comes to RFC compliance.  Aside from any other concerns, this is long
> standing behavior (it is the same in python2), and it doesn't make sense to
> change the behavior of a legacy API.
>
> ----------
> resolution:  -> not a bug
> stage: patch review -> resolved
> status: open -> closed
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue37532>
> _______________________________________
>
msg348871 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2019-08-01 18:45
Right, and the Python email package fully supports non-ASCII:

>>> from email.message import EmailMessage
>>> msg = EmailMessage()
>>> msg['Subject'] = "Panamá- Casco Antiguo"
>>> bytes(msg)
b'Subject: =?utf-8?q?Panam=C3=A1-?= Casco Antiguo\n\n'
>>> str(msg)
'Subject: Panamá- Casco Antiguo\n\n'
>>> msg['subject']
'Panamá- Casco Antiguo'

make_header also supports non-ASCII; you just have to tell it which charset to use.  Like I said, make_header is part of the *legacy* API, and it really is a pain to use.  That's why we wrote the new API.
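
One possible reading of "tell it what charset you want to use" is sketched below. This is not code from the email package: `label_unencoded_parts` is a hypothetical helper, and utf-8 as the default charset is an assumption. It takes the pairs returned by decode_header and attaches an explicit charset to any non-ASCII part that has none, so that make_header no longer falls back to us-ascii for it.

```python
from email.header import decode_header, make_header

header = ("Your booking at Voyager Int'l Hostel,"
          "=?UTF-8?B?IFBhbmFtw6EgQ2l0eQ==?=,   Panamá- Casco Antiguo")


def label_unencoded_parts(parts, charset="utf-8"):
    """Hypothetical helper: give non-ASCII parts without a charset an explicit one."""
    fixed = []
    for word, word_charset in parts:
        if word_charset is None:
            if isinstance(word, bytes):
                # decode_header() built these bytes with 'raw-unicode-escape',
                # so the same codec recovers the text (this assumes the header
                # contains no literal backslash-u escape sequences).
                word = word.decode("raw-unicode-escape")
            if not word.isascii():
                word_charset = charset
        fixed.append((word, word_charset))
    return fixed


h = make_header(label_unencoded_parts(decode_header(header)))
text = str(h)        # human-readable form
encoded = h.encode() # RFC 2047 form, ASCII only

print(text)
print(encoded)
```

Whitespace between the rebuilt parts may differ slightly from the input, because the legacy Header class joins and separates chunks by its own rules.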
History
Date User Action Args
2019-08-01 18:45:58 r.david.murray set messages: + msg348871
2019-08-01 17:51:55 yunlee set messages: + msg348868
2019-08-01 12:42:01 r.david.murray set status: open -> closed; resolution: not a bug; messages: + msg348851; stage: patch review -> resolved
2019-07-11 02:46:22 aldwinaldwin set keywords: + patch; stage: patch review; pull_requests: + pull_request14496
2019-07-10 08:00:16 aldwinaldwin set messages: + msg347607
2019-07-10 05:33:22 xtreak set nosy: + maxking
2019-07-10 04:36:50 aldwinaldwin set nosy: + aldwinaldwin; messages: + msg347598
2019-07-09 21:20:46 yunlee create