This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: email.message.Message set_charset does not encode properly?
Type: behavior Stage: resolved
Components: Documentation, Library (Lib) Versions: Python 3.1, Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: Shay.Rojansky, python-dev, r.david.murray, sdaoden
Priority: normal Keywords: patch

Created on 2011-02-14 20:26 by Shay.Rojansky, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
set_charset_doc.patch r.david.murray, 2011-02-22 00:28 review
Messages (11)
msg128573 - (view) Author: Shay Rojansky (Shay.Rojansky) Date: 2011-02-14 20:26
This may be my misunderstanding of the correct behavior, thanks in advance for your patience...

The issue happens when calling set_charset (or set_payload charset) after a Message has already been created and contains a Content-Transfer-Encoding header. Here's an example:

>>> from email.mime.text import MIMEText
>>> 
>>> part = MIMEText('Some stuff aéàçça', 'plain', 'utf-8')
>>> print part
From nobody Mon Feb 14 19:57:17 2011
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

U29tZSBzdHVmZiBhw6nDoMOnw6dh

>>> part.set_payload('Other stuff aéàçça', 'utf-8')
>>> print part
From nobody Mon Feb 14 19:57:25 2011
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

Other stuff aéàçça
>>> 
>>> del part['Content-Transfer-Encoding']
>>> part.set_payload('Still some other stuff aéàçça', 'utf-8')
>>> print part
From nobody Mon Feb 14 19:57:40 2011
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

U3RpbGwgc29tZSBvdGhlciBzdHVmZiBhw6nDoMOnw6dh

First the text part is created with charset UTF-8, a dump shows a properly-encoded base64 UTF-8 part.

Then an attempt is made to modify the payload. The set_charset documentation clearly states that the message will be properly encoded/converted, but we get a malformed part with Content-Transfer-Endogin=base64 but without a base64-encoded payload.

Finally, as a workaround, I delete the Content-Transfer-Encoding header and try again, at which point the new payload is properly encoded.

Again, I'm sure there are reasons for this behavior, which nevertheless seems like a bug to me (shouldn't set_charset perform base64 and change the Content-Transfer-Encoding if necessary regardless of previous headers?). Maybe a documentation update would help people with this.

Thank you very much!
msg128574 - (view) Author: Shay Rojansky (Shay.Rojansky) Date: 2011-02-14 20:37
Sorry, the "print part" below doesn't work on Python3 (I'm back in 2.6), but I see the same trouble by using part.__str__()
msg128575 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-02-14 20:58
Well, the docs don't say that the headers will be removed or modified as needed, only added as needed ;/.  And in fact the set_charset code does "if 'Content-Transfer-Encoding' not in self".  set_payload also says it is the caller's responsibility to "maintain the payload's invariants".  I'm not sure what that means, but I suspect it means making sure it is consistent with any existing Content-Transfer-Encoding header.

So it looks to me like this is (a) a doc bug and (b) an API bug(*).  The latter can only be corrected in 3.3.

I'm open to argument that this should be treated as a code bug, but that argument needs to include reasons why fixing it would not cause backward incompatibility problems.

Although I assigned this to myself so I don't lose it, a proposed documentation patch would be welcome.

(*) I say an API bug because my guess is that the API works the way it does on the theory that a program might be updating an existing Message object and should in that instance change only what it is directed to change (the payload content, and the charset attribute if specified), even if that would produce an invalid Message.  The way the email package evolved this would make sense, but I think if we were (when we are) doing it over now a better API would be to maintain a valid Message object through the standard API and provide a "lower level" API for those programs needing to produce RFC-invalid messages for whatever reason.
msg128583 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-02-15 11:05
Shay.Rojansky: because of the fact that i needed a free last saturday for just having the time to click around a bit to find the relevant docs in the python.org jungle ...

<http://wiki.python.org/moin/Email%20SIG> and *especially* <http://wiki.python.org/moin/Email%20SIG/DesignThoughts> may be of interest for you - read those when you are starting to think that it's you who is broken.  Nope!  It's really the package!

P.S.: sorry, David, i really didn't know, and somehow i also missed that single minute to look at bitdance.com.  But now i've found the docs and maybe here time is being worth enough for a broken package to be fixed,  the right way.  Then even more people can earn more money in a hurry - with it.  (It's only school english in the end.)
msg128813 - (view) Author: Shay Rojansky (Shay.Rojansky) Date: 2011-02-18 20:24
Sorry for disappearing, will be following this closer from now on.

Thanks for the responses (and the acknowledgement of a problem), David and Steffen.

I understand the API design philosophy and the need to allow the production of invalid messages. The main problem point to me is what seems to be an inconsistency in the API:
* When constructing a new message/part (i.e. in the constructor), management of the Content-Transfer-Encoding header and responsibility for doing base64 is done by the API.
* But when using mutators (set_payload, set_charset), these become the responsibility of the caller, etc.

So one aspect of the API is high-level, another low-level. I have no idea whether this was a planned thing in some way, and have a hard time estimating whether modifying the behavior (i.e. making set_payload/set_charset high-level like the constructor) would break apps that depend on the current low-level behavior...

One thought though... Another part of the API already allows the user to indicate whether they want low or high level behavior: get_payload accepts the boolean decode flag. Would it not make sense to add a similar "encode" flag to set_payload? To be complete such a flag could also be added to the constructor.

Apart from that, I guess a good doc patch would run along the lines of appending the following to set_charset's description:
"Note that if a Content-Transfer-Encoding header is already present on the message, it is kept as is and no content transfer encoding is applied to the payload."



This can
msg129009 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-02-21 23:19
While your idea has merit, we can only add a parameter in a new version (not in a bug-fix release), so the doc fix is all we can do before 3.3.
msg129010 - (view) Author: Shay Rojansky (Shay.Rojansky) Date: 2011-02-21 23:40
Thanks and no problem for me, the workaround (deleting the header) works just fine.

I'm not sure if/when the more general discussion on the package will take place (low-level vs. high-level), I would be interested in following.
msg129015 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-02-22 00:28
Discussion of the new API, and what changes will be made for 3.3, will take place on the email-sig mailing list (see http://mail.python.org).  Please join!  Right now it is a dormant list, but I plan to post some stuff soon :)

The more I stare at the set_charset code and try to write out the actual algorithm in words, the buggier this looks.  If the input and output charsets differ, the body encoding is done unconditionally, but the CTE is set in that case only if the CTE header doesn't exist.

I've attached a proposed doc patch which tries to describe the actual algorithm.  If you'd care to take a look and see if you see any mistakes in my description, that would be great.  Note that you have to go down a couple of twisty passages if you want to fully understand what the set_charset code is doing.
msg131044 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-03-15 21:43
New changeset cf1859f9aed9 by R David Murray in branch '3.1':
#11216: document all possible set_charset execution paths.
http://hg.python.org/cpython/rev/cf1859f9aed9

New changeset 4d25b9d2aa05 by R David Murray in branch '3.2':
Merge #11216: document all possible set_charset execution paths.
http://hg.python.org/cpython/rev/4d25b9d2aa05

New changeset 2dd70fd26f24 by R David Murray in branch 'default':
Merge #11216: document all possible set_charset execution paths.
http://hg.python.org/cpython/rev/2dd70fd26f24
msg131046 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-03-15 22:01
New changeset 43ccd051ec6c by R David Murray in branch '2.7':
Merge #11216: document all possible set_charset execution paths.
http://hg.python.org/cpython/rev/43ccd051ec6c
msg131048 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-03-15 22:04
I've committed the patch as is.  We'll address improving the semantics one way or another as part of email6 development.
History
Date User Action Args
2022-04-11 14:57:12adminsetgithub: 55425
2011-03-15 22:04:59r.david.murraysetstatus: open -> closed
nosy: r.david.murray, sdaoden, Shay.Rojansky, python-dev
messages: + msg131048

resolution: fixed
stage: patch review -> resolved
2011-03-15 22:01:39python-devsetnosy: r.david.murray, sdaoden, Shay.Rojansky, python-dev
messages: + msg131046
2011-03-15 21:43:48python-devsetnosy: + python-dev
messages: + msg131044
2011-02-22 00:28:56r.david.murraysetfiles: + set_charset_doc.patch
nosy: r.david.murray, sdaoden, Shay.Rojansky
messages: + msg129015

components: + Documentation
keywords: + patch
stage: needs patch -> patch review
2011-02-21 23:40:29Shay.Rojanskysetnosy: r.david.murray, sdaoden, Shay.Rojansky
messages: + msg129010
2011-02-21 23:19:25r.david.murraysetnosy: r.david.murray, sdaoden, Shay.Rojansky
messages: + msg129009
versions: + Python 2.7, Python 3.2, Python 3.3, - Python 2.6
2011-02-18 20:24:35Shay.Rojanskysetnosy: r.david.murray, sdaoden, Shay.Rojansky
messages: + msg128813
2011-02-15 11:05:32sdaodensetnosy: + sdaoden
messages: + msg128583
2011-02-14 20:58:57r.david.murraysetnosy: + r.david.murray
messages: + msg128575

assignee: r.david.murray
stage: needs patch
2011-02-14 20:37:39Shay.Rojanskysetmessages: + msg128574
2011-02-14 20:26:18Shay.Rojanskycreate