classification
Title: Charset.header_encode in email.charset doesn't take a maxlinelen argument and has inconsistent behavior with different encodings
Type: behavior Stage:
Components: email, Library (Lib) Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, r.david.murray, rednaw
Priority: normal Keywords:

Created on 2014-02-23 17:59 by rednaw, last changed 2014-02-24 16:09 by r.david.murray.

Messages (9)
msg212011 - (view) Author: Rik (rednaw) Date: 2014-02-23 17:59
If you look at the `header_encode` method in the `Charset` class in `email.charset`, you'll see that depending on the `header_encoding` that is set on the `Charset` instance, it will either encode it using base64 or quoted-printable (QP):

http://hg.python.org/cpython/file/3a1db0d2747e/Lib/email/charset.py#l351

However, QP always uses `maxlinelen=None` and base64 doesn't. This results in the following behaviour:

- If you use base64 encoding and your header size is longer than the default `maxlinelen`, it will be split over multiple lines.
- If you use QP encoding with the same header it doesn't get split over multiple lines.

You can easily test it with this snippet:

    from email.charset import Charset, BASE64, QP

    header = (
        'tejkstj tlkjes takldjf aseio neaoiflk asnfoieas nflkdan foeias '
        'naskln ioeasn kldan flkansoie naslk dnaslk fndaslk fneoisaf '
        'neklasn dfklasnf oiasenf lkadsn lkfanldk fas dfknaioe nas'
    )

    charset = Charset('utf-8')

    charset.header_encoding = BASE64
    print 'BASE64:'
    print charset.header_encode(header)

    charset.header_encoding = QP
    print 'QP:'
    print charset.header_encode(header)

Which will output:

    BASE64:
    =?utf-8?b?dGVqa3N0aiB0bGtqZXMgdGFrbGRqZiBhc2VpbyBuZWFvaWZsayBhc25mb2llYXMg?=
     =?utf-8?b?bmZsa2RhbiBmb2VpYXMgbmFza2xuIGlvZWFzbiBrbGRhbiBmbGthbnNvaWUgbmFz?=
     =?utf-8?b?bGsgZG5hc2xrIGZuZGFzbGsgZm5lb2lzYWYgbmVrbGFzbiBkZmtsYXNuZiBvaWFz?=
     =?utf-8?b?ZW5mIGxrYWRzbiBsa2ZhbmxkayBmYXMgZGZrbmFpb2UgbmFz?=
    QP:
    =?utf-8?q?tejkstj_tlkjes_takldjf_aseio_neaoiflk_asnfoieas_nflkdan_foeias_naskln_ioeasn_kldan_flkansoie_naslk_dnaslk_fndaslk_fneoisaf_neklasn_dfklasnf_oiasenf_lkadsn_lkfanldk_fas_dfknaioe_nas?=

This is inconsistent behavior.

Aside from that, I think the `header_encode` method should accept a `maxlinelen` argument that defaults to an appropriate value (probably 76), but which you can override at will.

This is (I think) also necessary because the `Header` class in `email.header` has a `maxlinelen` attribute that is used for the same purpose. Normally this works fine, but when you specify a charset for your header, `Header` uses the `Charset` class and the `maxlinelen` is lost. This happens here:

http://hg.python.org/cpython/file/3a1db0d2747e/Lib/email/header.py#l368

As you can see, `_encode_chunks` takes the `maxlinelen` argument but doesn't pass it on to the `header_encode` method of `charset` (which is a `Charset` instance).

As such, you can see this issue in action with the following snippet:

    from email.header import Header

    maxlinelen = 9999999

    print 'No charset:'
    print Header(
        u'asdfjk lasjdf sajdfl ajsdfaj sdlkfjas kfladjs flkajsdflk jsadklf jadslkfj adslkfj asdlkjf lksadjfkldas jfkldasj fkadsj fladsjf kladsjfk asdjfkldasasd kfaj  kfladsj fkadsjf asdf ',
        maxlinelen=maxlinelen
    ).encode()

    print 'Charset with special characters:'
    print Header(
        u'attachment; filename="ajdsklfj klasdjfkl asdjfkl jadsfja sdflkads fad fads adsf dasjfkl jadslkfj dlasf asd \u6211\u6211\u6211 jo \u6211\u6211 jo \u6211\u6211"',
        charset='utf-8',
        maxlinelen=maxlinelen
    ).encode()

Which will output:

    No charset:
    asdfjk lasjdf sajdfl ajsdfaj sdlkfjas kfladjs flkajsdflk jsadklf jadslkfj adslkfj asdlkjf lksadjfkldas jfkldasj fkadsj fladsjf kladsjfk asdjfkldasasd kfaj  kfladsj fkadsjf asdf
    Charset with special characters:
    =?utf-8?b?YXR0YWNobWVudDsgZmlsZW5hbWU9ImFqZHNrbGZqIGtsYXNkamZrbCBhc2RqZmts?=
     =?utf-8?b?IGphZHNmamEgc2RmbGthZHMgZmFkIGZhZHMgYWRzZiBkYXNqZmtsIGphZHNsa2Zq?=
     =?utf-8?b?IGRsYXNmIGFzZCDmiJHmiJHmiJEgam8g5oiR5oiRIGpvIOaIkeaIkSI=?=

This is currently an issue we're experiencing in Django, see our issue in the issue tracker:
https://code.djangoproject.com/ticket/20889#comment:4
msg212045 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-23 23:47
The line wrapping is done by Header, not header_encode.  The bug appears to be that maxlinelen=None is not passed to base64mime's header_encode the way it is to quoprimime's header_encode, and that base64mime doesn't handle a maxlinelen of None.  Using maxlinelen=9999999 in the base64mime.header_encode call, your base64 example also results in a single line header.

This should be fixed.  It does not affect python3, which uses a different folding algorithm.
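
To make the effect of the limit concrete, here is a minimal sketch of the kind of chunking involved. It is a simplified model written for illustration, not the code in `email.base64mime`; the function name `model_b64_header_encode` is hypothetical, and the example header is the one from the first message.

```python
from base64 import b64encode


def model_b64_header_encode(header_bytes, charset="utf-8", maxlinelen=76):
    """Simplified, hypothetical model of the chunking; not the stdlib function."""
    # Characters used by the RFC 2047 chrome: =?charset?b? ... ?=
    chrome_len = len("=?%s?b??=" % charset)
    max_encoded = maxlinelen - chrome_len
    # Base64 turns every 3 input bytes into 4 output characters.
    max_unencoded = max_encoded * 3 // 4
    words = []
    for start in range(0, len(header_bytes), max_unencoded):
        chunk = b64encode(header_bytes[start:start + max_unencoded])
        words.append("=?%s?b?%s?=" % (charset, chunk.decode("ascii")))
    # Encoded words are joined with a newline plus one space of folding whitespace.
    return "\n ".join(words)


header = (
    b"tejkstj tlkjes takldjf aseio neaoiflk asnfoieas nflkdan foeias "
    b"naskln ioeasn kldan flkansoie naslk dnaslk fndaslk fneoisaf "
    b"neklasn dfklasnf oiasenf lkadsn lkfanldk fas dfknaioe nas"
)

wrapped = model_b64_header_encode(header)                     # default limit
single = model_b64_header_encode(header, maxlinelen=9999999)  # large finite limit

print(len(wrapped.split("\n ")))
print(len(single.split("\n ")))
```

In this model the 180-byte example splits into four encoded words under the default limit and stays in one encoded word with the large limit. Note that `None` would fail in the subtraction, which is why a large finite value is needed.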
msg212072 - (view) Author: Rik (rednaw) Date: 2014-02-24 09:02
Line wrapping is indeed done by `Header`, but why do `base64mime` and `quoprimime` then have their own line wrapping? I assume so that you can also use them independently. So that's why I would think `Charset.header_encode` should also accept a `maxlinelen` so that you can use `Charset` independently too.
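
For comparison, Python 3's `Charset` has a `header_encode_lines(string, maxlengths)` method that takes an iterator of maximum line lengths, so something like the standalone use I have in mind can be sketched there. This is only a sketch for Python 3, not a proposal for the 2.7 API; `header_encode_wrapped` is a hypothetical helper name, and joining with a newline plus one space is my assumption about the desired folding.

```python
from itertools import repeat
from email.charset import Charset, BASE64


def header_encode_wrapped(charset, text, maxlinelen=76):
    """Hypothetical helper, not part of the email package (Python 3 only)."""
    # header_encode_lines pulls one maximum length per output line.
    lines = charset.header_encode_lines(text, repeat(maxlinelen))
    # A leading None means nothing fit on the first line; skip it here.
    return "\n ".join(line for line in lines if line is not None)


charset = Charset("utf-8")
charset.header_encoding = BASE64

text = "naskln ioeasn kldan " * 6

folded = header_encode_wrapped(charset, text)
unfolded = header_encode_wrapped(charset, text, maxlinelen=10000)

print(folded)
print(unfolded)
```

With the default limit the result should be folded into several encoded words of at most 76 characters each; with a large limit it should come back as a single encoded word.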
msg212093 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-24 13:26
I've no clue, to tell you the truth.  Those APIs evolved long before I took over email package maintenance.  And since we are talking about 2.7,  we can't change the existing API.  In Python3, Charset.header_encode will as of 3.5 become a legacy interface, so there's not much point in changing it there either, although it is not out of the question if there is a use case.
msg212098 - (view) Author: Rik (rednaw) Date: 2014-02-24 13:46
Ok, so you suggest using `maxlinelen=None` for `base64mime.header_encode`, which would act the same as giving `maxlinelen=None` to `quoprimime.header_encode`, so that we don't need to change the API?

And this change would then also be reflected in the Python 3.5 legacy interface?
msg212101 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-24 14:28
Well, we have to make base64mime.header_encode also handle a None value...so perhaps instead we should just use 10000, which is what the Header wrapping code in python3 does.

Python3's Header doesn't have this bug.
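
A quick way to check this on Python 3 is the sketch below, which reuses the header value from the original report. It assumes only the public `email.header.Header` API; the variable names are illustrative.

```python
from email.header import Header

value = (
    'attachment; filename="ajdsklfj klasdjfkl asdjfkl jadsfja sdflkads '
    'fad fads adsf dasjfkl jadslkfj dlasf asd \u6211\u6211\u6211 jo '
    '\u6211\u6211 jo \u6211\u6211"'
)

# Default maxlinelen: the encoded header is expected to be folded.
folded_default = Header(value, charset="utf-8").encode()

# Very large maxlinelen: the limit should be honoured, giving one line.
encoded = Header(value, charset="utf-8", maxlinelen=9999999).encode()

print("\n" in folded_default)
print("\n" in encoded)
```

If the large `maxlinelen` is respected, the second result contains no line break, unlike the 2.7 output shown in the report.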
msg212107 - (view) Author: Rik (rednaw) Date: 2014-02-24 15:32
Ok, do you think there's any risk in making `base64mime.header_encode` handle `maxlinelen=None`? I think it would be more consistent if `base64mime.header_encode` and `quoprimime.header_encode` interpret their arguments similarly.
msg212111 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-24 16:06
Well, there's the usual API change risk: something that works on 2.7.x doesn't work on 2.7.x-1.  So since we can fix the bug without making the API change, I think we should.
msg212112 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-24 16:09
That wasn't clear.  By "something that works" I mean exactly what you are talking about: someone writing code using these functions would naturally try to use None with base64mime, and if we make it work, that would work fine in 2.7.x, but mysteriously break if run on an earlier version of 2.7.  So instead we force the author of new code to use a non-None value that will in fact work in previous versions of 2.7.
History
Date User Action Args
2014-02-24 16:09:08 r.david.murray set messages: + msg212112
2014-02-24 16:06:48 r.david.murray set messages: + msg212111
2014-02-24 15:32:17 rednaw set messages: + msg212107
2014-02-24 14:28:42 r.david.murray set messages: + msg212101
2014-02-24 13:46:47 rednaw set messages: + msg212098
2014-02-24 13:26:05 r.david.murray set messages: + msg212093
2014-02-24 09:02:34 rednaw set messages: + msg212072
2014-02-23 23:47:24 r.david.murray set messages: + msg212045
2014-02-23 17:59:37 rednaw create