The email package should defer to the codecs module for all aliases #53144

Open
bitdancer opened this issue Jun 4, 2010 · 39 comments
Labels
stdlib Python modules in the Lib dir topic-email type-feature A feature request or enhancement

Comments

@bitdancer
Member

BPO 8898
Nosy @malemburg, @warsaw, @vstinner, @ezio-melotti, @merwok, @bitdancer, @mmaker
Files
  • issue8898.patch
  • fail_tactis.txt
  • issue8898_withtests.patch
  • fail_mcbs.txt
  • issue8898_skip.patch
  • issue8898_normalize.patch
  • issue8898_2.patch
  • issue8898_3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2010-06-04.18:53:49.968>
    labels = ['type-feature', 'expert-email']
    title = 'The email package should defer to the codecs module for\tall aliases'
    updated_at = <Date 2019-07-29.12:01:12.918>
    user = 'https://github.com/bitdancer'

    bugs.python.org fields:

    activity = <Date 2019-07-29.12:01:12.918>
    actor = 'vstinner'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['email']
    creation = <Date 2010-06-04.18:53:49.968>
    creator = 'r.david.murray'
    dependencies = []
    files = ['22053', '22058', '22059', '22060', '22066', '22094', '22146', '22153']
    hgrepos = []
    issue_num = 8898
    keywords = ['patch']
    message_count = 39.0
    messages = ['107087', '107093', '107098', '107100', '107102', '124713', '136443', '136488', '136507', '136511', '136514', '136515', '136518', '136519', '136520', '136521', '136533', '136539', '136550', '136551', '136553', '136614', '136636', '136764', '136984', '136989', '136994', '136996', '136998', '136999', '137007', '137048', '137049', '137051', '137056', '137060', '137072', '137082', '348647']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'barry', 'vstinner', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'l0nwlf', 'maker']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'test needed'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue8898'
    versions = ['Python 3.3']

    @bitdancer
    Member Author

    Currently the email module maintains a set of "charset" aliases that it maps to codec names before looking up the codec in the codecs module. Ideally it should instead be able to just look up any 'charset' name, and if it is a valid alias for a codec, the codecs module would return the codec with the canonical name. It is possible (I haven't checked yet) that the email module needs a different canonical 'charset' name for certain codecs, but if so it can do that mapping after getting the canonical codec name from codecs.

    To implement this we need to make two simple changes:

    1. add any aliases the email module recognizes but the codecs module doesn't to the codecs module.

    2. rewrite email.charset so that it does not have an ALIASES table (but may have a smaller 'canonical charset map' table instead).
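As a rough illustration of the behaviour the description relies on, the sketch below asks the codecs module for the canonical name behind a few aliases. The helper name `canonical_codec_name` is made up for this example and is not part of the stdlib or of any attached patch.

```python
import codecs


def canonical_codec_name(charset):
    """Return Python's canonical codec name for charset, or None if unknown.

    Hypothetical helper: it only wraps codecs.lookup(), which resolves
    aliases and returns a CodecInfo whose .name is the canonical name.
    """
    try:
        return codecs.lookup(charset).name
    except LookupError:
        return None


for name in ("latin-1", "LATIN1", "utf8", "no-such-charset"):
    print(name, "->", canonical_codec_name(name))
```

Note that the canonical Python name (for example `iso8859-1`) is not necessarily the name that should appear in a MIME header, which is the complication discussed later in this thread.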

    @bitdancer bitdancer self-assigned this Jun 4, 2010
    @bitdancer bitdancer added stdlib Python modules in the Lib dir easy type-feature A feature request or enhancement labels Jun 4, 2010
    @l0nwlf
    Mannequin

    l0nwlf mannequin commented Jun 4, 2010

    Most of the aliases in email.charset.ALIASES are not recognized by the codecs module.

    >>> import codecs
    >>> import email.charset
    >>> for i in email.charset.ALIASES.keys():
    ...     try:
    ...         codecs.lookup(i)
    ...     except LookupError:
    ...         print("Not recognized by codecs : alias {} mapped to {}".format(i, email.charset.ALIASES[i]))
    ...     
    ... 
    Not recognized by codecs : alias latin-8 mapped to iso-8859-14
    Not recognized by codecs : alias latin-9 mapped to iso-8859-15
    Not recognized by codecs : alias latin-2 mapped to iso-8859-2
    Not recognized by codecs : alias latin-3 mapped to iso-8859-3
    <codecs.CodecInfo object for encoding iso8859-1 at 0x10160af58>
    Not recognized by codecs : alias latin-6 mapped to iso-8859-10
    Not recognized by codecs : alias latin-7 mapped to iso-8859-13
    Not recognized by codecs : alias latin-4 mapped to iso-8859-4
    Not recognized by codecs : alias latin-5 mapped to iso-8859-9
    <codecs.CodecInfo object for encoding euc_jp at 0x1016260b8>
    Not recognized by codecs : alias latin-10 mapped to iso-8859-16
    <codecs.CodecInfo object for encoding ascii at 0x101626120>
    Not recognized by codecs : alias latin_10 mapped to iso-8859-16
    <codecs.CodecInfo object for encoding iso8859-1 at 0x10160aae0>
    Not recognized by codecs : alias latin_2 mapped to iso-8859-2
    Not recognized by codecs : alias latin_3 mapped to iso-8859-3
    Not recognized by codecs : alias latin_4 mapped to iso-8859-4
    Not recognized by codecs : alias latin_5 mapped to iso-8859-9
    Not recognized by codecs : alias latin_6 mapped to iso-8859-10
    Not recognized by codecs : alias latin_7 mapped to iso-8859-13
    Not recognized by codecs : alias latin_8 mapped to iso-8859-14
    Not recognized by codecs : alias latin_9 mapped to iso-8859-15
    <codecs.CodecInfo object for encoding cp949 at 0x101626390>
    <codecs.CodecInfo object for encoding euc_kr at 0x101626530>

    So basically, apart from latin-1, all the latin* aliases failed to be recognized by codecs.
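The interactive session above can also be written as a small script. This is only a sketch: the function name `unrecognized_aliases` is invented here, and which aliases end up in the result depends on the Python version being run, so no output is shown.

```python
import codecs
import email.charset


def unrecognized_aliases(alias_table):
    """Return the entries of alias_table whose key codecs cannot resolve.

    alias_table maps an alias to the charset name email uses for it,
    like email.charset.ALIASES does.
    """
    missing = {}
    for alias, target in sorted(alias_table.items()):
        try:
            codecs.lookup(alias)
        except LookupError:
            missing[alias] = target
    return missing


missing = unrecognized_aliases(email.charset.ALIASES)
for alias, target in missing.items():
    print(f"Not recognized by codecs: alias {alias} mapped to {target}")
```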

    @malemburg
    Member

    Shashwat Anand wrote:
    > 
    > Shashwat Anand <anand.shashwat@gmail.com> added the comment:
    > 
    > from email.charset.ALIASES most of them failed to be recognize by codecs module.
    > 
    > 
    >>>> for i in email.charset.ALIASES.keys():
    > ...     try:
    > ...         codecs.lookup(i)
    > ...     except LookupError:
    > ...         print("Not recognized by codecs : alias {} mapped to {}".format(i, email.charset.ALIASES[i]))
    > ...     
    > ... 
    > Not recognized by codecs : alias latin-8 mapped to iso-8859-14
    > Not recognized by codecs : alias latin-9 mapped to iso-8859-15
    > Not recognized by codecs : alias latin-2 mapped to iso-8859-2
    > Not recognized by codecs : alias latin-3 mapped to iso-8859-3
    > <codecs.CodecInfo object for encoding iso8859-1 at 0x10160af58>
    > Not recognized by codecs : alias latin-6 mapped to iso-8859-10
    > Not recognized by codecs : alias latin-7 mapped to iso-8859-13
    > Not recognized by codecs : alias latin-4 mapped to iso-8859-4
    > Not recognized by codecs : alias latin-5 mapped to iso-8859-9
    > <codecs.CodecInfo object for encoding euc_jp at 0x1016260b8>
    > Not recognized by codecs : alias latin-10 mapped to iso-8859-16
    > <codecs.CodecInfo object for encoding ascii at 0x101626120>
    > Not recognized by codecs : alias latin_10 mapped to iso-8859-16
    > <codecs.CodecInfo object for encoding iso8859-1 at 0x10160aae0>
    > Not recognized by codecs : alias latin_2 mapped to iso-8859-2
    > Not recognized by codecs : alias latin_3 mapped to iso-8859-3
    > Not recognized by codecs : alias latin_4 mapped to iso-8859-4
    > Not recognized by codecs : alias latin_5 mapped to iso-8859-9
    > Not recognized by codecs : alias latin_6 mapped to iso-8859-10
    > Not recognized by codecs : alias latin_7 mapped to iso-8859-13
    > Not recognized by codecs : alias latin_8 mapped to iso-8859-14
    > Not recognized by codecs : alias latin_9 mapped to iso-8859-15
    > <codecs.CodecInfo object for encoding cp949 at 0x101626390>
    > <codecs.CodecInfo object for encoding euc_kr at 0x101626530>
    > 
    > 
    > So basically apart from latin-1 all the latin* failed to be recognized by codecs.

    We need to add aliases for those codecs. The current aliases
    list only supports the format "latinN" for N in 1-10.

    @malemburg malemburg changed the title The email package should defer to the codecs module for all aliases The email package should defer to the codecs module for all aliases Jun 4, 2010
    @l0nwlf
    Mannequin

    l0nwlf mannequin commented Jun 4, 2010

    > We need to add aliases for those codecs. The current aliases
    > list only supports the format "latinN" for N in 1-10.

    Does latinN mean latin1 to latin10?
    But latin_1 is a recognized alias.

    >>> codecs.lookup('latin_1')
    <codecs.CodecInfo object for encoding iso8859-1 at 0x10160aae0>

    @malemburg
    Member

    Shashwat Anand wrote:
    > Shashwat Anand <anand.shashwat@gmail.com> added the comment:
    >
    >> We need to add aliases for those codecs. The current aliases
    >> list only supports the format "latinN" for N in 1-10.
    >
    > latinN means latin1 to latin10 ?

    Yes. We should add aliases for the format "latin_N" as well.

    > But latin_1 is a recognized alias.
    >
    > >>> codecs.lookup('latin_1')
    > <codecs.CodecInfo object for encoding iso8859-1 at 0x10160aae0>

    Yes, since that's the native name of the dedicated Python codec
    for ISO-8859-1.

    @bitdancer
    Member Author

    Too late for 3.2, will implement for 3.3.

    @bitdancer bitdancer changed the title The email package should defer to the codecs module for all aliases The email package should defer to the codecs module for all aliases Dec 27, 2010
    @mmaker
    Mannequin

    mmaker mannequin commented May 21, 2011

    The attached patch adds aliases for latin_N in encodings.aliases, and fixes email.charset behaviour according to codecs.lookup, as requested.
    Tested on (Arch) Linux.

    Should I add unit tests? I'm unsure where they should be placed (in encodings or email?).

    @ezio-melotti
    Member

    The patch looks ok to me.
    AFAIU the lookup will take care to normalize the name and return latin_N. This also implies that other names (like 'latin-N', 'LaTiN~~N' and so on) will be normalized to latin_N and then accepted.

    Regarding the tests, I don't see tests for the aliases anywhere, so something like:

        for alias, codec_name in encodings.aliases.aliases.items():
            self.assertEqual(codecs.lookup(alias).name, codec_name)

    could be added somewhere to check that all the aliases in the dict map to the correct codec.

    @mmaker
    Mannequin

    mmaker mannequin commented May 22, 2011

    Well, actually encodings.aliases links to the encoding _module name_, as
    described in the doc:
    """ Encoding Aliases Support
    This module is used by the encodings package search function to
    map encodings names to module names.
    """
    So I've adjusted your snippet according to this, as you can see in the
    attachment.

    I've also slightly changed the imports as PEP-8 says:
    """
    Yes: import os
         import sys

    No:  import sys, os
    """

    Anyway, the test failed for two encodings, so there are indeed two bugs
    there:

    • mbcs has something broken in its imports;
    • tactis module is not present.

    Since they are really easy to fix, I haven't reported them to the bug tracker yet.
    Let me know what I should do.
    Should I post a bug and a patch on bugs.python.org? Should I add any new test
    specifically for the email module?

    @malemburg
    Member

    Michele Orrù wrote:
    > Michele Orrù <maker.py@gmail.com> added the comment:
    >
    > Well, actually encodings.aliases links to the encoding _module name_, as
    > described in the doc:
    > """ Encoding Aliases Support
    > This module is used by the encodings package search function to
    > map encodings names to module names.
    > """
    > So I've adjusted your snippet according to this, as you can see in the
    > attachment.
    >
    > I've also slightly changed the imports as PEP-8 says:
    > """
    > Yes: import os
    >      import sys
    >
    > No:  import sys, os
    > """
    >
    > Anyway, running the test failed for two encodings, there are two bugs there,
    > indeed.
    >
    > • mbcs has something broken in its imports;

    mbcs is only available on Windows.

    > • tactis module is not present.

    I'm not sure what happened here: either the alias entry is wrong
    or the codec module was not committed.

    In either case, no one has complained about this encoding not working,
    so we can probably just remove it from the alias table. See
    http://bugs.python.org/issue1251921 for a similar report and
    discussion.

    @malemburg malemburg changed the title The email package should defer to the codecs module for all aliases The email package should defer to the codecs module for all aliases May 22, 2011
    @mmaker
    Mannequin

    mmaker mannequin commented May 22, 2011

    So, what do you prefer? Add a check for sys.platform, or just skip it?

    discussion on python-dev. So I'm +1 for just skipping it for now (with a XXX
    comment on the right maybe).

    @mmaker mmaker mannequin changed the title The email package should defer to the codecs module for all aliases The email package should defer to the codecs module for all aliases May 22, 2011
    @mmaker
    Mannequin

    mmaker mannequin commented May 22, 2011

    Sorry, I was told that emailing the bug tracker might not work properly.

    >> - mbcs has something broken in its imports;
    >
    > mbcs is only available on Windows.

    So, what do you prefer? Add a check for sys.platform, or just skip it?

    >> - tactis module is not present.
    >
    > I'm not sure what happened here: either the alias entry is wrong
    > or the codec module was not committed.
    >
    > In either case, no one has complained about this encoding not working,
    > so we can probably just remove it from the alias table. See
    > http://bugs.python.org/issue1251921 for a similar report and
    > discussion.

    I don't have such authority, and such a choice will probably require a discussion on python-dev. So I'm +1 for just skipping it for now (maybe with an XXX comment on the right).

    @malemburg
    Member

    Michele Orrù wrote:
    > 
    > Michele Orrù <maker.py@gmail.com> added the comment:
    >
    > Sorry, I was told that emailing the bug tracker might not work properly.
    >
    >>> - mbcs has something broken in its imports;
    >
    >> mbcs is only available on Windows.
    >
    > So, what do you prefer? Add a check for sys.platform, or just skip it?

    The test suite provides ways to implement known failures on
    specific platforms, so I'd suggest using those mechanisms.
    I've never used those, so I can't comment on how much work it is
    to use them.

    If that's too difficult, just use sys.platform.

    >>> - tactis module is not present.
    >
    >> I'm not sure what happened here: either the alias entry is wrong
    >> or the codec module was not committed.
    >>
    >> In either case, no one has complained about this encoding not working,
    >> so we can probably just remove it from the alias table. See
    >> http://bugs.python.org/issue1251921 for a similar report and
    >> discussion.
    >
    > I don't have such authority, and such a choice will probably require a discussion on python-dev. So I'm +1 for just skipping it for now (maybe with an XXX comment on the right).

    Given the old discussion on the other ticket, it's fine to
    remove the alias entry:

    # tactis codec
    'tis260'             : 'tactis',
    

    @malemburg malemburg changed the title The email package should defer to the codecs module for all aliases The email package should defer to the codecs module for all aliases May 22, 2011
    @mmaker
    Mannequin

    mmaker mannequin commented May 22, 2011

    unittest.skip* are decorators, so they are useless in this case; also, AFAIS
    Lib/test/ uses sys.platform.

    I would suggest putting a try statement in encodings.mbcs and raising an
    error if the imported modules are not found.
    But this is another story.

    @mmaker mmaker mannequin changed the title The email package should defer to the codecs module for all aliases The email package should defer to the codecs module for all aliases May 22, 2011
    @ezio-melotti
    Member

    Something like:

        if name == 'mbcs' and not sys.platform.startswith('win'):
            continue

    should be enough.
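Put together with the alias test discussed earlier, the check suggested above might look roughly like this. It is a sketch and not the test from the attached patches: `broken_aliases` is a hypothetical helper, and it collects failures instead of asserting so that it runs on any platform and Python version.

```python
import codecs
import sys
from encodings.aliases import aliases  # alias -> codec module name


def broken_aliases():
    """Return (alias, module_name) pairs that codecs.lookup() cannot resolve."""
    broken = []
    for alias, module_name in sorted(aliases.items()):
        # mbcs only exists on Windows, so skip it elsewhere.
        if module_name == 'mbcs' and not sys.platform.startswith('win'):
            continue
        try:
            codecs.lookup(alias)
        except LookupError:
            broken.append((alias, module_name))
    return broken


for alias, module_name in broken_aliases():
    print(f"alias {alias!r} points at missing codec module {module_name!r}")
```

A real test in Lib/test would presumably turn the collected list into an assertion that it is empty.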

    @ezio-melotti
    Member

    I suggest to:

    1. remove the alias for tactis;

    2. add the aliases for latin_* and the tests for the aliases;

    3. fix the email.charset to use the new aliases instead of its own dict.

    2) and 3) should go on 3.3 only; 1) could be considered a bug and fixed on 2.7/3.2 too, but since the codec is already missing, removing the alias won't change anything (i.e. it will raise a LookupError with or without the alias).

    @malemburg
    Member

    Ezio Melotti wrote:
    > Ezio Melotti <ezio.melotti@gmail.com> added the comment:
    >
    > I suggest to:
    >
    > 1. remove the alias for tactis;
    >
    > 2. add the aliases for latin_* and the tests for the aliases;
    >
    > 3. fix the email.charset to use the new aliases instead of its own dict.
    >
    > 2) and 3) should go on 3.3 only; 1) could be considered a bug and fixed on 2.7/3.2 too, but since the codec is already missing, removing the alias won't change anything (i.e. it will raise a LookupError with or without the alias).

    +1

    @malemburg malemburg changed the title The email package should defer to the codecs module for all aliases The email package should defer to the codecs module for all aliases May 22, 2011
    @mmaker
    Mannequin

    mmaker mannequin commented May 22, 2011

    Do you mean that the alias for 'tactis' should also be removed in 2.7 and 3.2?

    @mmaker mmaker mannequin changed the title The email package should defer to the codecs module for all aliases The email package should defer to the codecs module for all aliases May 22, 2011
    @bitdancer
    Member Author

    euc_jp and euc_kr seem to be backward (that is, codecs translates them to the _ version, instead of translating the _ version to the - version). I worry that there might be other deviations from the standard email names. I would suggest we pull the list of preferred MIME names from the IANA charset registry and make a test out of them in the email package. If changing the name returned by codecs is determined to not be acceptable, then those entries will need to remain in the charset module ALIASES table and the codecs-check logic adjusted accordingly.

    Unfortunately the IANA registry does not list MIME names for all of the charsets in common use, and the canonical names are not always the ones commonly used in email. Hopefully the codecs registry is using the most common name for those, and hopefully if there are differences it won't break any user code, since any reasonable email code should be coping with the aliases in any case.

    Ezio, if you want to steal this one from me, that's fine by me.

    @bitdancer
    Member Author

    Hmm. Must have misread. Looks like all the common charsets do have MIME entries in the IANA table.

    @bitdancer
    Member Author

    On second thought the resolution order ought to be swapped anyway: if the user has added an ALIAS, they are going to want that used, not the one from codecs.

    @malemburg
    Member

    R. David Murray wrote:
    > R. David Murray <rdmurray@bitdance.com> added the comment:
    >
    > euc_jp and euc_kr seem to be backward (that is, codecs translates them to the _ version, instead of translating the _ version to the - version). I worry that there might be other deviations from the standard email names. I would suggest we pull the list of preferred MIME names from the IANA charset registry and make a test out of them in the email package. If changing the name returned by codecs is determined to not be acceptable, then those entries will need to remain in the charset module ALIASES table and the codecs-check logic adjusted accordingly.
    >
    > Unfortunately the IANA registry does not list MIME names for all of the charsets in common use, and the canonical names are not always the ones commonly used in email. Hopefully the codecs registry is using the most common name for those, and hopefully if there are differences it won't break any user code, since any reasonable email code should be coping with the aliases in any case.

    As I understand the patch, the email package will
    start to use the encoding aliases for determining the codec
    name instead of its own list. That is: only for decoding the
    input data, not for creating a correct MIME encoding name in
    output data.

    @malemburg malemburg changed the title The email package should defer to the codecs module for all aliases The email package should defer to the codecs module for all aliases May 23, 2011
    @bitdancer
    Member Author

    Well, it turns out that back when I opened this issue I misunderstood what the ALIASES table was used for. It *is* used before doing a codecs lookup, but it is also used to convert whatever charset name the programmer specifies into the standard MIME name for the codec when generating emails.

    Clearly the email module needs to base its transformation on the IANA table. I think the ideal would be to have a program that pulls the IANA table and generates the ALIASES table. On the other hand, codecs should already have all of those aliases (this theoretical program could be used to ensure that), so another alternative is to use codecs to look up the "python canonical" name for the charset, and have the email ALIASES table just map the ones where that isn't the preferred MIME name into the MIME name.

    @mmaker
    Mannequin

    mmaker mannequin commented May 24, 2011

    After discussing on IRC, we figured out that the best choice would be to use normalize_encoding plus ALIASES, as the attached patch does.

    @bitdancer
    Member Author

    What is not-a-charset?

    I apparently misunderstood what normalize_encoding does. It isn't doing a lookup in the codecs registry and returning the canonical name for the codec. Does that mean we actually have to fetch the codec in order to get the canonical name? I suspect so, and that is probably OK, since in most cases the codec is eventually going to get called while processing the email that triggered the ALIASES lookup.

    I also notice that there is a table of aliases in the codecs module documentation, so that will need to be updated as well.
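The difference described above can be seen in a few lines. This is only a sketch: `encodings.normalize_encoding()` tidies up the spelling of a name without consulting the codec registry, while `codecs.lookup()` actually finds (and imports) the codec. The charset name `no-such-charset` is just a made-up example of an unknown name.

```python
import codecs
from encodings import normalize_encoding

# normalize_encoding() only rewrites punctuation; it never checks
# whether a codec with that name exists.
print(normalize_encoding("latin-1"))
print(normalize_encoding("no-such-charset"))

# codecs.lookup() resolves aliases and returns the canonical name,
# or raises LookupError for names it does not know.
print(codecs.lookup("latin-1").name)
try:
    codecs.lookup("no-such-charset")
except LookupError as exc:
    print("lookup failed:", exc)
```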

    @malemburg
    Member

    R. David Murray wrote:
    > R. David Murray <rdmurray@bitdance.com> added the comment:
    >
    > What is not-a-charset?
    >
    > I apparently misunderstood what normalize_encoding does. It isn't doing a lookup in the codecs registry and returning the canonical name for the codec. Does that mean we actually have to fetch the codec in order to get the canonical name? I suspect so, and that is probably OK, since in most cases the codec is eventually going to get called while processing the email that triggered the ALIASES lookup.
    >
    > I also notice that there is a table of aliases in the codecs module documentation, so that will need to be updated as well.

    As far as the aliases.py part of the patch goes, I'm fine with that
    since it corrects a few real bugs and adds the missing Latin-N
    codec names.

    Regarding using this table in the email package, I'm not really
    clear on what you want to achieve.

    If you are looking for a way to determine whether Python has a codec
    installed for a certain charset name, then codecs.lookup() will
    tell you this (and it also applies all the aliasing and normalization
    needed).

    If you want to avoid the actual codec module import (codecs.lookup()
    imports the module), you can mimic the logic used by the lookup function
    of the encodings package. I'm not sure whether that's worth it, though,
    since it is rather likely that you're going to use the codec you've
    just looked up soon after the test, and codecs.lookup() caches the
    codecs it finds.

    If you want to convert an arbitrary encoding name to a registered
    standard IANA MIME charset name, then the aliases.py module is not
    going to be of much help, since we are using our own canonical
    names which do not necessarily map to the MIME charset names.

    You'd have to add a new mime_alias map to the email package
    for that. I'd suggest to use the same approach as for the
    aliases.py module, which is to first normalize the encoding
    name using normalize_encoding() and then running that through
    the mime_alias map.

    Hope that helps.

    @mmaker
    Mannequin

    mmaker mannequin commented May 26, 2011

    +1

    What do you think? Ezio, David?

    @mmaker mmaker mannequin changed the title The email package should defer to the codecs module for all aliases The email package should defer to the codecs module for all aliases May 26, 2011
    @bitdancer
    Member Author

    Well, my thought was to avoid having multiple charset alias lists in the stdlib, and reusing the one in codecs, which is larger than the one in email, seemed to make sense. This came up because a bug was reported where email (silently) failed to encode a string because the charset alias, while present in codecs, wasn't present in the email ALIASES table.

    I suppose that as an alternative I could add full support for the IANA aliases list to email. Email is the most likely place to run into variant charset aliases anyway.

    If that's the way we go, then this issue should be changed over to covering just updating codecs with the missing aliases, and a new issue opened for adding full IANA alias support to email.

    @mmaker
    Mannequin

    mmaker mannequin commented May 26, 2011

    In that case, I could still take care of it; it would be really easy to do.

    So, it's up to you to tell me what is the best design choice. (:

    @malemburg
    Member

    R. David Murray wrote:
    > R. David Murray <rdmurray@bitdance.com> added the comment:
    >
    > Well, my thought was to avoid having multiple charset alias lists in the stdlib, and reusing the one in codecs, which is larger than the one in email, seemed to make sense. This came up because a bug was reported where email (silently) failed to encode a string because the charset alias, while present in codecs, wasn't present in the email ALIASES table.
    >
    > I suppose that as an alternative I could add full support for the IANA aliases list to email. Email is the most likely place to run in to variant charset aliases anyway.
    >
    > If that's the way we go, then this issue should be changed over to covering just updating codecs with the missing aliases, and a new issue opened for adding full IANA alias support to email.

    I think it would be useful to have a mapping from the Python
    canonical name (the one the encodings package uses) to the
    "preferred MIME name" as referenced in the IANA list:

    http://www.iana.org/assignments/character-sets

    This mapping could also be added to the encodings package
    together with a function that translates a given encoding
    name to its canonical Python name (codec_module_name())
    and another one to translate it to the "preferred MIME name"
    according to the above list (encoding_mime_name()).

    Note that we don't support all the aliases mentioned in the IANA
    list because many of them are outdated and some have proved to be
    wrong (the aliased encodings are actually different in a few
    places). There are also a few encodings in the list which we
    don't support at all.

    Since we only rarely get requests for supporting new aliases or
    encodings, I think it's safe to say that the existing set
    is fairly complete from a practical point of view.
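A minimal sketch of the two functions proposed in this comment might look as follows. Everything here is an assumption made for illustration: the function names come from the comment but no such functions exist in the encodings package, the table is a tiny hand-picked sample rather than data generated from the IANA list, `CodecInfo.name` is taken as the "canonical Python name", and the MIME names are written in lowercase the way email.charset writes them.

```python
import codecs

# Hypothetical sample table: Python canonical codec name -> preferred MIME name.
PREFERRED_MIME_NAMES = {
    "iso8859-1": "iso-8859-1",
    "euc_jp": "euc-jp",
    "euc_kr": "euc-kr",
    "ascii": "us-ascii",
}


def codec_module_name(encoding):
    """Translate any known encoding name to Python's canonical codec name.

    Raises LookupError for unknown encodings, like codecs.lookup() does.
    """
    return codecs.lookup(encoding).name


def encoding_mime_name(encoding):
    """Translate any known encoding name to its preferred MIME name.

    Falls back to the Python canonical name when the table has no entry.
    """
    python_name = codec_module_name(encoding)
    return PREFERRED_MIME_NAMES.get(python_name, python_name)


# The table is only usable as a two-way mapping if each MIME name
# resolves back to the Python name it was listed under.
for python_name, mime_name in PREFERRED_MIME_NAMES.items():
    assert codec_module_name(mime_name) == python_name

print(encoding_mime_name("latin_1"))
print(encoding_mime_name("EUC_JP"))
```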

    @malemburg malemburg changed the title The email package should defer to the codecs module for all aliases The email package should defer to the codecs module for all aliases May 26, 2011
    @bitdancer
    Member Author

    I agree that since we get very few requests to add aliases our current tables are probably what we want. So adding the MIME_preferred_name mapping *somewhere* is indeed what I would like to see happen. It doesn't matter to me whether it is in the codecs module or the email module.

    @mmaker
    Mannequin

    mmaker mannequin commented May 27, 2011

    Any idea about how to unittest mime.aliases?

    Also, since I've just created a new file, are there some bureaucratic issues? I mean, do I have to add something at the top of the file?
    (I'm just signing the Contributor Agreement)

    @malemburg
    Member

    Michele Orrù wrote:
    > Michele Orrù <maker.py@gmail.com> added the comment:
    >
    > Any idea about how to unittest mime.aliases?

    Test the APIs you probably created for accessing it.

    > Also, since I've just created a new file, are there some bureaucratic issues? I mean, do I have to add something at the top of the file?
    > (I'm just signing the Contributor Agreement)

    You just need to put the usual copyright line at the top of
    the file, together with the sentence from the agreement.

    Apart from that, you also need to make sure that the other build
    setups include the new file (PCbuild, Makefile.pre.in, etc.). If you
    don't know how to do this, you can ask someone else to take
    care of this, since it usually requires domain knowledge (e.g.
    to add the file to the Windows builds).

    @bitdancer
    Member Author

    Your new file isn't in the patch. I'm imagining it is a table and a couple methods, so I think perhaps putting it either in charset or in utils would be better than creating a new file.

    As for testing it, what I'd love to see is a test that downloads the current IANA table (there are routines in test.support for doing this in a way that respects the test suite's 'resources' settings), pulls out the preferred MIME aliases, and makes sure that all of them are mapped to some canonical Python codec. Then you can invert that and make sure all of the results returned by that test map back to the correct MIME alias.

    @bitdancer
    Member Author

    Prompted on IRC, I see I missed the file because it was so short.

    This still isn't what I'm looking for. We are assuming that email is going to use the codec eventually so that it is not a bad thing to have charset pre-populate the codec cache. So what I'm looking for is:

    try:
        python_name = codecs.lookup(input_charset).name
        mime_name = ALIASES.get(python_name, input_charset)
    except LookupError:
        mime_name = input_charset
    

    MAL's idea was to implement the ALIASES step via a two-way mapping in the encodings module (python-canonical-name <=> MIME-preferred-name). That would be fine, too, but the email.charset logic should look like the above however the table is implemented.

    @bitdancer
    Member Author

    The second line in that try: block should have been:

      mime_name = ALIASES.get(python_name, python_name)
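With that correction applied, the logic can be wrapped in a function and tried out. This is a sketch only: the `ALIASES` table below is a small hypothetical stand-in keyed by Python canonical names, not the real email.charset.ALIASES, and `mime_charset_name` is a name invented for this example.

```python
import codecs

# Hypothetical stand-in table: Python canonical codec name -> MIME charset name.
ALIASES = {
    "iso8859-1": "iso-8859-1",
    "euc_jp": "euc-jp",
    "euc_kr": "euc-kr",
    "ascii": "us-ascii",
}


def mime_charset_name(input_charset, aliases=ALIASES):
    """Map a user-supplied charset name to the name to use in MIME headers."""
    try:
        # Resolving the codec also populates the codec cache, which is
        # acceptable because email is expected to use the codec soon after.
        python_name = codecs.lookup(input_charset).name
    except LookupError:
        # Unknown charsets are passed through unchanged.
        return input_charset
    return aliases.get(python_name, python_name)


for name in ("latin_1", "EUC_KR", "utf8", "x-unknown-charset"):
    print(name, "->", mime_charset_name(name))
```

Passing the table in as a parameter is one way to let a caller supply extra entries, in the spirit of the earlier remark that user-added aliases should take precedence.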

    @merwok
    Member

    merwok commented May 27, 2011

    > email (silently) failed to encode a string

    Is this silent error another bug to fix?

    @bitdancer
    Member Author

    Not in email5. The RFC says that if the charset parameter isn't known you just pass it through. In email6 we will be making a more careful distinction between errors that should be passed silently per the RFC, and ones that should be noisy because the API in question is being used to create the message ab initio. (In email5 the exact same machinery is used to create a message from parsed source as is used to create a message programmatically, resulting in the silent passing of certain errors that should really be noisy.)

    @bitdancer bitdancer added topic-email and removed stdlib Python modules in the Lib dir labels May 24, 2012
    @bitdancer bitdancer removed their assignment May 24, 2012
    @vstinner
    Member

    This issue is not newcomer-friendly, so I'm removing the easy keyword.

    @vstinner vstinner removed the easy label Jul 29, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 23, 2023