The email package should defer to the codecs module for all aliases #53144
Comments
Currently the email module maintains a set of "charset" aliases that it maps to codec names before looking up the codec in the codecs module. Ideally it should instead be able to just look up any 'charset' name and, if it is a valid alias for a codec, the codecs module would return the codec with the canonical name. It is possible (I haven't checked yet) that the email module needs a different canonical 'charset' name for certain codecs, but if so it can do that mapping after getting the canonical codec name from codecs. To implement this we need to make two simple changes: …
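The deferral being proposed already works at the lookup level: `codecs.lookup()` accepts any registered alias, and the returned `CodecInfo` carries the codec's canonical name. A minimal sketch of the idea (the helper name `canonical_charset` is invented for illustration):

```python
import codecs

def canonical_charset(name):
    """Resolve any codec alias to the codec's canonical name.

    Raises LookupError if the name is not a known codec or alias.
    """
    return codecs.lookup(name).name

# Every spelling of Latin-1 resolves to the same canonical name.
for alias in ("latin-1", "LATIN1", "iso-8859-1", "l1"):
    print(canonical_charset(alias))  # iso8859-1
```

Lookup is case-insensitive and normalizes hyphens, which is exactly why a separate alias table in email is largely redundant.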
Most of the aliases in email.charset.ALIASES fail to be recognized by the codecs module:

```pycon
>>> for i in email.charset.ALIASES.keys():
...     try:
...         codecs.lookup(i)
...     except LookupError:
...         print("Not recognized by codecs : alias {} mapped to {}".format(i, email.charset.ALIASES[i]))
...
Not recognized by codecs : alias latin-8 mapped to iso-8859-14
Not recognized by codecs : alias latin-9 mapped to iso-8859-15
Not recognized by codecs : alias latin-2 mapped to iso-8859-2
Not recognized by codecs : alias latin-3 mapped to iso-8859-3
<codecs.CodecInfo object for encoding iso8859-1 at 0x10160af58>
Not recognized by codecs : alias latin-6 mapped to iso-8859-10
Not recognized by codecs : alias latin-7 mapped to iso-8859-13
Not recognized by codecs : alias latin-4 mapped to iso-8859-4
Not recognized by codecs : alias latin-5 mapped to iso-8859-9
<codecs.CodecInfo object for encoding euc_jp at 0x1016260b8>
Not recognized by codecs : alias latin-10 mapped to iso-8859-16
<codecs.CodecInfo object for encoding ascii at 0x101626120>
Not recognized by codecs : alias latin_10 mapped to iso-8859-16
<codecs.CodecInfo object for encoding iso8859-1 at 0x10160aae0>
Not recognized by codecs : alias latin_2 mapped to iso-8859-2
Not recognized by codecs : alias latin_3 mapped to iso-8859-3
Not recognized by codecs : alias latin_4 mapped to iso-8859-4
Not recognized by codecs : alias latin_5 mapped to iso-8859-9
Not recognized by codecs : alias latin_6 mapped to iso-8859-10
Not recognized by codecs : alias latin_7 mapped to iso-8859-13
Not recognized by codecs : alias latin_8 mapped to iso-8859-14
Not recognized by codecs : alias latin_9 mapped to iso-8859-15
<codecs.CodecInfo object for encoding cp949 at 0x101626390>
<codecs.CodecInfo object for encoding euc_kr at 0x101626530>
```

So basically, apart from latin-1, all the latin* aliases fail to be recognized by codecs.
Shashwat Anand wrote:
> So basically apart from latin-1 all the latin* failed to be recognized by codecs.

We need to add aliases for those codecs. The current aliases …
latinN means latin1 to latin10?

```pycon
>>> codecs.lookup('latin_1')
<codecs.CodecInfo object for encoding iso8859-1 at 0x10160aae0>
```
Shashwat Anand wrote:
> latinN means latin1 to latin10?

Yes. We should add aliases for the format "latin_N" as well.
Yes, since that's the native name of the dedicated Python codec …
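Since the fix under discussion is just new entries in the alias table, it's worth noting that the table is the plain dict `encodings.aliases.aliases`, which the codec search machinery consults on each uncached lookup. A small demonstration with an invented alias (`my_latin` is not a real alias; values in the table are encoding *module* names, hence `latin_1`):

```python
import codecs
import encodings.aliases

# encodings.aliases.aliases is an ordinary dict mapping normalized
# alias -> encoding module name; new entries take effect on the next
# (uncached) codecs lookup.
encodings.aliases.aliases["my_latin"] = "latin_1"

print(codecs.lookup("my_latin").name)   # iso8859-1
print(b"caf\xe9".decode("my_latin"))    # café
```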
Too late for 3.2, will implement for 3.3.
The attached patch adds aliases for latin_N in encodings.aliases and fixes email.charset behaviour according to codecs.lookup, as requested. Am I supposed to add any unit tests? I'm wavering about where they should be placed (in encodings or email?).
The patch looks ok to me. Regarding the tests, I don't see tests for the aliases anywhere, so something like: …
Well, actually encodings.aliases links to the encoding _module name_ …

I've also slightly changed the imports, as PEP 8 says:

    No: import sys, os

Anyway, running the test failed for two encodings; there are two bugs there. Since they are really easy to fix, I haven't reported them to the bug tracker yet.
Michele Orrù wrote:
> …

mbcs is only available on Windows.

I'm not sure what happened here: either the alias entry is wrong … In either case, no one has complained about this encoding not working …
So, what do you prefer? Add a check for sys.platform, or just skip it?
Sorry, I was told that email to the bug tracker could not work properly.

I don't have such authority, and probably such a choice will require a discussion on python-dev. So I'm +1 for just skipping it for now (with an XXX comment on the right maybe).
Michele Orrù wrote:
>> - mcbs has something broken in its imports;
> mbcs is only available on Windows.
>
> So, what do you prefer? Add a check for sys.platform, or just skip it?

The test suite provides ways to implement known failures on … If that's too difficult, just use sys.platform.

Given the old discussion on the other ticket, it's fine to …
unittest.skip* are decorators, so they are useless in this case; also, AFAIS … I would suggest putting a try statement in encodings.mbcs, and raising an …

Something like: …
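The snippet that followed was lost in the migration. Since the skip decorators act on a whole test method, one plausible shape (a hypothetical test, not the actual proposal) skips at runtime with `self.skipTest`:

```python
import codecs
import sys
import unittest

class MbcsAliasTest(unittest.TestCase):
    def test_mbcs_lookup(self):
        # skipUnless/skip decorate the whole method; when only one
        # lookup is platform-specific, skipping at runtime is easier.
        if sys.platform != "win32":
            self.skipTest("mbcs is only available on Windows")
        self.assertEqual(codecs.lookup("mbcs").name, "mbcs")
```

On non-Windows platforms the test is reported as skipped rather than failed, which keeps the alias test suite green everywhere.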
I suggest to: …
Ezio Melotti wrote:
> …

+1
In the sense that the alias for 'tactis' should be removed in 2.7 and 3.2 as well?
euc_jp and euc_kr seem to be backward (that is, codecs translates them to the _ version, instead of translating the _ version to the - version). I worry that there might be other deviations from the standard email names.

I would suggest we pull the list of preferred MIME names from the IANA charset registry and make a test out of them in the email package. If changing the name returned by codecs is determined not to be acceptable, then those entries will need to remain in the charset module ALIASES table and the codecs-check logic adjusted accordingly.

Unfortunately the IANA registry does not list MIME names for all of the charsets in common use, and the canonical names are not always the ones commonly used in email. Hopefully the codecs registry is using the most common name for those, and hopefully if there are differences it won't break any user code, since any reasonable email code should be coping with the aliases in any case.

Ezio, if you want to steal this one from me, that's fine by me.
Hmm. Must have misread. Looks like all the common charsets do have MIME entries in the IANA table. |
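A network-free sketch of the cross-check described above: parse the registry text for names tagged as preferred MIME names and confirm each one is known to codecs. The three records below are excerpted by hand for illustration, not fetched from IANA, and the parsing function is invented here:

```python
import codecs
import re

# Hand-copied excerpt in the format of the IANA character-sets
# registry (https://www.iana.org/assignments/character-sets).
REGISTRY_EXCERPT = """\
Name: ANSI_X3.4-1968
Alias: iso-ir-6
Alias: US-ASCII (preferred MIME name)

Name: ISO_8859-1:1987
Alias: latin1
Alias: ISO-8859-1 (preferred MIME name)

Name: Shift_JIS (preferred MIME name)
Alias: MS_Kanji
"""

def preferred_mime_names(registry_text):
    """Extract every name or alias tagged as the preferred MIME name."""
    pattern = re.compile(
        r"^(?:Name|Alias):\s*(\S+)\s*\(preferred MIME name\)",
        re.MULTILINE)
    return pattern.findall(registry_text)

for mime_name in preferred_mime_names(REGISTRY_EXCERPT):
    # Each preferred MIME name should resolve to a Python codec.
    print(mime_name, "->", codecs.lookup(mime_name).name)
```

The inverse check (every Python canonical name maps back to its MIME name) would then run over whatever mapping table the email package ends up with.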
On second thought the resolution order ought to be swapped anyway: if the user has added an ALIAS, they are going to want that used, not the one from codecs. |
R. David Murray wrote:
> …

The way I understand the patch was that the email package will …
Well, it turns out that back when I opened this issue I misunderstood what the ALIASES table was used for. It *is* used before doing a codecs lookup, but it is also used to convert whatever charset name the programmer specifies into the standard MIME name for the codec when generating emails.

Clearly the email module needs to base its transformation on the IANA table. I think the ideal would be to have a program that pulls the IANA table and generates the ALIASES table. On the other hand, codecs should already have all of those aliases (this theoretical program could be used to ensure that), so another alternative is to use codecs to look up the "python canonical" name for the charset, and have the email ALIASES table just map the ones where that isn't the preferred MIME name into the MIME name.
After discussing on IRC, it turned out that the best choice would be to use normalize_encoding plus ALIASES, as the attached patch does.
What is not-a-charset?

I apparently misunderstood what normalize_encoding does. It isn't doing a lookup in the codecs registry and returning the canonical name for the codec. Does that mean we actually have to fetch the codec in order to get the canonical name? I suspect so, and that is probably OK, since in most cases the codec is eventually going to get called while processing the email that triggered the ALIASES lookup.

I also notice that there is a table of aliases in the codecs module documentation, so that will need to be updated as well.
R. David Murray wrote:
> …

As far as the aliases.py part of the patch goes, I'm fine with that …

Regarding using this table in the email package, I'm not really …

If you are looking for a way to determine whether Python has a codec …

If you want to avoid the actual codec module import (codecs.lookup() …) …

If you want to convert an arbitrary encoding name to a registered …

You'd have to add a new mime_alias map to the email package …

Hope that helps.
+1 …

What do you think? Ezio, David?
Well, my thought was to avoid having multiple charset alias lists in the stdlib, and reusing the one in codecs, which is larger than the one in email, seemed to make sense. This came up because a bug was reported where email (silently) failed to encode a string because the charset alias, while present in codecs, wasn't present in the email ALIASES table.

I suppose that as an alternative I could add full support for the IANA aliases list to email. Email is the most likely place to run into variant charset aliases anyway. If that's the way we go, then this issue should be changed over to covering just updating codecs with the missing aliases, and a new issue opened for adding full IANA alias support to email.
In that case, I could still take care of it; it would be really easy to do. So, it's up to you to tell me what is the best design choice. (:
R. David Murray wrote:
> …

I think it would be useful to have a mapping from the Python …

http://www.iana.org/assignments/character-sets

This mapping could also be added to the encodings package …

Note that we don't support all the aliases mentioned in the IANA …

Since we only rarely get requests for supporting new aliases or …
I agree that since we get very few requests to add aliases our current tables are probably what we want. So adding the MIME_preferred_name mapping *somewhere* is indeed what I would like to see happen. It doesn't matter to me whether it is in the codecs module or the email module.
Any idea about how to unittest mime.aliases? Also, since I've just created a new file, are there any bureaucratic issues? I mean, do I have to add something at the top of the file?
Michele Orrù wrote:
> Any idea about how to unittest mime.aliases?

Test the APIs you probably created for accessing it.

You just need to put the usual copyright line at the top of … Apart from that, you also need to make sure that the other build …
Your new file isn't in the patch. I'm imagining it is a table and a couple of methods, so I think perhaps putting it either in charset or in utils would be better than creating a new file.

As for testing it, what I'd love to see is a test that downloads the current IANA table (there are routines in test.support for doing this in a way that respects the test suite's 'resources' settings), pulls out the preferred MIME aliases, and makes sure that all of them are mapped to some canonical Python codec. Then you can invert that and make sure all of the results returned by that test map back to the correct MIME alias.
Prompted on IRC, I see I missed the file because it was so short. This still isn't what I'm looking for. We are assuming that email is going to use the codec eventually, so it is not a bad thing to have charset pre-populate the codec cache. So what I'm looking for is: …

MAL's idea was to implement the ALIASES step via a two-way mapping in the encodings module (python-canonical-name <=> MIME-preferred-name). That would be fine, too, but the email.charset logic should look like the above, however the table is implemented.
The second line in that try: block should have been:

    mime_name = ALIASES.get(python_name, python_name)
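Piecing that correction into the lookup order under discussion, the charset resolution might read as follows. The function name and the table entries are illustrative, not the actual patch:

```python
import codecs

# Illustrative table: canonical Python codec name -> preferred MIME
# name, for the cases where the two differ.
ALIASES = {
    "iso8859-1": "iso-8859-1",
    "euc_jp": "euc-jp",
    "euc_kr": "euc-kr",
}

def mime_charset(charset):
    """Return the preferred MIME name for an arbitrary charset string."""
    try:
        python_name = codecs.lookup(charset).name
        mime_name = ALIASES.get(python_name, python_name)
    except LookupError:
        # Unknown charset: RFC behaviour is to pass it through as-is.
        return charset
    return mime_name

print(mime_charset("latin-1"))   # iso-8859-1
print(mime_charset("EUC-JP"))    # euc-jp
```

codecs does the heavy lifting of alias resolution; the table only covers codecs whose canonical Python name is not the preferred MIME name.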
Is this silent error another bug to fix? |
Not in email5. The RFC says that if the charset parameter isn't known you just pass it through. In email6 we will be making a more careful distinction between errors that should be passed silently per the RFC, and ones that should be noisy because the API in question is being used to create the message ab initio. (In email5 the exact same machinery is used to create a message from parsed source as is used to create a message programmatically, resulting in the silent passing of certain errors that should really be noisy.)
This issue is not newcomer-friendly, so I'm removing the easy keyword.