Issue843590
Created on 2003-11-17 09:29 by zenzen, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files | ||
---|---|---|
File name | Uploaded | Description |
compare.pl | gagern, 2009-02-08 18:56 | Script to compare charset definitions. |
issue843590_rfc.patch | gagern, 2010-01-15 19:23 | encoding as the RFC defines it |
issue843590_alias.patch | gagern, 2010-01-15 19:36 | macintosh as alias to mac_roman |
Messages (18) | |||
---|---|---|---|
msg61134 - (view) | Author: Stuart Bishop (zenzen) | Date: 2003-11-17 09:29 | |
OS X's Mail.app can generate Subject lines like: Subject: =?MACINTOSH?B?vLu7vMGqo6KwpKalu7w=?= (which decodes to '\xbc\xbb\xbb\xbc\xc1\xaa\xa3\xa2\xb0\xa4\xa6\xa5\xbb\xbc'). This appears to be what Python calls the mac_roman encoding. I suggest adding 'macintosh' as an alias to 'mac_roman' to encodings/aliases.py to allow the email package to decode these headers. |
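The report above can be reproduced with a short Python sketch. It assumes a Python version in which 'macintosh' resolves as a codec name; on versions without the alias the lookup raises LookupError, and the sketch then falls back to 'mac_roman', which is the reporter's assumption about the intended encoding. The variable names are illustrative only.

```python
from email.header import decode_header

raw_subject = "=?MACINTOSH?B?vLu7vMGqo6KwpKalu7w=?="

# decode_header() only undoes the base64 layer of the encoded word; it
# returns (bytes, charset) pairs and leaves charset decoding to the caller.
(payload, charset), = decode_header(raw_subject)

try:
    text = payload.decode(charset)
except LookupError:
    # Assumption from the report: mac_roman is the codec that was meant.
    text = payload.decode("mac_roman")

print(payload)
print(text)
```

If the charset name is unknown to the codec registry, the first decode attempt is where the email package would fail, which is the behaviour this issue asks to fix.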
|||
msg61135 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2003-11-17 10:12 | |
Are you sure? The decoded string you give does not look like anything readable... |
|||
msg61136 - (view) | Author: Stuart Bishop (zenzen) | Date: 2003-11-17 10:47 | |
The test was just a sequence of random high-bit characters: ºªªº¡™£¢?§¶•ªº (let's see if the web interface lets that through). |
|||
msg61137 - (view) | Author: Jens Klein (yenzenz) | Date: 2004-12-18 22:49 | |
+1 from me. Archetypes (a Zope framework) also runs into a problem because of the missing alias. More info: https://sourceforge.net/tracker/index.php?func=detail&aid=1068001&group_id=75272&atid=543430 |
|||
msg61138 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-12-18 23:01 | |
I have no problem adding aliases to the encodings package, but please provide some reference showing that this actually is a valid alias for the mac_roman encoding. There are quite a few other mac_* encodings to choose from as well, so the choice is not obvious to me. |
|||
msg61139 - (view) | Author: Jens Klein (yenzenz) | Date: 2004-12-19 20:09 | |
It seems it's a bit more difficult: the encoding 'macintosh' is registered by IANA [1] (nicely formatted in [2]) and is covered by RFC 1345 [3]. Name: macintosh [RFC1345,KXS2]; MIBenum: 2027; Source: The Unicode Standard ver1.0, ISBN 0-201-56788-1, Oct 1991; Alias: mac; Alias: csMacintosh. [1] http://www.iana.org/assignments/character-sets [2] http://www.cs.tut.fi/~jkorpela/chars/sorted.html [3] http://www.faqs.org/rfcs/rfc1345.html So much for the hard facts from the specification's point of view. None of these specs mention mac_roman etc. So what? In [4], in the popular program 'recode', I found a hint of the alias. The author there uses the IANA-registered macintosh as an alias for mac_roman: DEFENCODING(( "MacRoman", /* JDK 1.1 */ /* This is the best table for MACINTOSH. The ones */ /* in glibc and FreeBSD-iconv are bad quality. */ "MACINTOSH", /* IANA */ "MAC", /* IANA */ "csMacintosh", /* IANA */ ), mac_roman, { mac_roman_mbtowc }, { mac_roman_wctomb, NULL }) [4] http://recode.progiciels-bpi.ca/showfile.html?name=fusion/recode-3.6/libiconv/encodings.def Because of that (I trust recode somehow) I would propose adding macintosh as an alias for mac_roman. |
|||
msg61140 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-12-20 10:38 | |
Thanks for the research. Since the "macintosh" character set is defined in RFC 1345 and the mac_roman encoding in ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT, could you compare the two and check whether they are in fact the same mapping? Note: Aliases for mappings are often implemented in a rather careless way - we want to make sure that things we alias are indeed correct aliases. Otherwise it would be better to add a new codec for 'macintosh'. Thanks. |
|||
msg81407 - (view) | Author: Martin von Gagern (gagern) | Date: 2009-02-08 18:56 | |
My first indication to use "macintosh" rather than "mac_roman" came from Wikipedia, http://en.wikipedia.org/wiki/Mac_OS_Roman, which states that the charset part of a MIME content-type specification should be macintosh. I'm not quoting this as any kind of authority, but rather to point out that people are likely to use this. I did a comparison of http://tools.ietf.org/rfc/rfc1345.txt (RFC) and ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT (UNI) using the attached perl script. The results: 3 codepoints unused in RFC but defined in UNI: f0, f6, f7; 1 codepoint unused in UNI but defined in RFC: 7f; 2 codepoints with slightly different character names, same meaning; 9 codepoints with actually different definitions: a5: rfc 2219 BULLET OPERATOR, uni 2022 BULLET; c4: rfc e023 DUTCH GUILDER SIGN (IBM437 159), uni 0192 LATIN SMALL LETTER F WITH HOOK; c6: rfc 0394 GREEK CAPITAL LETTER DELTA, uni 2206 INCREMENT; c9: rfc 22ef MIDLINE HORIZONTAL ELLIPSIS, uni 2026 HORIZONTAL ELLIPSIS; d0: rfc 2014 EM DASH, uni 2013 EN DASH; d1: rfc 2013 EN DASH, uni 2014 EM DASH; d7: rfc 25c6 BLACK DIAMOND, uni 25ca LOZENGE; db: rfc 00a4 CURRENCY SIGN, uni 20ac EURO SIGN; f8: rfc 203e OVERLINE, uni 00af MACRON. a5 and c6 could be different interpretations of symbols that look pretty much the same. The introduction of the euro sign instead of the generic currency sign seems to be a recent modification documented in UNI. The swapped order of the dashes seems really confusing. Notice also this line in the RFC: &rem source: The Unicode Standard ver1.0, ISBN 0-201-56788-1, Oct 1991. So it looks like the RFC used the Unicode definition as its source. Which part of it I'm not sure, and where the differences come from I'm even less sure. My next steps: * Look for further references, e.g. from Apple, and compare them as well * Try some things out on a Mac, see how it behaves in real life * Compare all this to the current Python implementation * Write a patch to either provide an alias or a new charset "macintosh". Help welcome. |
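One of the next steps listed above, comparing the findings to Python's own implementation, can be sketched as follows. The table is copied from the nine differing codepoints in the message; the helper name `classify` is hypothetical, and the sketch only assumes that Python ships a `mac_roman` codec.

```python
# (byte, RFC 1345 code point, Apple ROMAN.TXT code point), copied from the
# list of differing definitions reported in the message above.
DIFFERENCES = [
    (0xA5, 0x2219, 0x2022),
    (0xC4, 0xE023, 0x0192),
    (0xC6, 0x0394, 0x2206),
    (0xC9, 0x22EF, 0x2026),
    (0xD0, 0x2014, 0x2013),
    (0xD1, 0x2013, 0x2014),
    (0xD7, 0x25C6, 0x25CA),
    (0xDB, 0x00A4, 0x20AC),
    (0xF8, 0x203E, 0x00AF),
]


def classify(byte, rfc_cp, uni_cp, codec="mac_roman"):
    """Report which of the two tables the codec agrees with for one byte."""
    actual = ord(bytes([byte]).decode(codec))
    if actual == uni_cp:
        return "uni"
    if actual == rfc_cp:
        return "rfc"
    return "neither"


results = {byte: classify(byte, rfc, uni) for byte, rfc, uni in DIFFERENCES}
for byte, verdict in results.items():
    print(f"{byte:02x}: {verdict}")
```

A verdict of "uni" for every byte would indicate that Python's `mac_roman` follows the Apple/Unicode table rather than the RFC 1345 one for these positions.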
|||
msg82784 - (view) | Author: Martin von Gagern (gagern) | Date: 2009-02-26 23:06 | |
I did some further investigation here. Apple doesn't seem likely to offer any authoritative reference for the "macintosh" encoding, because all they ever seem to talk about is "Roman". The only source for "macintosh" I could find is RFC 1345, with the listed differences. The RFC states the Unicode 1.0 standard as its source. Yesterday I went to the library and thumbed through that volume. That, too, talks about the different Macintosh encodings, one of which is called "Roman" and matches the one from current Unicode standards, except for 0xdb, which used to be the currency sign back then but is the euro sign now. On 2009-02-09 I also tried to ask Keld Simonsen, the author of the RFC, about this whole issue. I have had no reply so far. On the whole, I get the impression that the "macintosh" encoding from RFC 1345 is pretty much without actual use. I see no real-world application that actually uses it as it is defined, as most users intend it as the IANA-registered name for mac-roman. Python has two options, I believe. We could either do this by the book and implement the encoding as it was defined, even though there is no known real-world application of that exact charset. Or we could be pragmatic and postulate that the RFC is simply wrong and that every real-world occurrence of "macintosh" intends to refer to mac-roman, in which case an alias would be appropriate. I would say, let's be pragmatic. When converting from Unicode to macintosh, it might be possible to accommodate both mappings, and in this way avoid unmappable characters. As this doesn't deal well with the switched dashes, I guess I'd rather not do this, in order to keep subtle issues from going undetected. It might be a good idea, however, to map both the currency sign and the euro sign to the same byte, and choose one when mapping back to Unicode. I don't think I can contribute much more information to this issue, and seeing as it has been open for years without much input, I take it neither will others. So I guess it is time to make a choice based on the information available. By the book, or pragmatic? |
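The idea of mapping both the currency sign and the euro sign to the same byte can be illustrated with a small sketch. `encode_lenient` is a hypothetical helper written for this illustration; it is not part of Python or of either attached patch, and it assumes that the codec places the euro sign at 0xdb, as the UNI table described above does.

```python
def encode_lenient(text, codec="mac_roman"):
    """Encode text, folding U+00A4 CURRENCY SIGN onto U+20AC EURO SIGN first.

    Both characters end up on the byte the codec assigns to the euro sign,
    so decoding always yields the euro sign.
    """
    return text.replace("\u00a4", "\u20ac").encode(codec)


euro_bytes = encode_lenient("\u20ac")
currency_bytes = encode_lenient("\u00a4")
round_trip = currency_bytes.decode("mac_roman")

print(euro_bytes, currency_bytes, repr(round_trip))
```

The trade-off is the one the message hints at: the mapping is no longer reversible for the currency sign, so the substitution is silent and lossy.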
|||
msg97837 - (view) | Author: Martin von Gagern (gagern) | Date: 2010-01-15 19:23 | |
Find attached (issue843590_rfc.patch) an implementation of the macintosh encoding as the RFC defines it. I don't suggest its inclusion; I would prefer the alias over this implementation, but either one is better than no 'macintosh' encoding at all. So if you really want that, here it is. |
|||
msg97840 - (view) | Author: Martin von Gagern (gagern) | Date: 2010-01-15 19:36 | |
And this patch (issue843590_alias.patch) is the alternative, 'macintosh' as an alias to 'mac_roman' as originally requested, along with a bunch of aliases registered with IANA. I'd prefer this approach over the preceding one, and hope someone will review this for inclusion. |
|||
msg98005 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2010-01-18 11:58 | |
Here's another reference I found: http://developer.apple.com/legacy/mac/library/documentation/mac/Text/Text-30.html It appears that the "macintosh" encoding is the same as the MacRoman one, but without the characters D9-FF. The document also suggests that it's a really old encoding. Here's a comparison of various Mac Roman mappings: http://www.haible.de/bruno/charsets/conversion-tables/Mac-Roman.html These include the "macintosh" charset name as well. For all practical purposes, it appears to be safe to alias "macintosh" to "mac-roman" and also add the other suggested aliases from the IANA registry. |
|||
msg114297 - (view) | Author: Mark Lawrence (BreamoreBoy) * | Date: 2010-08-18 23:16 | |
@Marc-Andre, as there have been no comments since your last post, would you like to take this forward? Cheers. |
|||
msg114410 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2010-08-19 20:56 | |
Mark Lawrence wrote: "@Marc-Andre as there's no comments since your last post would you like to take this forward, cheers." I'm fine with adding the alias, but currently don't have any cycles left to actually do the checkins, add the Misc/NEWS entry, update the docs, etc. |
|||
msg114475 - (view) | Author: Benjamin Peterson (benjamin.peterson) * | Date: 2010-08-21 02:55 | |
r84229 |
|||
msg114481 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2010-08-21 09:40 | |
Benjamin Peterson wrote: "r84229" Thanks, Benjamin! |
|||
msg115205 - (view) | Author: Martin von Gagern (gagern) | Date: 2010-08-30 11:11 | |
Maybe I'm missing something here, but r84229 looks to me like aliasing 'macintosh' to itself, instead of to 'mac_roman'. 'csmacintosh' and 'mac' are not included at all, without any comment as to why they have been omitted. Makes me wonder why my issue843590_alias.patch wasn't applied as it is, but recreated instead. |
|||
msg115408 - (view) | Author: Ned Deily (ned.deily) * | Date: 2010-09-02 23:13 | |
Martin, the typo was fixed subsequently by r84231. |
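Whether the alias ended up pointing at the right codec can be checked on a given interpreter with a sketch like the one below. It assumes a Python version that includes the fix discussed above; `resolves` is a hypothetical helper, and the lookups for 'mac' and 'csmacintosh' are only reported, since the thread notes they were not part of the commit.

```python
import codecs
from encodings.aliases import aliases


def resolves(name):
    """Return the canonical codec name for an encoding name, or None."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


# The alias table maps normalized alias names to codec module names.
alias_target = aliases.get("macintosh")
same_codec = resolves("macintosh") == resolves("mac_roman")

print("macintosh ->", alias_target, "| same codec as mac_roman:", same_codec)
for optional in ("mac", "csmacintosh"):
    print(optional, "->", resolves(optional))
```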
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:01 | admin | set | github: 39560 |
2010-09-02 23:13:08 | ned.deily | set | nosy: + ned.deily messages: + msg115408 |
2010-08-30 11:11:46 | gagern | set | messages: + msg115205 |
2010-08-21 09:40:23 | lemburg | set | messages: + msg114481 |
2010-08-21 02:55:00 | benjamin.peterson | set | status: open -> closed nosy: + benjamin.peterson messages: + msg114475 |
2010-08-20 17:54:08 | amaury.forgeotdarc | set | keywords: + easy resolution: accepted |
2010-08-19 20:56:13 | lemburg | set | messages: + msg114410 |
2010-08-18 23:16:07 | BreamoreBoy | set | versions: + Python 3.2 nosy: + BreamoreBoy messages: + msg114297 stage: patch review |
2010-01-18 11:58:32 | lemburg | set | messages: + msg98005 |
2010-01-15 19:36:54 | gagern | set | files: + issue843590_alias.patch messages: + msg97840 |
2010-01-15 19:23:51 | gagern | set | files: + issue843590_rfc.patch keywords: + patch messages: + msg97837 |
2009-02-26 23:06:52 | gagern | set | messages: + msg82784 |
2009-02-08 18:56:05 | gagern | set | files: + compare.pl nosy: + gagern messages: + msg81407 |
2003-11-17 09:29:00 | zenzen | create |