classification
Title: 'macintosh' encoding alias for 'mac_roman'
Type: enhancement
Stage: patch review
Components: Unicode
Versions: Python 3.2
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To:
Nosy List: BreamoreBoy, benjamin.peterson, gagern, lemburg, ned.deily, yenzenz, zenzen
Priority: normal
Keywords: easy, patch

Created on 2003-11-17 09:29 by zenzen, last changed 2010-09-02 23:13 by ned.deily. This issue is now closed.

Files
File name Uploaded Description
compare.pl gagern, 2009-02-08 18:56 Script to compare charset definitions.
issue843590_rfc.patch gagern, 2010-01-15 19:23 encoding as the RFC defines it
issue843590_alias.patch gagern, 2010-01-15 19:36 macintosh as alias to mac_roman
Messages (18)
msg61134 - (view) Author: Stuart Bishop (zenzen) Date: 2003-11-17 09:29
OS X's Mail.app can generate Subject lines like:
Subject: =?MACINTOSH?B?vLu7vMGqo6KwpKalu7w=?=

(Which decodes to 
'\xbc\xbb\xbb\xbc\xc1\xaa\xa3\xa2\xb0\xa4\xa6\xa5\xbb\xbc')

This appears to be what Python calls the mac_roman
encoding. I suggest adding 'macintosh' as an alias to
'mac_roman' to encodings/aliases.py to allow the email
package to decode these headers.
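
A minimal sketch of the decoding described above, assuming that mac_roman really is the intended charset. `decode_header()` only undoes the base64 layer; the explicit `mac_roman` decode is what the requested alias would make unnecessary.

```python
from email.header import decode_header

subject = "=?MACINTOSH?B?vLu7vMGqo6KwpKalu7w=?="

# decode_header() returns the raw bytes of the encoded word together with
# the charset label found in the header; it does not decode to text.
raw, charset = decode_header(subject)[0]

# Decode explicitly as mac_roman, so this also works on a Python that
# does not know the 'macintosh' label.
text = raw.decode("mac_roman")
print(charset, ascii(text))
```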
msg61135 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-11-17 10:12
Are you sure ? The decoded string you give does not look
like anything readable...
msg61136 - (view) Author: Stuart Bishop (zenzen) Date: 2003-11-17 10:47
The test was just a sequence of random high-bit characters:

ºªªº¡™£¢?§¶•ªº

(let's see if the web interface lets that through).
msg61137 - (view) Author: Jens Klein (yenzenz) Date: 2004-12-18 22:49
+1 from me

Archetypes (a Zope framework) also runs into a problem because of the
missing alias.

More info:
https://sourceforge.net/tracker/index.php?func=detail&aid=1068001&group_id=75272&atid=543430
msg61138 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-18 23:01
I have no problem adding aliases to the encodings package,
but please provide some reference that this actually is a
valid alias for the mac_roman encoding. There are quite a
few other mac_* encodings to choose from as well, so the
choice is not obvious to me.
msg61139 - (view) Author: Jens Klein (yenzenz) Date: 2004-12-19 20:09
It seems to be a bit more difficult:
The encoding 'macintosh' is registered by IANA[1] (nicely formatted in [2]) and is
covered by RFC1345[3].

Name: macintosh [RFC1345,KXS2]
MIBenum: 2027
Source: The Unicode Standard ver1.0, ISBN 0-201-56788-1, Oct 1991
Alias: mac
Alias: csMacintosh

[1]http://www.iana.org/assignments/character-sets
[2]http://www.cs.tut.fi/~jkorpela/chars/sorted.html
[3]http://www.faqs.org/rfcs/rfc1345.html

So much for the hard facts from the specification side. None of these specs
mention mac_roman etc. So what now?

At [4] I found a hint of the alias in the popular program 'recode'. The author
there uses the IANA-registered macintosh as an alias for mac_roman:

DEFENCODING(( "MacRoman",               /* JDK 1.1 */
              /* This is the best table for MACINTOSH. The ones */
              /* in glibc and FreeBSD-iconv are bad quality. */
              "MACINTOSH",              /* IANA */
              "MAC",                    /* IANA */
              "csMacintosh",            /* IANA */
            ),
            mac_roman,
            { mac_roman_mbtowc },         { mac_roman_wctomb, NULL })

[4]http://recode.progiciels-bpi.ca/showfile.html?name=fusion/recode-3.6/libiconv/encodings.def

Because of that (I trust recode somehow) I would propose to add macintosh
as an alias for mac_roman.
msg61140 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-20 10:38
Thanks for the research. Since the "macintosh" character set
is defined in the RFC 1345 and the mac_roman encoding in
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT
could you compare the two and check whether they are in fact
the same mapping ?

Note: Aliases for mappings are often implemented in a rather
careless way - we want to make sure that things we alias are
indeed correct aliases. Otherwise it would be better to
add a new codec for 'macintosh'.

Thanks.
msg81407 - (view) Author: Martin von Gagern (gagern) Date: 2009-02-08 18:56
I had my first indication to rather use "macintosh" instead of
"mac_roman" from Wikipedia http://en.wikipedia.org/wiki/Mac_OS_Roman
which states that the charset part of a MIME content-type specification
should be macintosh. I'm not quoting this as any kind of authority, but
rather to point out that it is likely for people to use this.

I did a comparison of http://tools.ietf.org/rfc/rfc1345.txt (RFC) and
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT (UNI)
using the attached perl script. The results:
3 codepoints unused in RFC but defined in UNI: f0, f6, f7
1 codepoint unused in UNI but defined in RFC: 7f
2 codepoints with slightly different character names, same meaning
9 codepoints with actually different definitions:

 a5: rfc 2219 BULLET OPERATOR
     uni 2022 BULLET
 c4: rfc e023 DUTCH GUILDER SIGN (IBM437 159)
     uni 0192 LATIN SMALL LETTER F WITH HOOK
 c6: rfc 0394 GREEK CAPITAL LETTER DELTA
     uni 2206 INCREMENT
 c9: rfc 22ef MIDLINE HORIZONTAL ELLIPSIS
     uni 2026 HORIZONTAL ELLIPSIS
 d0: rfc 2014 EM DASH
     uni 2013 EN DASH
 d1: rfc 2013 EN DASH
     uni 2014 EM DASH
 d7: rfc 25c6 BLACK DIAMOND
     uni 25ca LOZENGE
 db: rfc 00a4 CURRENCY SIGN
     uni 20ac EURO SIGN
 f8: rfc 203e OVERLINE
     uni 00af MACRON

a5 and c6 could be different interpretations of symbols that look pretty
much the same. The introduction of the euro sign instead of the generic
currency sign seems to be a recent modification documented in UNI. The
change of the order of the dashes seems really confusing.
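
As a rough way to see which side Python's existing codec takes on the nine code points above, the following sketch compares the built-in mac_roman codec against the "uni" column of the list. The dictionary is copied from that list; `python_mapping` is a hypothetical helper name. Since the codec is said to follow ROMAN.TXT, I would expect no differences, but that is an assumption the script checks rather than a given.

```python
import unicodedata

# Code points taken from the ROMAN.TXT ("uni") column of the list above.
UNI = {
    0xA5: 0x2022, 0xC4: 0x0192, 0xC6: 0x2206,
    0xC9: 0x2026, 0xD0: 0x2013, 0xD1: 0x2014,
    0xD7: 0x25CA, 0xDB: 0x20AC, 0xF8: 0x00AF,
}


def python_mapping(byte):
    """Return the code point Python's mac_roman codec assigns to one byte."""
    return ord(bytes([byte]).decode("mac_roman"))


# Collect every byte where the codec disagrees with the "uni" column.
mismatches = {b for b, cp in UNI.items() if python_mapping(b) != cp}

for byte in sorted(UNI):
    cp = python_mapping(byte)
    print(f"{byte:02x}: {cp:04x} {unicodedata.name(chr(cp))}")
print("bytes differing from ROMAN.TXT:", sorted(mismatches))
```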

Notice also this line in the RFC:
&rem source: The Unicode Standard ver1.0, ISBN 0-201-56788-1, Oct 1991
So it looks like the RFC used the Unicode definition as its source. Which
part of it I'm not sure, and where the differences come from I'm even less sure.

My next steps:
* Look for further references, e.g. from apple, and compare them as well
* Try some things out on a mac, see how it behaves in real life
* Compare all this to the current python implementation
* Write a patch to either provide an alias or a new charset "macintosh"
Help welcome.
msg82784 - (view) Author: Martin von Gagern (gagern) Date: 2009-02-26 23:06
I did some further investigations here. Apple doesn't seem likely to
offer any authoritative reference for the "macintosh" encoding, because
all they ever seem to talk about is "Roman". The only source for
"macintosh" I could find is this RFC 1345, with the listed differences.
The RFC states the Unicode 1.0 standard as its source. Yesterday I went
to the library and thumbed through that volume. That, too, talks about
the different macintosh encodings, one of which is called "Roman" and
matches the one from current Unicode standards, except for 0xdb which
used to be the currency sign back then but is euro now. On 2009-02-09 I
also tried to ask Keld Simonsen, the author of the RFC, about this whole
issue. I got no reply so far.

On the whole, I get the impression that the "macintosh" encoding from
RFC 1345 is pretty much without actual use. I see no real world
application which actually uses it as it is defined, as most users
intend it as the IANA-registered name for mac-roman.

Python has two options, I believe. We could either do this by the book,
and implement an encoding as it was defined, even though there is no
known real world application of that exact charset. Or we could be
pragmatic, and postulate that the RFC is simply wrong, and every real
world occurrence of "macintosh" intends to refer to mac-roman, in which
case an alias would be appropriate. I would say, let's be pragmatic.

When converting from unicode to macintosh, it might be possible to
accommodate both mappings, and in this way avoid unmappable characters.
As this doesn't deal well with the switched dashes, I guess I'd rather
not do this, in order to avoid subtle issues from going undetected. It
might be a good idea, however, to map both currency sign and euro to the
same byte, and choose one when mapping back to unicode.
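
The currency sign idea could be sketched roughly as follows. This is only an illustration, not part of any attached patch: `encode_macintosh_lenient` is a hypothetical helper, and it assumes that Python's mac_roman codec maps byte 0xdb to the euro sign and has no slot for U+00A4.

```python
# Hypothetical helper, not taken from any patch on this issue.
CURRENCY_SIGN = "\u00a4"
EURO_SIGN = "\u20ac"


def encode_macintosh_lenient(text):
    """Encode to mac_roman, folding CURRENCY SIGN into the euro's byte."""
    return text.replace(CURRENCY_SIGN, EURO_SIGN).encode("mac_roman")


data = encode_macintosh_lenient("10 \u00a4 / 10 \u20ac")
print(data)

# Mapping back picks one of the two characters (here: the euro sign),
# so the round trip is deliberately lossy for U+00A4.
print(ascii(data.decode("mac_roman")))
```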

I don't think I can contribute much more information to this issue, and
seeing as it has been open for years without much input, I take it
neither will others. So I guess it is time to make a choice based on the
information available. By the book, or pragmatic?
msg97837 - (view) Author: Martin von Gagern (gagern) Date: 2010-01-15 19:23
Find attached (issue843590_rfc.patch) an implementation of the macintosh encoding as the RFC defines it. I don't suggest its inclusion; I would prefer the alias over this implementation, but either one is better than no 'macintosh' encoding at all. So if you really want that, here it is.
msg97840 - (view) Author: Martin von Gagern (gagern) Date: 2010-01-15 19:36
And this patch (issue843590_alias.patch) is the alternative, 'macintosh' as an alias to 'mac_roman' as originally requested, along with a bunch of aliases registered with IANA. I'd prefer this approach over the preceding one, and hope someone will maybe review this for inclusion.
msg98005 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-18 11:58
Here's another reference I found:

http://developer.apple.com/legacy/mac/library/documentation/mac/Text/Text-30.html

It appears that the "macintosh" encoding is the same as the MacRoman one, but without the characters D9-FF. The document also suggests that it's a really old encoding.

Here's a comparison of various Mac Roman mappings:

http://www.haible.de/bruno/charsets/conversion-tables/Mac-Roman.html

These include the "macintosh" charset name as well.

For all practical purposes, it appears to be safe to alias "macintosh" to "mac-roman" and also add the other suggested aliases from the IANA registry.
msg114297 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-18 23:16
@Marc-Andre as there's no comments since your last post would you like to take this forward, cheers.
msg114410 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-08-19 20:56
Mark Lawrence wrote:
> 
> Mark Lawrence <breamoreboy@yahoo.co.uk> added the comment:
> 
> @Marc-Andre as there's no comments since your last post would you like to take this forward, cheers.

I'm fine with adding the alias, but currently don't have any cycles
left to actually do the checkins, add the Misc/NEWS entry, update
the docs, etc.
msg114475 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-08-21 02:55
r84229
msg114481 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-08-21 09:40
Benjamin Peterson wrote:
> 
> Benjamin Peterson <benjamin@python.org> added the comment:
> 
> r84229

Thanks, Benjamin !
msg115205 - (view) Author: Martin von Gagern (gagern) Date: 2010-08-30 11:11
Maybe I'm missing something here, but r84229 looks to me like aliasing 'macintosh' to itself, instead of to 'mac_roman'. 'csmacintosh' and 'mac' are not included at all, without any comment as to why they have been omitted. Makes me wonder why my issue843590_alias.patch wasn't applied as it is, but recreated instead.
msg115408 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2010-09-02 23:13
Martin, the typo was fixed subsequently by r84231.
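
For anyone wanting to confirm the result on their own interpreter, a quick check along these lines should do. It assumes a Python that includes the fixed alias; nothing is assumed about 'mac' and 'csmacintosh', which are simply reported as found or not found.

```python
import codecs

# 'mac' and 'csmacintosh' are the other IANA names; whether they resolve
# depends on the Python version in use, so failures are reported, not fatal.
for name in ("mac_roman", "macintosh", "mac", "csmacintosh"):
    try:
        print(f"{name!r} -> {codecs.lookup(name).name!r}")
    except LookupError:
        print(f"{name!r} -> not a known encoding")

# With the alias in place, both names resolve to the same codec.
same_codec = codecs.lookup("macintosh").name == codecs.lookup("mac_roman").name
print("macintosh is an alias of mac_roman:", same_codec)
```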
History
Date User Action Args
2010-09-02 23:13:08 ned.deily set nosy: + ned.deily; messages: + msg115408
2010-08-30 11:11:46 gagern set messages: + msg115205
2010-08-21 09:40:23 lemburg set messages: + msg114481
2010-08-21 02:55:00 benjamin.peterson set status: open -> closed; nosy: + benjamin.peterson; messages: + msg114475
2010-08-20 17:54:08 amaury.forgeotdarc set keywords: + easy; resolution: accepted
2010-08-19 20:56:13 lemburg set messages: + msg114410
2010-08-18 23:16:07 BreamoreBoy set versions: + Python 3.2; nosy: + BreamoreBoy; messages: + msg114297; stage: patch review
2010-01-18 11:58:32 lemburg set messages: + msg98005
2010-01-15 19:36:54 gagern set files: + issue843590_alias.patch; messages: + msg97840
2010-01-15 19:23:51 gagern set files: + issue843590_rfc.patch; keywords: + patch; messages: + msg97837
2009-02-26 23:06:52 gagern set messages: + msg82784
2009-02-08 18:56:05 gagern set files: + compare.pl; nosy: + gagern; messages: + msg81407
2003-11-17 09:29:00 zenzen create