Author gagern
Recipients gagern, lemburg, yenzenz, zenzen
Date 2009-02-08.18:56:02
Content
My first hint that "macintosh" might be preferable to "mac_roman" came
from Wikipedia, http://en.wikipedia.org/wiki/Mac_OS_Roman, which states
that the charset part of a MIME content-type specification should be
"macintosh". I'm not quoting this as any kind of authority, but rather to
point out that people are likely to use this name.

I did a comparison of http://tools.ietf.org/rfc/rfc1345.txt (RFC) and
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT (UNI)
using the attached Perl script. The results:
3 codepoints unused in RFC but defined in UNI: f0, f6, f7
1 codepoint unused in UNI but defined in RFC: 7f
2 codepoints with slightly different character names, same meaning
9 codepoints with actually different definitions:

 a5: rfc 2219 BULLET OPERATOR
     uni 2022 BULLET
 c4: rfc e023 DUTCH GUILDER SIGN (IBM437 159)
     uni 0192 LATIN SMALL LETTER F WITH HOOK
 c6: rfc 0394 GREEK CAPITAL LETTER DELTA
     uni 2206 INCREMENT
 c9: rfc 22ef MIDLINE HORIZONTAL ELLIPSIS
     uni 2026 HORIZONTAL ELLIPSIS
 d0: rfc 2014 EM DASH
     uni 2013 EN DASH
 d1: rfc 2013 EN DASH
     uni 2014 EM DASH
 d7: rfc 25c6 BLACK DIAMOND
     uni 25ca LOZENGE
 db: rfc 00a4 CURRENCY SIGN
     uni 20ac EURO SIGN
 f8: rfc 203e OVERLINE
     uni 00af MACRON

a5 and c6 could be different interpretations of symbols that look pretty
much the same. The euro sign replacing the generic currency sign at db
seems to be a recent modification documented in UNI. The swapped order of
the dashes at d0 and d1 is really confusing.
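
For anyone who wants to reproduce this kind of comparison without the
attached script, here is a minimal Python 3 sketch. It is not the Perl
script itself: the helper name compare_mappings and the two sample
dictionaries are made up for illustration, and the sample data holds only
the nine differing entries copied from the list above, not the full
tables. It compares code points only, not character names.

```python
def compare_mappings(rfc, uni):
    """Compare two {byte value: Unicode code point} tables.

    Returns (only_rfc, only_uni, differing), where differing maps each
    byte value defined in both tables to its (rfc, uni) code point pair
    whenever the two code points disagree.
    """
    only_rfc = sorted(set(rfc) - set(uni))
    only_uni = sorted(set(uni) - set(rfc))
    differing = {
        byte: (rfc[byte], uni[byte])
        for byte in sorted(set(rfc) & set(uni))
        if rfc[byte] != uni[byte]
    }
    return only_rfc, only_uni, differing


# Illustrative data: only the nine differing entries reported above.
RFC_SAMPLE = {0xA5: 0x2219, 0xC4: 0xE023, 0xC6: 0x0394, 0xC9: 0x22EF,
              0xD0: 0x2014, 0xD1: 0x2013, 0xD7: 0x25C6, 0xDB: 0x00A4,
              0xF8: 0x203E}
UNI_SAMPLE = {0xA5: 0x2022, 0xC4: 0x0192, 0xC6: 0x2206, 0xC9: 0x2026,
              0xD0: 0x2013, 0xD1: 0x2014, 0xD7: 0x25CA, 0xDB: 0x20AC,
              0xF8: 0x00AF}

only_rfc, only_uni, differing = compare_mappings(RFC_SAMPLE, UNI_SAMPLE)
for byte, (rfc_cp, uni_cp) in differing.items():
    print(f"{byte:02x}: rfc {rfc_cp:04x}  uni {uni_cp:04x}")
```

Feeding it the full tables parsed from the two files would also fill in
the "unused in one, defined in the other" lists.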

Notice also this line in the RFC:
&rem source: The Unicode Standard ver1.0, ISBN 0-201-56788-1, Oct 1991
So it looks like the RFC used the Unicode definition as its source. Which
part of it I'm not sure, and where the differences come from I'm even
less sure.

My next steps:
* Look for further references, e.g. from Apple, and compare them as well
* Try some things out on a Mac to see how it behaves in real life
* Compare all this to the current Python implementation
* Write a patch to provide either an alias or a new charset "macintosh"
Help welcome.
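
The "compare to the current Python implementation" step could be
sketched roughly as follows (Python 3 syntax). The names CANDIDATES,
classify and has_codec are hypothetical helpers, and the table contains
only the nine disputed code points from the comparison above. The sketch
makes no claim about what the codec actually returns; it just reports
which of the two tables it agrees with.

```python
import codecs

# byte value -> (RFC 1345 code point, Apple ROMAN.TXT code point),
# copied from the nine differing entries listed in this message.
CANDIDATES = {
    0xA5: (0x2219, 0x2022), 0xC4: (0xE023, 0x0192), 0xC6: (0x0394, 0x2206),
    0xC9: (0x22EF, 0x2026), 0xD0: (0x2014, 0x2013), 0xD1: (0x2013, 0x2014),
    0xD7: (0x25C6, 0x25CA), 0xDB: (0x00A4, 0x20AC), 0xF8: (0x203E, 0x00AF),
}


def classify(codec_name="mac_roman"):
    """Report, per disputed byte, which table the codec agrees with."""
    result = {}
    for byte, (rfc_cp, uni_cp) in CANDIDATES.items():
        try:
            decoded = ord(bytes([byte]).decode(codec_name))
        except UnicodeDecodeError:
            result[byte] = "undefined"
            continue
        if decoded == uni_cp:
            result[byte] = "uni"
        elif decoded == rfc_cp:
            result[byte] = "rfc"
        else:
            result[byte] = "neither"
    return result


def has_codec(name):
    """Return True if the running Python knows a codec or alias `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


for byte, verdict in classify().items():
    print(f"{byte:02x}: {verdict}")
print("macintosh known:", has_codec("macintosh"))
```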