Author gagern
Recipients gagern, lemburg, yenzenz, zenzen
Date 2009-02-26.23:06:49
Message-id <1235689614.98.0.418526753967.issue843590@psf.upfronthosting.co.za>
In-reply-to
Content
I did some further investigation here. Apple doesn't seem likely to
offer any authoritative reference for the "macintosh" encoding, because
all they ever seem to talk about is "Roman". The only source for
"macintosh" I could find is RFC 1345, with the listed differences.
The RFC cites the Unicode 1.0 standard as its source. Yesterday I went
to the library and thumbed through that volume. That, too, talks about
the different Macintosh encodings, one of which is called "Roman" and
matches the one from current Unicode standards, except for 0xdb, which
was the currency sign back then but is the euro sign now. On 2009-02-09
I also tried to ask Keld Simonsen, the author of the RFC, about this
whole issue. I have had no reply so far.

On the whole, I get the impression that the "macintosh" encoding from
RFC 1345 is pretty much without actual use. I know of no real-world
application that uses it as defined there; most users intend the name
as the IANA-registered name for mac-roman.

Python has two options, I believe. We could either do this by the book,
and implement the encoding as it was defined, even though there is no
known real-world application of that exact charset. Or we could be
pragmatic, and postulate that the RFC is simply wrong, and that every
real-world occurrence of "macintosh" intends to refer to mac-roman, in
which case an alias would be appropriate. I would say, let's be pragmatic.
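
To make the pragmatic option concrete, here is a minimal sketch of what an
alias amounts to, written as a user-level codec search function rather than
as a change to Python's own alias table. The name "x_macintosh_demo" and the
function _search_alias are hypothetical, chosen only so the example does not
depend on which aliases a given Python version already ships.

```python
import codecs

# Hypothetical alias name, used only for this illustration.
ALIAS = "x_macintosh_demo"


def _search_alias(name):
    """Resolve the demo alias to the existing mac_roman codec."""
    # Depending on the Python version, the name passed in may or may not
    # have hyphens converted to underscores, so normalize it here.
    if name.lower().replace("-", "_") == ALIAS:
        return codecs.lookup("mac_roman")
    return None  # let other search functions handle everything else


codecs.register(_search_alias)

# Both names now go through the same mapping tables.
text = "caf\u00e9"
same_bytes = text.encode(ALIAS) == text.encode("mac_roman")
```

With such an alias in place, nothing about the mapping itself changes; the
second name simply resolves to the mac-roman tables.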

When converting from Unicode to macintosh, it might be possible to
accommodate both mappings, and in this way avoid unmappable characters.
As this doesn't deal well with the switched dashes, I guess I'd rather
not do this, in order to keep subtle issues from going undetected. It
might be a good idea, however, to map both the currency sign and the
euro sign to the same byte, and choose one when mapping back to Unicode.
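
The currency/euro idea could look roughly like the sketch below. It assumes
that Python's mac_roman codec maps 0xdb to the euro sign, as described above
for the current tables, and it picks the euro sign when decoding. The helper
names encode_lenient and decode_macintosh are hypothetical, not part of any
existing API.

```python
CURRENCY_SIGN = "\u00a4"  # meaning of 0xdb in the older tables
EURO_SIGN = "\u20ac"      # meaning of 0xdb in the current tables


def encode_lenient(text):
    """Encode to mac_roman, sending both currency sign and euro to 0xdb."""
    # Fold the currency sign into the euro sign first, so that both
    # characters end up on the same byte instead of one being unmappable.
    return text.replace(CURRENCY_SIGN, EURO_SIGN).encode("mac_roman")


def decode_macintosh(data):
    """Decode from mac_roman; 0xdb always comes back as the euro sign."""
    return data.decode("mac_roman")


# The round trip is lossy by design: a currency sign comes back as a euro.
round_trip = decode_macintosh(encode_lenient(CURRENCY_SIGN))
```

The dashes are deliberately left alone here, since silently swapping them is
exactly the kind of subtle issue that should not go undetected.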

I don't think I can contribute much more information to this issue, and
seeing as it has been open for years without much input, I take it
neither will others. So I guess it is time to make a choice based on the
information available. By the book, or pragmatic?
History
Date User Action Args
2009-02-26 23:06:55  gagern  set  recipients: + gagern, lemburg, zenzen, yenzenz
2009-02-26 23:06:54  gagern  set  messageid: <1235689614.98.0.418526753967.issue843590@psf.upfronthosting.co.za>
2009-02-26 23:06:52  gagern  link  issue843590 messages
2009-02-26 23:06:50  gagern  create