Message 82784 - Python tracker



Author	gagern
Recipients	gagern, lemburg, yenzenz, zenzen
Date	2009-02-26.23:06:49
SpamBayes Score	1.8929303e-14
Marked as misclassified	No
Message-id	<1235689614.98.0.418526753967.issue843590@psf.upfronthosting.co.za>
In-reply-to

Content

I did some further investigation here. Apple doesn't seem likely to
offer any authoritative reference for the "macintosh" encoding, because
all they ever seem to talk about is "Roman". The only source for
"macintosh" I could find is this RFC 1345, with the listed differences.
The RFC states the Unicode 1.0 standard as its source. Yesterday I went
to the library and thumbed through that volume. That, too, talks about
the different macintosh encodings, one of which is called "Roman" and
matches the one from current Unicode standards, except for 0xdb, which
used to be the currency sign (U+00A4) back then but is the euro sign
(U+20AC) now. On 2009-02-09 I also tried to ask Keld Simonsen, the
author of the RFC, about this whole issue. I have had no reply so far.

On the whole, I get the impression that the "macintosh" encoding from
RFC 1345 is pretty much without actual use. I see no real-world
application which actually uses it as it is defined; most users seem to
intend it as the IANA-registered name for mac-roman.

Python has two options, I believe. We could either do this by the book,
and implement the encoding as it was defined, even though there is no
known real-world application of that exact charset. Or we could be
pragmatic, and postulate that the RFC is simply wrong, and that every
real-world occurrence of "macintosh" intends to refer to mac-roman, in
which case an alias would be appropriate. I would say, let's be pragmatic.
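
To illustrate what the pragmatic option amounts to, here is a minimal
sketch that assumes the existing mac_roman codec is the intended target.
The name "xmacintosh" and the function _find_macintosh_alias are
hypothetical, picked so the sketch does not collide with anything the
codec registry may already know. A real fix would more likely be a
single entry in the encodings.aliases table.

```python
import codecs


def _find_macintosh_alias(name):
    """Codec search function: resolve a hypothetical alias to mac_roman."""
    # "xmacintosh" is an illustrative name only, not a registered charset.
    if name == "xmacintosh":
        return codecs.lookup("mac_roman")
    # Returning None lets the registry try the other search functions.
    return None


codecs.register(_find_macintosh_alias)

# Under the alias, 0xdb is the euro sign, as in current mac-roman tables.
euro_bytes = "\u20ac".encode("xmacintosh")
euro_text = b"\xdb".decode("xmacintosh")
```

With the alias in place, both directions behave exactly like mac_roman;
nothing from the RFC 1345 table is implemented.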

When converting from unicode to macintosh, it might be possible to
accommodate both mappings, and in this way avoid unmappable characters.
As this doesn't deal well with the switched dashes, I guess I'd rather
not do this, in order to keep subtle issues from going undetected. It
might be a good idea, however, to map both the currency sign and the
euro to the same byte, and choose one when mapping back to unicode.
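
The currency/euro idea could look roughly like the following sketch. It
is not a proposed patch: the helper names encode_macintosh_lenient and
decode_macintosh are hypothetical, and it assumes the existing mac_roman
codec (where 0xdb is the euro sign) as the base table. The dashes are
deliberately left alone.

```python
CURRENCY_SIGN = "\u00a4"  # meaning of 0xdb in the older table
EURO_SIGN = "\u20ac"      # meaning of 0xdb in current mac-roman


def encode_macintosh_lenient(text, errors="strict"):
    """Encode to mac_roman, folding the currency sign onto the euro byte."""
    # Hypothetical helper: both characters end up as the single byte 0xdb.
    return text.replace(CURRENCY_SIGN, EURO_SIGN).encode("mac_roman", errors)


def decode_macintosh(data, errors="strict"):
    """Decode from mac_roman; 0xdb always comes back as the euro sign."""
    return data.decode("mac_roman", errors)
```

The round trip is lossy by design: a currency sign that goes in comes
back out as a euro sign, which is the "choose one" part of the idea.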

I don't think I can contribute much more information to this issue, and
seeing as it has been open for years without much input, I take it
neither will others. So I guess it is time to make a choice based on the
information available. By the book, or pragmatic?

History
Date	User	Action	Args
2009-02-26 23:06:55	gagern	set	recipients: + gagern, lemburg, zenzen, yenzenz
2009-02-26 23:06:54	gagern	set	messageid: <1235689614.98.0.418526753967.issue843590@psf.upfronthosting.co.za>
2009-02-26 23:06:52	gagern	link	issue843590 messages
2009-02-26 23:06:50	gagern	create