Message 135785 - Python tracker


Author	terry.reedy
Recipients	cdqzzy, ezio.melotti, lemburg, terry.reedy, vstinner
Date	2011-05-11.17:05:27
Message-id	<1305133530.72.0.884287770563.issue12057@psf.upfronthosting.co.za>
In-reply-to

Content
Looking at cjkencodings.py, the format is pretty clear. The file consists of one statement that creates one dict that maps encoding names to a pair of (encoded) byte strings. The bytes literals are entirely hex escapes, with a maximum of 16 per chunk (line). From the usage you deduced that the first is encoded with the named encoding and the second is encoded with utf-8. (For anyone wondering, a separate utf-8 string is needed for each encoding because each of the other encodings is limited to a different subset of Unicode chars.)
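To make the layout concrete, here is a minimal sketch of a dict in that shape together with the check the usage implies. The name `samples`, the helper `check_pair`, and the two-character sample text are illustrative assumptions, not taken from the actual file.

```python
# Hypothetical miniature of the mapping described above: each encoding name
# maps to a pair (text in that encoding, the same text in utf-8).
samples = {
    "gb2312": (
        b"\xd6\xd0\xce\xc4",          # two CJK characters encoded with gb2312
        b"\xe4\xb8\xad\xe6\x96\x87",  # the same two characters encoded with utf-8
    ),
}


def check_pair(name, pair):
    """Return True if both byte strings decode to the same text."""
    native, utf8 = pair
    return native.decode(name) == utf8.decode("utf-8")


for name, pair in samples.items():
    assert check_pair(name, pair), name
```

Because each pair carries its own utf-8 string, every entry can use text restricted to what that particular encoding is able to represent.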

So I am not entirely convinced that pulling the file apart is a clear win. Another entry could be added (the file is formatted with that possibility in mind), but it would certainly be much easier if the original formatting program were available. I do have a couple of questions.
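In the absence of the original formatting program, something like the following could produce an entry in the same style. This is only a sketch under assumptions: the function names, the indentation, and the exact entry layout are guesses based on the description above (hex escapes, at most 16 per line), not a reconstruction of the original tool.

```python
def format_bytes_literal(data, per_line=16, indent="    "):
    """Render bytes as adjacent bytes literals made only of hex escapes,
    with at most per_line escapes on each line."""
    if not data:
        return indent + 'b""'
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        escapes = "".join("\\x%02x" % byte for byte in chunk)
        lines.append('%sb"%s"' % (indent, escapes))
    return "\n".join(lines)


def format_entry(name, text):
    """Render one dict entry: name -> (text in that encoding, text in utf-8)."""
    native = format_bytes_literal(text.encode(name))
    utf8 = format_bytes_literal(text.encode("utf-8"))
    return "'%s': (\n%s,\n%s),\n" % (name, native, utf8)


print(format_entry("gb2312", "\u4e2d\u6587"))
```

Adjacent bytes literals are concatenated by the Python parser, so a long string split across several lines inside the tuple still evaluates to a single bytes object.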

1. Did one of us create the test strings (if so, how), or do they come from an authoritative source (like the Unicode site) that created and checked them with its reference implementations? If the latter, the missing pair *is* a puzzle. Also, if the latter, is there any possibility that we would need to get new test strings from that source? Or are the limitations of these codings definitely fixed?

2. If you create a test file for the hz codec with the hz codec, how do we know it is correct? It would only serve to detect changes in the future.
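The concern in question 2 can be illustrated with a small sketch. The helper names and sample text are hypothetical; the point is only that data produced by a codec is consistent with that codec by construction.

```python
def make_pair(name, text):
    """Build test data using the very codec that is to be tested."""
    return text.encode(name), text.encode("utf-8")


def pair_is_consistent(name, pair):
    """The check a test would run: both byte strings decode to the same text."""
    native, utf8 = pair
    return native.decode(name) == utf8.decode("utf-8")


pair = make_pair("hz", "\u4e2d\u6587")

# This passes whether or not the codec follows the hz specification, because
# the expected bytes came from the codec itself. It can only flag a later
# change in behavior, not an existing bug.
assert pair_is_consistent("hz", pair)
```

Data taken from an independent implementation or an authoritative source would not have this circularity.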
