Author terry.reedy
Recipients cdqzzy, ezio.melotti, lemburg, terry.reedy, vstinner
Date 2011-05-11.17:05:27
Looking at the file, the format is pretty clear. It consists of one statement that creates one dict mapping each encoding name to a pair of (encoded) byte strings. The bytes literals are entirely hex escapes, with a maximum of 16 per chunk (line). From the usage you deduced that the first string is encoded with the named encoding and the second with utf-8. (For anyone wondering, a separate utf-8 string is needed for each encoding because each of the other encodings is limited to a different subset of unicode chars.)
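To make that description concrete, here is a minimal sketch of the layout as I read it. The dict name `teststring`, the single gb2312 entry, and both helper functions are made up for illustration; they are not taken from the actual file or from whatever program originally formatted it.

```python
# Hypothetical miniature of the layout described above. The name
# `teststring` and its single entry are assumptions for illustration.
teststring = {
    'gb2312': (
        b"\xd6\xd0\xce\xc4",            # text encoded with the named encoding
        b"\xe4\xb8\xad\xe6\x96\x87",    # the same text encoded with utf-8
    ),
}


def check_pairs(table):
    """Return the names whose two byte strings decode to different text."""
    bad = []
    for name, (native, utf8) in table.items():
        if native.decode(name) != utf8.decode('utf-8'):
            bad.append(name)
    return bad


def format_bytes(data, per_line=16):
    """Render bytes as hex-escape literals, at most per_line escapes per line."""
    lines = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append('b"' + ''.join('\\x%02x' % b for b in chunk) + '"')
    return lines
```

Something like `format_bytes` is probably all the original formatting program did, so recreating it to add an entry should not be hard.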

So I am not completely convinced that pulling the file apart is a clear win. Another entry could be added (the file is formatted with that possibility in mind), but it would certainly be much easier if the original formatting program were available. I do have a couple of questions.

1. Did one of us create the test strings (if so, how), or do they come from an authoritative source (like the unicode site) that created and checked them with its reference implementations? If the latter, the missing pair *is* a puzzle. And in that case, is there any possibility that we would need to get new test strings from that source? Or are the limitations of these codings definitely fixed?

2. If you create a test file for the hz codec with the hz codec itself, how do we know it is correct? It would only serve to detect changes in the future.
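The concern in question 2 can be sketched as follows. The helper name `round_trip_ok` and the sample text are assumptions for illustration. The hand-derived bytes follow my reading of the HZ format: the gb2312 bytes with the high bit cleared, wrapped in `~{` and `~}`.

```python
def round_trip_ok(text, codec):
    """Encode and decode with the same codec; this only shows self-consistency."""
    return text.encode(codec).decode(codec) == text


# Sample text chosen for illustration: two characters that gb2312 can represent.
sample = b"\xe4\xb8\xad\xe6\x96\x87".decode('utf-8')

# A round trip passes even if encoder and decoder share the same bug.
self_consistent = round_trip_ok(sample, 'hz')

# An independent check needs expected bytes derived some other way, here
# worked out by hand from the gb2312 bytes d6 d0 ce c4 (an assumption of
# this sketch, not data from the test file).
hand_derived = b"~{VPND~}"
independent = hand_derived.decode('hz') == sample
```

Only the second kind of check says anything about correctness today; the first can only flag a change in behavior later.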
Date                 User         Action  Args
2011-05-11 17:05:30  terry.reedy  set     recipients: + terry.reedy, lemburg, vstinner, ezio.melotti, cdqzzy
2011-05-11 17:05:30  terry.reedy  set     messageid: <>
2011-05-11 17:05:28  terry.reedy  link    issue12057 messages
2011-05-11 17:05:27  terry.reedy  create