Author ncoghlan
Recipients Alexander.Belopolsky, Arfrever, MrJean1, ajaksu2, barry, benjamin.peterson, inducer, mark.dickinson, meador.inge, ncoghlan, noufal, pitrou, pv, skrah, teoliphant
Date 2012-08-11.14:56:57
Message-id <1344697018.99.0.234608946065.issue3132@psf.upfronthosting.co.za>
In-reply-to
Content
Following up here after rejecting #15622 as invalid.

The "unicode" codes in PEP 3118 need to be seriously rethought before any related changes are made in the struct module.

1. The 'c' and 's' codes are currently used for raw bytes data (represented as bytes objects at the Python layer). This means the 'c' code cannot be used as described in PEP 3118 in a world with strict binary/text separation.

2. Any format codes for UCS1, UCS2 and UCS4 are more usefully modelled on 's' than on 'c' (so that repeat counts create longer strings rather than lists of strings that each contain a single code point).

3. Given some of the other proposals in PEP 3118, it seems more useful to define an embedded text format as "S{<encoding>}".
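
To make points 1 and 2 concrete, here is a small sketch of how the existing 'c' and 's' codes behave in Python 3. It uses only the current struct module; the variable names are purely illustrative.

```python
import struct

# Both 'c' and 's' work on bytes objects, never on str.
# A repeat count on 'c' yields one length-1 bytes object per item...
chars = struct.unpack("3c", b"abc")

# ...while a count on 's' yields a single bytes object of that length.
(string,) = struct.unpack("3s", b"abc")

# Text is rejected outright, which is why 'c' cannot double as a
# "character" code under strict binary/text separation.
try:
    struct.pack("3s", "abc")
    text_rejected = False
except (struct.error, TypeError):
    text_rejected = True

print(chars, string, text_rejected)
```

The tuple of single-byte objects from '3c' is the shape that a 'c'-style text code would produce for code points, which is the case point 2 argues against.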

Under that scheme, UCS1 would be "S{latin-1}", UCS2 would be approximated as "S{utf-16}", and UCS4 would be "S{utf-32}". Arbitrary encodings would also be supported. struct packing would implicitly encode from text to bytes, while unpacking would implicitly decode bytes to text. As with 's', a length mismatch in the encoded form would be an error.
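
A rough sketch of the intended semantics, emulated on top of today's 's' code. The struct module has no "S{...}" code, so pack_text and unpack_text are hypothetical helpers invented for illustration. The sketch also assumes that the size counts encoded bytes, and it uses explicit byte orders such as "utf-16-le" so that no BOM is added.

```python
import struct

def pack_text(encoding, size, text):
    """Hypothetical stand-in for packing with 'S{<encoding>}'.

    Assumption: `size` is the length of the encoded form in bytes.
    """
    data = text.encode(encoding)          # implicit text -> bytes
    if len(data) != size:
        # Length mismatch in the encoded form is treated as an error.
        raise struct.error(
            "encoded text is %d bytes, expected %d" % (len(data), size)
        )
    return struct.pack("%ds" % size, data)

def unpack_text(encoding, size, buffer):
    """Hypothetical stand-in for unpacking with 'S{<encoding>}'."""
    (data,) = struct.unpack("%ds" % size, buffer)
    return data.decode(encoding)          # implicit bytes -> text

latin = pack_text("latin-1", 4, "caf\u00e9")
wide = pack_text("utf-16-le", 4, "hi")
round_trip = unpack_text("utf-16-le", 4, wide)
```

With these helpers, a repeat-style count produces one longer string rather than a sequence of single code points, matching the 's'-like model argued for in point 2.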
History
Date User Action Args
2012-08-11 14:56:59  ncoghlan  set  recipients: + ncoghlan, barry, teoliphant, mark.dickinson, pitrou, inducer, ajaksu2, MrJean1, benjamin.peterson, pv, Arfrever, noufal, skrah, meador.inge, Alexander.Belopolsky
2012-08-11 14:56:58  ncoghlan  set  messageid: <1344697018.99.0.234608946065.issue3132@psf.upfronthosting.co.za>
2012-08-11 14:56:58  ncoghlan  link  issue3132 messages
2012-08-11 14:56:57  ncoghlan  create