Message 323996 - Python tracker


Author	Michael.Felt
Recipients	Michael.Felt, lukasz.langa, vstinner
Date	2018-08-24.10:28:43
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<9e38ed86-5f28-efc0-8916-7f1d8479db16@felt.demon.nl>
In-reply-to	<9a2bf6f5-2270-2900-3955-d53557cc140f@felt.demon.nl>

Content

On 23/08/2018 19:14, Michael Felt wrote:
> Michael Felt <aixtools@felt.demon.nl> added the comment:
>
> On 23/08/2018 12:51, STINNER Victor wrote:
>> STINNER Victor <vstinner@redhat.com> added the comment:
>>
>> Your issue is about decoding command line argument which is done from main() function. It doesn't use Python codecs, but functions like Py_DecodeLocale().
> This is beyond my understanding atm.
> Early on I tried making the expected just be 'arg' and went from
> situation A to situation B - which looked much closer, BUT, the 'types'
> differed:
>
> Situation A (original)
> AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
> - ['h\xc3\xa9\xe2\x82\xac']
> + ['h\udcc3\udca9\udce2\udc82\udcac']
>  : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
>
> I tried saying the "expected" is arg, but arg is still a bytes object, while the cmd_line result is a str (and is printed as such).
>
> Situation B
> AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
> - ['h\xc3\xa9\xe2\x82\xac']
> + [b'h\xc3\xa9\xe2\x82\xac']
> ?  +
>  : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
>
> After further digging - to understand why it was coming out with "\x" escapes rather than "\udc" escapes
>
> I looked at what was happening here:
>
> out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
> becomes
> out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)
> becomes
> out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)
>
> And finally, at the CLI becomes:
> ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
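For comparison, here is a minimal sketch of handing the same bytes argument to a child interpreter as an argument list, with no shell in between. It assumes a POSIX platform, uses `sys.executable` instead of the build path quoted above, and prints only `ascii(sys.argv[1:])`. It is not the test suite's own `get_output()` helper.

```python
import subprocess
import sys

# Child code: show how the interpreter decoded its own command line.
code = "import sys; print(ascii(sys.argv[1:]))"

# Same argument as in the test: the UTF-8 bytes for 'h', e-acute, euro sign.
arg = 'h\xe9\u20ac'.encode('utf-8')  # b'h\xc3\xa9\xe2\x82\xac'

# With a list and no shell, the raw bytes reach the child as argv[1];
# '-X utf8' asks the child to decode them as UTF-8.
proc = subprocess.run(
    [sys.executable, '-X', 'utf8', '-c', code, arg],
    stdout=subprocess.PIPE,
    check=True,
)
out = proc.stdout.decode('ascii').strip()
print(out)
```

Unlike the hand-typed commands, there is no `b` prefix in the child's argv here: the prefix is Python source syntax and never leaves the parent process.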
>
> /data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
> UTF-8:['bh\\xc3\\xa9\\xe2\\x82\\xac']
>
> /data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
> ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']
>
> Note:
> /data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', 'h\udcc3\udca9\udce2\udc82\udcac'
> ISO8859-1:['h\\udcc3\\udca9\\udce2\\udc82\\udcac']
>
> /data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\udcc3\udca9\udce2\udc82\udcac'
> ISO8859-1:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']
>
> root@x066:[/data/prj/python/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (>
> UTF-8:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']
>
> Summary:
> a) concerned about how b'h....' becomes 'bh....'
> b) whatever argv[1] is, is very close to what is returned - so whatever happens during the transformation from
> self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
>  determines the output and the (failed) comparison.
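The two spellings in Situation A can be reproduced in pure Python, which may help show where "\x" versus "\udc" comes from. This is only a sketch: it uses the Python codecs as a stand-in, and whether `Py_DecodeLocale()` behaves like one decode or the other on a given platform is an assumption to check, not something shown here.

```python
# The raw bytes the child process receives as argv[1].
raw = 'h\xe9\u20ac'.encode('utf-8')  # b'h\xc3\xa9\xe2\x82\xac'

# ASCII with surrogateescape: each byte >= 0x80 becomes U+DC00 + byte.
as_surrogates = raw.decode('ascii', 'surrogateescape')

# ISO8859-1 (latin-1): each byte becomes the code point of the same value.
as_latin1 = raw.decode('latin-1')

# UTF-8: the bytes are reassembled into the original two characters.
as_utf8 = raw.decode('utf-8')

print(ascii([as_surrogates]))  # the "\udc" form from the AssertionError
print(ascii([as_latin1]))      # the "\x" form from the AssertionError
print(ascii([as_utf8]))
```

Both failing sides of the comparison are therefore the same five bytes, decoded under two different assumptions about the locale encoding.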
p.s. also tried:
michael@x071:[/data/prj/python/git/python3-3.8]/data/prj/python/python3-3.8/python
'-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys;
print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))',
'h\xe9\u20ac'.encode\('utf-8'\)
ISO8859-1:['h\\xe9\\u20ac.encode(utf-8)']
michael@x071:[/data/prj/python/git/python3-3.8]/data/prj/python/python3-3.8/python
'-X' 'faulthandler' '-X' 'utf8=1' '-c' 'import locale, sys;
print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))',
'h\xe9\u20ac'.encode\('utf-8'\)
UTF-8:['h\\xe9\\u20ac.encode(utf-8)']

Really unclear to me what this test is trying to verify. The CLI seems
to just 'echo' what it is provided.
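The 'bh....' results and the "echo" effect look consistent with ordinary shell quoting rather than with anything the interpreter does. Below is a small sketch using `shlex.split()` as an approximation of POSIX shell word splitting. That the login shell on these hosts follows the same rules is an assumption.

```python
import shlex

# The argument exactly as typed at the prompt: an unquoted b, then a
# single-quoted string. The shell joins them into one word; inside
# single quotes a backslash is an ordinary character.
typed = r"b'h\xc3\xa9\xe2\x82\xac'"
words = shlex.split(typed)
print(ascii(words))

# The p.s. variant: quotes are removed and \( \) become plain parentheses.
ps_typed = r"'h\xe9\u20ac'.encode\('utf-8'\)"
ps_words = shlex.split(ps_typed)
print(ascii(ps_words))
```

In both cases Python receives plain ASCII text containing literal backslashes, so no non-ASCII byte ever reaches argv and the utf8 setting has nothing to change.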
>>> Question 1: why is Windows excluded? Because it does not use UTF-8 as its default (its default is CP1252)
>> Windows uses wmain() which gets command line arguments as wchar_t* strings: Unicode. No decoding is needed.
>>
>> ----------
>>
>> _______________________________________
>> Python tracker <report@bugs.python.org>
>> <https://bugs.python.org/issue34347>
>> _______________________________________
>>

History
Date	User	Action	Args
2018-08-24 10:28:43	Michael.Felt	set	recipients: + Michael.Felt, vstinner, lukasz.langa
2018-08-24 10:28:43	Michael.Felt	link	issue34347 messages
2018-08-24 10:28:43	Michael.Felt	create