Message 207872 - Python tracker

Author	terry.reedy
Recipients	arigo, benjamin.peterson, serhiy.storchaka, terry.reedy
Date	2014-01-10.18:22:34
Message-id	<1389378155.4.0.678381113068.issue20115@psf.upfronthosting.co.za>
In-reply-to

Content
Python should have a uniform definition of 'Python source' in both the doc and in practice in all source code processing functions. Currently, "2. Lexical analysis" in the Language Manual just says "Python reads program text as Unicode code points; the encoding of a source file can be given by an encoding declaration and defaults to UTF-8." UTF-8 encodes code point U+0000 as a null byte and this code point is nowhere excluded in the doc. (The definition of string literals uses 'source character' without any additional specification, so I take it to mean 'Unicode code point'.)
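
The encoding claim is easy to check. A minimal sketch (the variable names and the sample source line are only illustrative):

```python
# U+0000 is a valid Unicode code point, and UTF-8 encodes it as one zero byte.
nul = "\u0000"
encoded = nul.encode("utf-8")
print(encoded)

# So a UTF-8 source file containing U+0000, even inside a comment,
# contains a null byte on disk.
source_bytes = "x = 1  # trailing\u0000comment\n".encode("utf-8")
print(b"\x00" in source_bytes)
```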

If U+0000 is a legal 'source character', then, like other control characters not given special meaning, it should raise a SyntaxError unless it occurs in a comment or string literal. Eval and exec reject it even there, with
TypeError: source code string cannot contain null bytes
If null bytes are legal, this is wrong.
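
A rough way to observe the exec behaviour described here. This is a sketch, not a statement of what any particular release does: the helper name try_exec is made up for illustration, and the exception type is an assumption that varies by version (TypeError at the time of this message; I believe later releases raise ValueError or SyntaxError instead), so the helper catches all three.

```python
def try_exec(source):
    """Return the name of the exception exec() raises for source, or None."""
    try:
        exec(source, {})
    except (TypeError, ValueError, SyntaxError) as exc:
        # The exact type depends on the Python version (assumption, see above).
        return type(exc).__name__
    return None


samples = {
    "no null": "x = 1\n",
    "null in string literal": "x = 'a\0b'\n",
    "null in comment": "x = 1  # a\0b\n",
}

for label, source in samples.items():
    print(label, "->", try_exec(source))
```

Note that the escape "\0" is expanded when the outer literal is evaluated, so the source strings passed to exec contain an actual U+0000 character rather than a backslash escape.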

Simply truncating lines, as the CPython parser does, is wrong whether or not U+0000 is legal.

The simplest change would be to change the parser to match exec and add " other than U+0000" after "Unicode code points" in the sentence quoted above.

History
Date	User	Action	Args
2014-01-10 18:22:35	terry.reedy	set	recipients: + terry.reedy, arigo, benjamin.peterson, serhiy.storchaka
2014-01-10 18:22:35	terry.reedy	set	messageid: <1389378155.4.0.678381113068.issue20115@psf.upfronthosting.co.za>
2014-01-10 18:22:35	terry.reedy	link	issue20115 messages
2014-01-10 18:22:34	terry.reedy	create