Author terry.reedy
Recipients arigo, benjamin.peterson, serhiy.storchaka, terry.reedy
Date 2014-01-10.18:22:34
Message-id <1389378155.4.0.678381113068.issue20115@psf.upfronthosting.co.za>
In-reply-to
Content
Python should have a uniform definition of 'Python source' in both the doc and in practice in all source code processing functions. Currently, "2. Lexical analysis" in the Language Manual just says "Python reads program text as Unicode code points; the encoding of a source file can be given by an encoding declaration and defaults to UTF-8." UTF-8 encodes code point U+0000 as a null byte and this code point is nowhere excluded in the doc. (The definition of string literals uses 'source character' without any additional specification, so I take it to mean 'Unicode code point'.)
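
The encoding claim can be checked directly. This is a minimal sketch, assuming any Python 3 interpreter; the variable names are illustrative only.

```python
# U+0000 is an ordinary Unicode code point; UTF-8 encodes it as one zero byte.
nul = "\u0000"
encoded = nul.encode("utf-8")
print(encoded)        # the single null byte
print(len(encoded))   # one byte long

# Decoding bytes that contain a null byte yields text containing U+0000,
# so a UTF-8 source file with a null byte "reads as" that code point.
decoded = b"x = 1\x00".decode("utf-8")
print("\x00" in decoded)
```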

If U+0000 is a legal 'source character', then, like other control characters not given special meaning, it should raise a SyntaxError unless it occurs in a comment or string literal. Eval and exec reject it even in those two places, with
TypeError: source code string cannot contain null bytes
If null bytes are legal, this is wrong.
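
A rough way to observe this, sketched with the built-in compile() (which eval and exec use for strings). The helper name try_compile is made up for illustration, and the exception type has differed between CPython versions, so the sketch catches several types rather than assuming TypeError.

```python
def try_compile(source):
    """Return the name of the exception compile() raises, or None."""
    try:
        compile(source, "<string>", "exec")
    except (TypeError, ValueError, SyntaxError) as exc:
        return type(exc).__name__
    return None

# Null byte inside a comment: other control characters are tolerated
# here, yet the whole string is rejected.
in_comment = try_compile("x = 1  # note\x00\n")

# Null byte inside a string literal: rejected as well.
in_literal = try_compile("s = 'a\x00b'\n")

# The same code without a null byte compiles without error.
clean = try_compile("x = 1  # note\n")

print(in_comment, in_literal, clean)
```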

Simply truncating lines, as the CPython parser does, is wrong whether or not U+0000 is legal.

The simplest change would be to change the parser to match exec and add " other than U+0000" after "Unicode code points" in the sentence quoted above.
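
To make the proposed rule concrete, here is a sketch of what "reject U+0000 anywhere in source" could look like as a pre-check. The function check_no_nul and its error message are hypothetical and are not part of CPython; the real change would live in the parser.

```python
def check_no_nul(source, filename="<unknown>"):
    """Hypothetical pre-check: raise SyntaxError if source contains U+0000."""
    for lineno, line in enumerate(source.split("\n"), start=1):
        col = line.find("\x00")
        if col != -1:
            # SyntaxError accepts (msg, (filename, lineno, offset, text));
            # offset is 1-based.
            raise SyntaxError(
                "source code cannot contain U+0000",
                (filename, lineno, col + 1, line),
            )
    return source

try:
    check_no_nul("a = 1\nb = 2\x00\n", "example.py")
except SyntaxError as exc:
    err = exc
    print(err.msg, err.filename, err.lineno, err.offset)
```

Reporting the line and column, rather than silently truncating, is the point of the sketch: the null byte is treated like any other illegal character.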
History
Date User Action Args
2014-01-10 18:22:35 terry.reedy set recipients: + terry.reedy, arigo, benjamin.peterson, serhiy.storchaka
2014-01-10 18:22:35 terry.reedy set messageid: <1389378155.4.0.678381113068.issue20115@psf.upfronthosting.co.za>
2014-01-10 18:22:35 terry.reedy link issue20115 messages
2014-01-10 18:22:34 terry.reedy create