classification
Title: NUL bytes in commented lines
Type: behavior
Stage: needs patch
Components: Interpreter Core
Versions: Python 3.3, Python 3.4, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, alex, arigo, benjamin.peterson, georg.brandl, ita1024, jwilk, serhiy.storchaka, terry.reedy
Priority: low
Keywords:

Created on 2014-01-03 17:59 by arigo, last changed 2014-05-12 09:24 by jwilk.

Messages (11)
msg207232 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2014-01-03 17:59
This is probably the smallest example of a .py file that behaves differently in CPython vs PyPy, and for once, I'd argue that the CPython behavior is unexpected:

   # make the file:
   >>> open('x.py', 'wb').write(b'#\x00\na')

   # run it:
   python x.py

Expected: either some SyntaxError, or "NameError: global name 'a' is not defined".  Got: nothing.  It seems that CPython completely ignores the line that is immediately after a line with a '#' and a following '\x00'.
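For anyone who wants to try this without leaving stray files around, here is a minimal self-contained sketch of the same reproduction. It assumes a Python 3 interpreter recent enough to have `subprocess.run`; the helper name `run_nul_comment_script` is made up for illustration. The outcome depends on the interpreter and version, so no particular result is claimed here.

```python
import os
import subprocess
import sys
import tempfile


def run_nul_comment_script():
    """Write b'#\\x00\\na' to a temporary x.py and run it with this interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "x.py")
        with open(path, "wb") as f:
            # Same three-part payload as above: '#', a NUL byte, newline, then 'a'.
            f.write(b"#\x00\na")
        proc = subprocess.run(
            [sys.executable, path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    return proc.returncode, proc.stderr.decode("utf-8", "replace")


if __name__ == "__main__":
    code, err = run_nul_comment_script()
    print("exit status:", code)
    print(err or "(no error output)")
```

An exit status of 0 with no error output corresponds to the "Got: nothing" case described above; a traceback on stderr means the interpreter rejected the file or actually evaluated `a`.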
msg207282 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-04 13:18
Indeed.  The CPython parser reads the first line, '#\x00\n', and saves it in a buffer. But because C strings are used here (the result of decode_str()), the line is truncated to '#'. Since this data does not end with '\n', it is considered incomplete, and the next line is read and appended: '#' + 'a' -> '#a'. The whole line is now commented out.
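The line-joining effect described here can be modeled in a few lines of Python. This is only a toy simulation of the described behavior, not the tokenizer's actual code; `c_string_view` and `simulate_tokenizer_lines` are hypothetical names introduced for the sketch.

```python
def c_string_view(line):
    """Return the part of `line` that C string functions see: bytes before the first NUL."""
    return line.split(b"\x00", 1)[0]


def simulate_tokenizer_lines(data):
    """Toy model: keep appending physical lines until the buffer ends with a newline."""
    logical_lines = []
    buffer = b""
    for raw in data.splitlines(True):  # True keeps the line endings
        buffer += c_string_view(raw)
        if buffer.endswith(b"\n"):
            logical_lines.append(buffer)
            buffer = b""
    if buffer:
        # Last line of the file had no trailing newline.
        logical_lines.append(buffer)
    return logical_lines


print(simulate_tokenizer_lines(b"#\x00\na"))      # [b'#a']
print(simulate_tokenizer_lines(b"# ok\na = 1\n"))  # [b'# ok\n', b'a = 1\n']
```

In the first call the newline that follows the NUL is lost along with everything else after the truncation point, so 'a' ends up glued onto the comment.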
msg207290 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2014-01-04 15:35
I guess NULL bytes should just be banned.
msg207358 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2014-01-05 07:22
Fwiw, both exec and eval() ban NUL bytes, which means that there is a strange case in which some files can be imported, but not loaded and exec'ed.  So I agree with Benjamin.
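A quick way to check the exec/eval side on a given interpreter is to pass the same source to `compile()`. This is a sketch, not a statement of what any particular version does in detail: the exception type raised for NUL bytes has differed between CPython versions, so the hypothetical helper `compile_fails` catches several.

```python
def compile_fails(source):
    """Return True if compile() refuses `source` for any of the listed reasons."""
    try:
        compile(source, "<string>", "exec")
    except (TypeError, ValueError, SyntaxError):
        return True
    return False


print(compile_fails("#\x00\na"))                  # source with a NUL inside a comment
print(compile_fails("# plain comment\na = 1\n"))  # ordinary source, for comparison
```

Note that the helper cannot tell a NUL-byte rejection from any other syntax error; it only shows whether the string form of the source is accepted at all.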
msg207872 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-01-10 18:22
Python should have a uniform definition of 'Python source' in both the doc and in practice in all source code processing functions. Currently, "2. Lexical analysis" in the Language Manual just says "Python reads program text as Unicode code points; the encoding of a source file can be given by an encoding declaration and defaults to UTF-8." UTF-8 encodes code point U+0000 as a null byte and this code point is nowhere excluded in the doc. (The definition of string literals uses 'source character' without any additional specification, so I take it to mean 'Unicode code point'.)
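The encoding point is easy to confirm in Python 3, where `str` holds code points. The sketch below only shows how U+0000 round-trips through two codecs; it says nothing about what the parser does with the result.

```python
nul = "\x00"  # the single code point U+0000

utf8 = nul.encode("utf-8")
latin1 = nul.encode("iso-8859-1")

# Both codecs map U+0000 to a single zero byte, and both accept it when decoding.
print(utf8, latin1)
print(b"#\x00\n".decode("utf-8") == "#\x00\n")
print(b"#\x00\n".decode("iso-8859-1") == "#\x00\n")
```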

If U+0000 is a legal 'source character', it, as with other control chars not given special meaning, should be a SyntaxError unless occurring in a comment or string literal. Eval and exec exclude even the latter with 
TypeError: source code string cannot contain null bytes
If null bytes are legal, this is wrong.

Simply truncating lines as done by the CPython parser is wrong whether or not U+0000 is legal.

The simplest change would be to change the parser to match exec and add " other than U+0000" after "Unicode code points" in the sentence quoted above.
msg207873 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-01-10 18:23
Armin, what is the different behavior of PyPy?

We should perhaps get Guido's opinion on this issue.
msg207879 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2014-01-10 20:14
PyPy 2.x accepts null characters in all of import, exec and eval, and complains if they occur outside a comment.

PyPy 3.x refuses them in import, which is where this bug report originally comes from (someone complained that CPython 3.x "accepts" them but not PyPy 3.x, even though this complaint doesn't really make sense as CPython just gets very confused by them).  I don't know about exec and eval.

We need a consistent decision for 3.5.  I suppose it's not really worth backporting it to CPython 2.7, 3.3 or 3.4, but it's your choice.  PyPy will just follow the lead (or keep its current behavior for 2.x if CPython 2.x is not modified).
msg207939 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2014-01-12 09:28
I'm in favor of PyPy's behavior: null bytes anywhere in the source, even in comments, usually mean there's something weird or fishy going on with either the editor or (if downloaded/copied) the source of the code.
msg208086 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-14 09:12
I'll try, but I'm not sure this is possible. Some of the C functions used (e.g. fgets()) return a char* and do not work with strings containing null bytes. Some public APIs (e.g. PyParser_SimpleParseString()) work with null-terminated C strings.
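The char* limitation can be seen from Python with `ctypes`, without touching the parser. This is just an illustration of C-string semantics on a buffer holding the bytes from the original report, not of any tokenizer code path.

```python
import ctypes

data = b"#\x00\na"

# create_string_buffer copies the bytes and appends a terminating NUL.
buf = ctypes.create_string_buffer(data)

print(buf.raw)    # every byte in the buffer, embedded NULs included
print(buf.value)  # the same buffer read as a NUL-terminated C string
```

Anything that treats the buffer as a C string sees only what `.value` shows, which is why the length of the data would have to be carried separately to handle embedded NULs.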
msg208087 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-14 09:14
See also issue13617.
msg218239 - (view) Author: (ita1024) Date: 2014-05-10 22:42
Do not touch that please!!!!

The null bytes are already rejected when forbidden by the encoding (utf-8 for example).

Null byte characters in comments are perfectly valid in ISO8859-1 encoding, and a few scripts depend on them:
http://ftp.waf.io/pub/release/waf-1.7.16

Parsing the commented lines is also likely to slow down the parser, so keep your hands off it please! There are too many regressions already! http://bugs.python.org/issue21086
History
Date                 User               Action  Args
2014-05-12 09:24:52  jwilk              set     nosy: + jwilk
2014-05-11 11:43:18  Arfrever           set     nosy: + Arfrever
2014-05-10 22:46:06  alex               set     nosy: + alex
2014-05-10 22:42:22  ita1024            set     nosy: + ita1024; messages: + msg218239
2014-01-14 09:14:36  serhiy.storchaka   set     messages: + msg208087
2014-01-14 09:12:21  serhiy.storchaka   set     messages: + msg208086
2014-01-12 09:28:36  georg.brandl       set     nosy: + georg.brandl; messages: + msg207939
2014-01-10 20:14:13  arigo              set     messages: + msg207879
2014-01-10 18:23:40  terry.reedy        set     messages: + msg207873
2014-01-10 18:22:35  terry.reedy        set     nosy: + terry.reedy; messages: + msg207872
2014-01-05 07:22:34  arigo              set     messages: + msg207358
2014-01-04 15:35:38  benjamin.peterson  set     messages: + msg207290
2014-01-04 13:18:07  serhiy.storchaka   set     nosy: + serhiy.storchaka; messages: + msg207282
2014-01-04 11:49:18  pitrou             set     stage: needs patch; type: compile error -> behavior; versions: + Python 3.3, Python 3.4
2014-01-03 18:02:46  vstinner           set     nosy: + benjamin.peterson
2014-01-03 17:59:13  arigo              create