Message 327686 - Python tracker



Author	ausaki
Recipients	ausaki
Date	2018-10-14.02:01:37
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1539482499.11.0.788709270274.issue34979@psf.upfronthosting.co.za>
In-reply-to

Content

```
# demo.py
s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
```
The file above is a test case. Its encoding is UTF-8, and the string `s` is 1020 bytes long (3 * 340).
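
For anyone who wants to reproduce this without copying the long line by hand, here is a minimal sketch that builds the same file. The helper name `build_demo_source` is made up for illustration, and it assumes the string is `测试` repeated 170 times, which matches the 340 characters stated above.

```python
from pathlib import Path


def build_demo_source() -> bytes:
    """Return the bytes of demo.py (hypothetical helper, not part of CPython)."""
    body = "测试" * 170  # 340 characters, 3 bytes each in UTF-8
    return ("# demo.py\ns = '" + body + "'\n").encode("utf-8")


if __name__ == "__main__":
    data = build_demo_source()
    Path("demo.py").write_bytes(data)
    # Length in bytes of the second line, including the trailing newline.
    print(len(data.split(b"\n")[1]) + 1)
```

The second line comes out slightly longer than 1023 bytes, which is what matters for the rest of this report.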

When I run `python3 demo.py` in a terminal, Python raises the following error:

```
$ python3 -V
Python 3.6.4

$ python3 demo.py
  File "demo.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xe8' in file demo.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
```

After reading the Python 3.6.6 source code, I found that this error is raised at around line 630 of `cpython/Parser/tokenizer.c`, at the bottom of the function `decoding_fgets`.

When Python executes a source file such as xxx.py, it calls the function `decoding_fgets` to read one line of raw bytes from the file into a buffer whose initial size is 1024 bytes. `decoding_fgets` then uses the function `valid_utf8` to check the encoding of those raw bytes.

If the line is too long to fit (more than 1023 bytes), Python calls `decoding_fgets` multiple times and grows the buffer by 1024 bytes each time. The raw bytes returned by a single `decoding_fgets` call may therefore be incomplete: for example, they may end with only part of a multi-byte character, which causes `valid_utf8` to fail.
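
The effect can be imitated in pure Python. This is only a sketch under two assumptions: that each read returns at most 1023 bytes (a 1024-byte buffer minus the terminating NUL), and that `is_valid_utf8` below, a stand-in I made up, behaves like the C function `valid_utf8` for this input.

```python
CHUNK = 1023  # assumed: 1024-byte buffer minus the terminating NUL

# Line 2 of demo.py as raw bytes.
line = ("s = '" + "测试" * 170 + "'\n").encode("utf-8")


def is_valid_utf8(chunk: bytes) -> bool:
    """Hypothetical stand-in for the C-level valid_utf8 check."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


first = line[:CHUNK]  # what the first read would return

print(len(line))             # 1027
print(is_valid_utf8(line))   # True: the whole line is valid UTF-8
print(is_valid_utf8(first))  # False: the chunk ends inside a character
print(first[-1:])            # b'\xe8'
```

The first chunk holds the 5-byte prefix `s = '` plus 1018 bytes of text, that is 339 complete characters and one stray byte. That stray byte is `\xe8`, the first byte of `试`, which matches the byte reported in the error message.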

I suggest that we always use `fp_readl` to read source code from the file.
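
To illustrate why a reader that decodes the stream is not affected, here is a small sketch using `codecs.getincrementaldecoder`. It is not how `fp_readl` is implemented; it only shows that carrying decoder state across chunk boundaries avoids the failure. The 1023-byte chunk size is the same assumption as before.

```python
import codecs

CHUNK = 1023  # assumed chunk size, as above
line = ("s = '" + "测试" * 170 + "'\n").encode("utf-8")

# An incremental decoder buffers a trailing partial character
# and completes it when the next chunk arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = []
for start in range(0, len(line), CHUNK):
    pieces.append(decoder.decode(line[start:start + CHUNK]))
pieces.append(decoder.decode(b"", final=True))  # flush; raises if bytes are left over

text = "".join(pieces)
print(text == line.decode("utf-8"))  # True
```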

History
Date	User	Action	Args
2018-10-14 02:01:39	ausaki	set	recipients: + ausaki
2018-10-14 02:01:39	ausaki	set	messageid: <1539482499.11.0.788709270274.issue34979@psf.upfronthosting.co.za>
2018-10-14 02:01:38	ausaki	link	issue34979 messages
2018-10-14 02:01:37	ausaki	create