
Author xtreak
Recipients ausaki, xiang.zhang, xtreak
Date 2018-10-14.09:10:44
Message-id <1539508244.82.0.788709270274.issue34979@psf.upfronthosting.co.za>
Content
Got it. Thanks for the details and your patience. I tested with fewer characters and it works fine, so, as you mentioned, adding an encoding declaration at the top is not a good way to test the original issue. I then searched around and found issue14811, which includes a test. It looks like a very similar issue, and it has a patch that detects this scenario and raises a SyntaxError saying that the line is longer than the internal buffer, instead of an encoding-related error. I applied the patch to master and, as expected, it raises an error about the internal buffer length. However, the patch was not merged, and it seems Victor had another solution in mind, as per msg167154. I tested with the patch as shown below:

# master

➜  cpython git:(master) cat ../backups/bpo34979.py

s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'

print("str len : ", len(s))
print("bytes len : ", len(s.encode('utf-8')))
➜  cpython git:(master) ./python.exe ../backups/bpo34979.py
  File "../backups/bpo34979.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xe8' in file ../backups/bpo34979.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details


# Applying the patch file from issue14811

➜  cpython git:(master) ✗ ./python.exe ../backups/bpo34979.py
  File "../backups/bpo34979.py", line 2
SyntaxError: Line 2 of file ../backups/bpo34979.py is longer than the internal buffer (1024)
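
To illustrate why a line longer than the buffer can surface as an encoding error, here is a rough Python sketch. It is not the tokenizer code; it only mimics a fixed-size raw read. The 1024 value comes from the patched error message above, and the names `BUF_SIZE`, `chunk`, and `decode_error` are made up for this example.

```python
# Illustrative only: mimics a fixed-size raw read, not the actual tokenizer code.
BUF_SIZE = 1024  # size reported by the patched error message; BUFSIZ may differ per platform

# One source line, similar to the test file, encoded as UTF-8.
line = ("s = '" + "测试" * 200 + "'\n").encode("utf-8")

# An fgets-style read into a buffer of BUF_SIZE stores at most BUF_SIZE - 1 bytes.
chunk = line[:BUF_SIZE - 1]


def decode_error(data):
    """Return the UnicodeDecodeError raised when decoding data as UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


full_err = decode_error(line)    # the whole line is valid UTF-8
chunk_err = decode_error(chunk)  # the truncated chunk ends in the middle of a character

print("line bytes:", len(line), "chunk bytes:", len(chunk))
print("full line error:", full_err)
print("chunk error:", chunk_err)
```

With these lengths the chunk ends in a lone 0xe8 byte, the first byte of '试', which matches the byte named in the error on master. That the tokenizer hits exactly this split is my assumption, not something I verified in the C code.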

# Patch on master

diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index fc75bae537..48b3ac0ee9 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -586,6 +586,7 @@ static char *
 decoding_fgets(char *s, int size, struct tok_state *tok)
 {
     char *line = NULL;
+    size_t len;
     int badchar = 0;
     for (;;) {
         if (tok->decoding_state == STATE_NORMAL) {
@@ -597,6 +598,15 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
             /* We want a 'raw' read. */
             line = Py_UniversalNewlineFgets(s, size,
                                             tok->fp, NULL);
+            if (line != NULL) {
+                len = strlen(line);
+                if (1 < len && line[len-1] != '\n') {
+                    PyErr_Format(PyExc_SyntaxError,
+                            "Line %i of file %U is longer than the internal buffer (%i)",
+                                tok->lineno + 1, tok->filename, size);
+                    return error_ret(tok);
+                }
+            }
             break;
         } else {
             /* We have not yet determined the encoding.


If it's the same issue, then I think it would be best to close this one and continue the discussion there, since that issue already has a patch with a test and the relevant discussion. Also, BUFSIZ appears to be platform dependent, so adding your platform details would help.
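
Since the buffer size may vary between platforms, a small helper like the one below could generate a reproduction file for any assumed size. This is only a sketch: `write_repro`, the file name, and the 1024 default used here are hypothetical choices for illustration, and the real buffer size on a given platform would need to be checked separately.

```python
import os
import tempfile


def write_repro(path, buf_size):
    """Write a script whose line 2 is longer than buf_size bytes in UTF-8."""
    n_pairs = buf_size // 6 + 1  # each "测试" pair is 6 bytes in UTF-8
    source = "\ns = '" + "测试" * n_pairs + "'\nprint(len(s))\n"
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(source)
    return source


# Assumed buffer size of 1024; substitute the value for the platform under test.
tmp_dir = tempfile.mkdtemp()
repro_path = os.path.join(tmp_dir, "long_line_repro.py")
source = write_repro(repro_path, 1024)

# Check the byte length of line 2 as it was actually written to disk.
with open(repro_path, "rb") as f:
    second_line = f.read().split(b"\n")[1]
print(repro_path, "line 2 is", len(second_line), "bytes")
```

Running the affected interpreter on the generated file should then show whether the error appears for that size.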

TIL about the difference between Python 2 and 3 in handling source files that contain Unicode. Thanks again!
History
Date                 User    Action  Args
2018-10-14 09:10:44  xtreak  set     recipients: + xtreak, xiang.zhang, ausaki
2018-10-14 09:10:44  xtreak  set     messageid: <1539508244.82.0.788709270274.issue34979@psf.upfronthosting.co.za>
2018-10-14 09:10:44  xtreak  link    issue34979 messages
2018-10-14 09:10:44  xtreak  create