Message 327702 - Python tracker


Author	ausaki
Recipients	ausaki, xiang.zhang, xtreak
Date	2018-10-14.11:12:29
Message-id	<CAJiYAJY9JzfxuXdVjEHDwQUJsUPE9P1Rcj+8dhHofQ4xMGVmuw@mail.gmail.com>
In-reply-to	<1539508244.82.0.788709270274.issue34979@psf.upfronthosting.co.za>

Content

I think these two issues are the same, and the following is a patch
I wrote. I hope it helps.

```
diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index 1af27bf..ba6fb3a 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -617,32 +617,21 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
         if (!check_coding_spec(line, strlen(line), tok, fp_setreadl)) {
             return error_ret(tok);
         }
-    }
-#ifndef PGEN
-    /* The default encoding is UTF-8, so make sure we don't have any
-       non-UTF-8 sequences in it. */
-    if (line && !tok->encoding) {
-        unsigned char *c;
-        int length;
-        printf("[DEBUG] - [decoding_fgets]: line = %s\n", line);
-        for (c = (unsigned char *)line; *c; c += length)
-            if (!(length = valid_utf8(c))) {
-                badchar = *c;
-                break;
+        if(!tok->encoding){
+            char* cs = new_string("utf-8", 5, tok);
+            int r = fp_setreadl(tok, cs);
+            if (r) {
+                tok->encoding = cs;
+                tok->decoding_state = STATE_NORMAL;
+            } else {
+                PyErr_Format(PyExc_SyntaxError,
+                             "You did not declare the file encoding at the top of the file, "
+                             "and we found that the file is not encoded in UTF-8; "
+                             "see http://python.org/dev/peps/pep-0263/ for details.");
+                PyMem_FREE(cs);
             }
+        }
     }
-    if (badchar) {
-        /* Need to add 1 to the line number, since this line
-           has not been counted, yet.  */
-        PyErr_Format(PyExc_SyntaxError,
-                "Non-UTF-8 code starting with '\\x%.2x' "
-                "in file %U on line %i, "
-                "but no encoding declared; "
-                "see http://python.org/dev/peps/pep-0263/ for details",
-                badchar, tok->filename, tok->lineno + 1);
-        return error_ret(tok);
-    }
-#endif
     return line;
 }
```
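
For readers unfamiliar with the check this patch removes: the deleted loop walks the raw line and stops at the first byte that cannot start a well-formed UTF-8 sequence. The Python sketch below is a simplified rendition of that idea, not the CPython code. `utf8_seq_len` and `first_bad_byte` are hypothetical names, and the sketch only checks the lead/continuation byte structure (it does not reject overlong forms).

```python
from typing import Optional


def utf8_seq_len(buf: bytes, i: int) -> int:
    """Return the length of the UTF-8 sequence starting at buf[i], or 0 if malformed.

    Hypothetical helper: a structural check only (lead byte plus continuation bytes).
    """
    lead = buf[i]
    if lead < 0x80:
        return 1          # single-byte (ASCII) code
    if lead < 0xC0:
        return 0          # a continuation byte cannot start a sequence
    if lead < 0xE0:
        expected = 1      # two-byte sequence
    elif lead < 0xF0:
        expected = 2      # three-byte sequence
    elif lead < 0xF8:
        expected = 3      # four-byte sequence
    else:
        return 0          # not a valid lead byte
    tail = buf[i + 1 : i + 1 + expected]
    # A sequence cut short by the end of the buffer counts as malformed.
    if len(tail) < expected or any(b < 0x80 or b >= 0xC0 for b in tail):
        return 0
    return expected + 1


def first_bad_byte(buf: bytes) -> Optional[int]:
    """Return the first byte that does not start a well-formed sequence, or None."""
    i = 0
    while i < len(buf):
        n = utf8_seq_len(buf, i)
        if n == 0:
            return buf[i]
        i += n
    return None


whole = "测试".encode("utf-8")      # e6 b5 8b e8 af 95
cut = whole[:4]                     # second character truncated to its lead byte
```

A character cut after its first byte, for example 试 (`e8 af 95`) truncated to `e8`, is reported as a bad byte even though the file as a whole is valid UTF-8.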

By the way, my platform is macOS Mojave (version 10.14).

On Sun, Oct 14, 2018 at 5:10 PM, Karthikeyan Singaravelan <report@bugs.python.org> wrote:

>
> Karthikeyan Singaravelan <tir.karthi@gmail.com> added the comment:
>
> Got it. Thanks for the details and patience. I tested with fewer
> characters and it seems to work fine, so using the encoding at the top is
> not a good way to test the original issue, as you mentioned. Then I
> searched around and found issue14811, which has a test. This seems to be a
> very similar issue, and there is a patch that detects this scenario and
> raises a SyntaxError saying the line is longer than the internal buffer,
> instead of an encoding-related error. I applied the patch to master and it
> raises an error about the internal buffer length, as expected. But the
> patch was not applied, and it seems Victor had another solution in mind,
> as per msg167154. I tested with the patch as below:
>
> # master
>
> ➜  cpython git:(master) cat ../backups/bpo34979.py
>
> s =
> '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
>
> print("str len : ", len(s))
> print("bytes len : ", len(s.encode('utf-8')))
> ➜  cpython git:(master) ./python.exe ../backups/bpo34979.py
>   File "../backups/bpo34979.py", line 2
> SyntaxError: Non-UTF-8 code starting with '\xe8' in file ../backups/bpo34979.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
>
>
> # Applying the patch file from issue14811
>
> ➜  cpython git:(master) ✗ ./python.exe ../backups/bpo34979.py
>   File "../backups/bpo34979.py", line 2
> SyntaxError: Line 2 of file ../backups/bpo34979.py is longer than the internal buffer (1024)
>
> # Patch on master
>
> diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
> index fc75bae537..48b3ac0ee9 100644
> --- a/Parser/tokenizer.c
> +++ b/Parser/tokenizer.c
> @@ -586,6 +586,7 @@ static char *
>  decoding_fgets(char *s, int size, struct tok_state *tok)
>  {
>      char *line = NULL;
> +    size_t len;
>      int badchar = 0;
>      for (;;) {
>          if (tok->decoding_state == STATE_NORMAL) {
> @@ -597,6 +598,15 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
>              /* We want a 'raw' read. */
>              line = Py_UniversalNewlineFgets(s, size,
>                                              tok->fp, NULL);
> +            if (line != NULL) {
> +                len = strlen(line);
> +                if (1 < len && line[len-1] != '\n') {
> +                    PyErr_Format(PyExc_SyntaxError,
> +                            "Line %i of file %U is longer than the internal buffer (%i)",
> +                            tok->lineno + 1, tok->filename, size);
> +                    return error_ret(tok);
> +                }
> +            }
>              break;
>          } else {
>              /* We have not yet determined the encoding.
>
>
> If it's the same issue, then I think closing this issue and continuing the
> discussion there would be best, since that issue has a patch with a test
> and relevant discussion. Also, it seems BUFSIZ is platform-dependent, so
> adding your platform details would also help.
>
> TIL about the difference between Python 2 and 3 in handling
> Unicode-related files. Thanks again!
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue34979>
> _______________________________________
>
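
The behaviour discussed in the quoted message can be reproduced outside the tokenizer. The sketch below assumes the buffer size of 1024 shown in the error message, assumes the reader returns at most `size - 1` bytes per call like C `fgets`, and uses a 200-repetition string as a stand-in for the test file. `read_chunks`, `is_truncated`, and `decodes` are hypothetical helper names. It suggests that the first chunk of the long line ends in the middle of a character, so it fails a UTF-8 check on its own, and that the `line[len-1] != '\n'` test from the issue14811 patch flags that same chunk.

```python
import io

BUF_SIZE = 1024  # assumption: matches "internal buffer (1024)" in the quoted output


def read_chunks(data: bytes, size: int):
    """Mimic fgets(buf, size, fp): at most size - 1 bytes, stopping after a newline."""
    fp = io.BytesIO(data)
    while True:
        chunk = fp.readline(size - 1)
        if not chunk:
            break
        yield chunk


def is_truncated(chunk: bytes) -> bool:
    """Mirror the patch's test: 1 < len && line[len-1] != '\\n'."""
    return len(chunk) > 1 and not chunk.endswith(b"\n")


def decodes(chunk: bytes) -> bool:
    """Report whether the chunk is valid UTF-8 when looked at in isolation."""
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Stand-in for the test file: an empty first line, then one long line.
source = ("\n" + "s = '" + "测试" * 200 + "'\n").encode("utf-8")
chunks = list(read_chunks(source, BUF_SIZE))

for number, chunk in enumerate(chunks, start=1):
    print(number, len(chunk), "truncated" if is_truncated(chunk) else "complete",
          "valid" if decodes(chunk) else "invalid when decoded alone")
```

The whole file still decodes cleanly once the chunks are joined, which is why an encoding error is misleading here: the data is fine, and only the chunk boundary is in the wrong place.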

History
Date	User	Action	Args
2018-10-14 11:12:29	ausaki	set	recipients: + ausaki, xiang.zhang, xtreak
2018-10-14 11:12:29	ausaki	link	issue34979 messages
2018-10-14 11:12:29	ausaki	create