classification
Title: tokenize.open(): open a file with encoding detected from a coding cookie
Type: enhancement Stage: patch review
Components: Library (Lib), Unicode Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, eric.araujo, georg.brandl, pitrou, vstinner
Priority: normal Keywords: patch

Created on 2010-11-06 10:49 by vstinner, last changed 2010-11-09 01:12 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
open_python.patch vstinner, 2010-11-06 10:49 review
tokenize_open-2.patch vstinner, 2010-11-07 09:24
Messages (15)
msg120600 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-06 10:49
In Python3, the following pattern becomes common:

        with open(fullname, 'rb') as fp:
            coding, line = tokenize.detect_encoding(fp.readline)
        with open(fullname, 'r', encoding=coding) as fp:
            ...

This opens the file twice, which is unnecessary: it is possible to reuse the raw buffer to create a text file. And I don't like the detect_encoding() API: passing the readline function is not intuitive.

I propose to create a tokenize.open_python() function with a very simple API: just one argument, the filename. This function calls detect_encoding() and only opens the file once.

The attached patch adds the function with a unit test and a documentation update. It also patches the functions currently using detect_encoding().

open_python() only supports read mode. I suppose that is enough.
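[A rough sketch of the proposed helper, not the attached patch: detect the encoding from the coding cookie, then rewind and wrap the same binary buffer in a text layer, so the file is only opened once. The rewind detail is illustrative.]

```python
# Sketch of the proposed open_python(): reuse the binary buffer that
# detect_encoding() already read from, instead of reopening the file.
import io
from tokenize import detect_encoding

def open_python(filename):
    """Open a Python script in read mode with the detected encoding."""
    buffer = io.open(filename, 'rb')
    encoding, _ = detect_encoding(buffer.readline)
    buffer.seek(0)  # rewind past the lines consumed by detect_encoding()
    text = io.TextIOWrapper(buffer, encoding, line_buffering=True)
    text.mode = 'r'
    return text
```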
msg120616 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-06 14:31
Looks good.

In the test, do you have to remove the TESTFN file manually?  Isn’t that handled by unittest or regrtest?

I think the name of the function could be better but I don’t have a proposal.
msg120628 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-06 15:42
> In the test, do you have to remove the TESTFN file manually?  Isn’t that
> handled by unittest or regrtest?

The test uses TESTFN+'.py', not TESTFN ;-)

> I think the name of the function could be better but I don’t have a
> proposal.

Do you prefer a name like tokenize.open()?
msg120629 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-06 15:44
> The test uses TESTFN+'.py', not TESTFN ;-)
Ah, right.  You can use addCleanup then.

> Do you prefer a name like tokenize.open()?
Hm, tokenize being “Tokenization help for Python programs”, I think this name is rather good.  Or tokenize.open_source.
msg120632 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-06 16:00
+1 on adding the function.

Note that it is useful for opening any text file with an encoding cookie, not only python source code, so "tokenize.open()" sounds attractive.

Once we are at it, I would like to factor out and document code that extracts the cookie from a decoded string.  This functionality is useful for tools that produce annotated source code and want to preserve source encoding in their output.  See issue10329.

Also, to answer Victor's comment on issue10329, msg120601: while using this function simplifies dealing with source files in the trace module, it does not solve all the issues: the source file will still be opened separately in linecache.getlines() and find_executable_linenos(), and find_executable_linenos() still does not respect the __loader__ interface.  See issue 10342.
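[The cookie-extraction logic he suggests factoring out boils down to the PEP 263 regular expression. A simplified sketch — the name extract_cookie is hypothetical, and unlike detect_encoding() this ignores the rule that line 2 only counts when line 1 is blank or a comment:]

```python
import re

# PEP 263 coding-cookie pattern, close to the one tokenize uses
# internally; the cookie must appear on one of the first two lines.
COOKIE_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

def extract_cookie(source):
    """Return the encoding named by a coding cookie in *source*, or None.

    *source* is an already decoded string, as produced by tools that
    annotate source code and want to preserve its encoding on output.
    """
    for line in source.splitlines()[:2]:
        match = COOKIE_RE.match(line)
        if match:
            return match.group(1)
    return None
```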
msg120635 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-06 16:15
> Note that it is useful for opening any text file with an encoding
> cookie, not only python source code,
This is cool.  doctest could benefit from that, I think.

> so "tokenize.open()" sounds attractive.
Agreed, even though the docstring of tokenize does talk about Python code.
msg120637 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-06 16:25
On Sat, Nov 6, 2010 at 12:15 PM, Éric Araujo <report@bugs.python.org> wrote:
..
>> so "tokenize.open()" sounds attractive.
> Agreed, even though the docstring of tokenize does talk about Python code.

I still like "tokenize.open()" more, but for the sake of completeness
of the discussion, what about open(.., encoding="fromcookie")?  (I know,
tokenize.detect_encoding() also looks for a UTF-8 BOM, so
"fromcookie" can be improved.)
msg120638 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-06 16:32
I don’t know if it would be okay to depend on tokenize in io.  Anyway, I tend to prefer tokenize.open over builtins.open(..., encoding='fromcookie').
msg120643 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-11-06 18:00
+1 for the feature.  However, tokenize.open() sounds a bit unspecific.  But I don't really have better suggestions; open_cookie() is wrong too, since it doesn't open cookies :)
msg120661 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-07 09:04
> what about open(.., encoding="fromcookie")?

Please don't do that. It reminds me of the magical str.encode() method of 
Python 2. I prefer a simple API with limited features: if you want to open a 
Python script, use tokenize.open(); if you want to open an XML document, use 
any XML library, etc. It also reminds me of the discussion about detecting BOMs 
in text files: issue #7651. There was no consensus, and so the issue is still 
open.

For this particular issue, I prefer a specific function tokenize.<choose a 
function name>.
msg120662 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-07 09:24
On Saturday 06 November 2010 17:00:15 you wrote:
> Note that it is useful for opening any text file with an encoding cookie,
> not only python source code, so "tokenize.open()" sounds attractive.

Ok, the new patch (tokenize_open-2.patch) uses the tokenize.open() name and adds a 
test for a BOM without a coding cookie (testing the utf-8-sig encoding).
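[A check in that spirit might look like this — a sketch with a hypothetical helper name, not the attached patch:]

```python
# Sketch of the added check: a UTF-8 BOM without a coding cookie
# should make tokenize.open() pick the utf-8-sig encoding.
import os
import tempfile
import tokenize

def check_bom_without_cookie():
    fd, path = tempfile.mkstemp(suffix='.py')
    os.close(fd)
    try:
        # 'utf-8-sig' writes the BOM; no coding cookie follows.
        with open(path, 'w', encoding='utf-8-sig') as fp:
            fp.write('print("hello")\n')
        with tokenize.open(path) as fp:
            assert fp.encoding == 'utf-8-sig'
            assert fp.read() == 'print("hello")\n'  # BOM is stripped
    finally:
        os.unlink(path)
```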
msg120755 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-08 15:28
On Sun, Nov 7, 2010 at 4:24 AM, STINNER Victor <report@bugs.python.org> wrote:
..
> Ok, the new patch (tokenize_open-2.patch) uses tokenize.open() name and adds a
> test for BOM without coding cookie (test utf-8-sig encoding).

Here are my comments on the new patch:

1. A nit-pick: tokenize.py is already inconsistent in this respect,
but I believe it is conventional not to start a docstring with an empty
line:

+def open(filename):
+    """Open a Python script in read mode with the right encoding.

instead of

+def open(filename):
+    """
+    Open a Python script in read mode with the right encoding.

Also, I would prefer a more neutral description that will not imply
that the method is only useful for Python source code.  For example:

"""Detect encoding and open a file in read mode with the right encoding.

(optionally a more detailed explanation or a reference to detect_encoding.)
"""

2. Another nit-pick: detect_encoding() returns a list of 0-2 lines in
the second item, not a single line, so it is better to write

+    encoding, lines = detect_encoding(buffer.readline)

or even

+    encoding, _ = detect_encoding(buffer.readline)

instead of

+    encoding, line = detect_encoding(buffer.readline)

3. In test_open(self), I think you should use self.assertEqual()
instead of a plain assert.

4. Last nit-pick (feel free to ignore as prejudice to %-formatting :-):

instead of

+                print("# coding: %s" % encoding, file=fp)

you can write

+                print("# coding: " + encoding, file=fp)
msg120756 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-08 15:34
Changing the title to make the latest choice of function name more visible.
msg120758 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-08 15:35
1. Agreed on both.  Extraneous blank lines seem to be favored by people who need to work around bugs in their editor.

2. Yes, “_” is recommended for a discarded object.

3. Agreed.

4. Even better: print('# coding:', encoding, file=fp).  No intermediate string!
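[Putting the review points together, the revised test could look roughly like this — a sketch with a stand-in temporary file name, not the committed test:]

```python
# Sketch of the test with the review comments applied: addCleanup()
# for file removal, assertEqual() instead of a bare assert, and
# print() with multiple arguments for the cookie line.
import os
import tokenize
import unittest

TESTFN = 'tokenize_open_testfile'  # stand-in for test.support.TESTFN

class OpenTest(unittest.TestCase):
    def test_open(self):
        filename = TESTFN + '.py'
        self.addCleanup(os.unlink, filename)
        with open(filename, 'w', encoding='utf-8') as fp:
            print('# coding:', 'utf-8', file=fp)
            print('text = "café"', file=fp)
        with tokenize.open(filename) as fp:
            self.assertEqual(fp.encoding, 'utf-8')
            self.assertEqual(fp.mode, 'r')
```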
msg120831 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-09 01:12
Committed to Python 3.2 (r86346 + r86347). Thanks for your reviews.
History
Date User Action Args
2010-11-09 01:12:02vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg120831
2010-11-08 15:35:39eric.araujosetmessages: + msg120758
title: tokenize.open(): open a Python file with the right encoding -> tokenize.open(): open a file with encoding detected from a coding cookie
2010-11-08 15:34:35belopolskysettype: enhancement
title: tokenize.open_python(): open a Python file with the right encoding -> tokenize.open(): open a Python file with the right encoding
messages: + msg120756
stage: patch review
2010-11-08 15:28:13belopolskysetmessages: + msg120755
2010-11-07 09:24:45vstinnersetfiles: + tokenize_open-2.patch

messages: + msg120662
2010-11-07 09:04:47vstinnersetmessages: + msg120661
2010-11-06 18:00:59georg.brandlsetnosy: + georg.brandl
messages: + msg120643
2010-11-06 16:32:00eric.araujosetnosy: + pitrou
messages: + msg120638
2010-11-06 16:25:31belopolskysetmessages: + msg120637
2010-11-06 16:15:12eric.araujosetmessages: + msg120635
2010-11-06 16:00:12belopolskysetnosy: + belopolsky
messages: + msg120632
2010-11-06 15:44:13eric.araujosetmessages: + msg120629
2010-11-06 15:42:20vstinnersetmessages: + msg120628
2010-11-06 14:31:04eric.araujosetnosy: + eric.araujo
messages: + msg120616
2010-11-06 10:49:38vstinnercreate