Issue 41063: Avoid using the locale encoding for open() in tests

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/85235

classification

Title:	Avoid using the locale encoding for open() in tests
Type:	enhancement	Stage:
Components:	Tests	Versions:	Python 3.10, Python 3.9, Python 3.8, Python 3.7

process

Status:	open	Resolution:
Dependencies:	41048 41055 41058 41069 41136 41137 41138 41139 41140 41143 41150	Superseder:
Assigned To:		Nosy List:	methane, serhiy.storchaka, vstinner
Priority:	normal	Keywords:

Created on 2020-06-21 11:04 by serhiy.storchaka, last changed 2022-04-11 14:59 by admin.

Messages (1)
msg371994 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2020-06-21 11:04

Many tests use open() with the locale encoding for writing or reading files. They pass because the written and read data are ASCII, and the file paths are ASCII. But they do not test the case of non-ASCII data and file paths. In general, most uses of the locale encoding should be changed.

1. In some cases it is enough to open the file in binary mode. For example, when creating an empty file, or when only the fileno of the opened file is used.

2. In some cases the file should be opened in binary mode. For example, when compiling the content of the file or parsing it as XML, because the correct encoding is determined by the content (BOM, encoding cookie, XML declaration).

3. tokenize.open() or tokenize.detect_encoding() should be used when Python source is read as text.

4. os.fsdecode() and os.fsencode() may be used if the test file contains file paths and is read by bash or another external program.

5. encoding='ascii' should be specified if the test data is always ASCII-only.

6. encoding='utf-8' should be specified if the test data can contain arbitrary Unicode characters.

7. An encoding other than 'ascii', 'latin1' and 'utf-8' should be used if arbitrary encodings should be supported.

8. The implicit locale encoding should only be used if the test is meant to test the implicit encoding.
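
A minimal sketch of items 1 to 3, not taken from the patch itself. The file names (empty, data.xml, mod.py) and the variable names are hypothetical and used only for illustration. Non-ASCII characters are written as escapes so the example does not depend on how this text is stored.

```python
import os
import tempfile
import tokenize
import xml.etree.ElementTree as ET

with tempfile.TemporaryDirectory() as tmp:
    # Item 1: creating an empty file needs no text layer, so binary mode is enough.
    empty_path = os.path.join(tmp, "empty")          # hypothetical file name
    with open(empty_path, "wb"):
        pass
    empty_size = os.path.getsize(empty_path)

    # Item 2: the XML declaration names the encoding, so the parser gets bytes.
    xml_path = os.path.join(tmp, "data.xml")         # hypothetical file name
    xml_doc = '<?xml version="1.0" encoding="iso-8859-1"?><a>\u00e9</a>'
    with open(xml_path, "wb") as f:
        f.write(xml_doc.encode("iso-8859-1"))
    with open(xml_path, "rb") as f:
        xml_text = ET.parse(f).getroot().text

    # Item 3: tokenize.open() honours the BOM or coding cookie of Python source.
    src_path = os.path.join(tmp, "mod.py")           # hypothetical file name
    src = "# -*- coding: latin-1 -*-\nname = '\u00e9'\n"
    with open(src_path, "wb") as f:
        f.write(src.encode("latin-1"))
    with tokenize.open(src_path) as f:
        source = f.read()
```

In each case the locale encoding never enters the picture: either no decoding happens at all, or the encoding is taken from the file content.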

It is preferable to add non-ASCII characters to the test data.
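
The following sketch illustrates why ASCII-only test data hides encoding problems. The helper roundtrip() and the sample strings are hypothetical. The mismatch between 'utf-8' and 'latin-1' stands in for a locale encoding that differs from what the code under test expects.

```python
import os
import tempfile

ASCII_TEXT = "plain"
NON_ASCII_TEXT = "caf\u00e9 \u20ac"   # contains U+00E9 and U+20AC


def roundtrip(text, write_encoding, read_encoding):
    """Write text with one explicit encoding and read it back with another."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        with open(path, "w", encoding=write_encoding) as f:
            f.write(text)
        with open(path, encoding=read_encoding) as f:
            return f.read()


# ASCII data survives a mismatched encoding, so the bug goes unnoticed.
ascii_ok = roundtrip(ASCII_TEXT, "utf-8", "latin-1") == ASCII_TEXT

# Non-ASCII data exposes the same mismatch.
non_ascii_ok = roundtrip(NON_ASCII_TEXT, "utf-8", "latin-1") == NON_ASCII_TEXT

# With a matching explicit encoding (item 6) the data round-trips.
explicit_ok = roundtrip(NON_ASCII_TEXT, "utf-8", "utf-8") == NON_ASCII_TEXT

# encoding='ascii' (item 5) fails loudly if the data stops being ASCII-only.
try:
    roundtrip(NON_ASCII_TEXT, "ascii", "ascii")
except UnicodeEncodeError:
    ascii_rejects = True
else:
    ascii_rejects = False
```

Here ascii_ok and explicit_ok are true while non_ascii_ok is false, which is the situation the issue describes: the test only starts to check anything once the data leaves the ASCII range.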

I am working on a large patch for this (>50% is ready). Some parts of it may be extracted as separate PRs, and the rest will be submitted as one large PR. If changes are required outside the tests, separate issues will be opened.

History
Date	User	Action	Args
2022-04-11 14:59:32	admin	set	github: 85235
2020-06-28 16:23:08	serhiy.storchaka	set	dependencies: + pipes uses text files and the locale encoding
2020-06-27 15:14:43	serhiy.storchaka	set	dependencies: + distutils uses the locale encoding for the .pypirc file
2020-06-27 11:32:06	serhiy.storchaka	set	dependencies: + cgi uses the locale encoding for log files, cgitb uses the locale encoding for log files
2020-06-27 10:56:24	serhiy.storchaka	set	dependencies: + trace CLI reads source files using the locale encoding
2020-06-27 08:32:28	serhiy.storchaka	set	dependencies: + pdb uses the locale encoding for .pdbrc
2020-06-27 07:14:37	serhiy.storchaka	set	dependencies: + argparse uses default encoding when read arguments from file
2020-06-22 07:30:58	serhiy.storchaka	set	dependencies: + Use non-ascii file names in tests by default
2020-06-21 11:07:17	serhiy.storchaka	set	dependencies: + read_mime_types() should read the rule file using UTF-8, not the locale encoding, Remove outdated tests for tp_print, pdb reads source files using the locale encoding
2020-06-21 11:04:15	serhiy.storchaka	create