Issue 38663: Untokenize does not round-trip ws before bs-nl

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/82844

classification

Title:	Untokenize does not round-trip ws before bs-nl
Type:	behavior	Stage:	test needed
Components:		Versions:	Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	edreamleo, terry.reedy
Priority:	normal	Keywords:

Created on 2019-11-01 17:31 by edreamleo, last changed 2022-04-11 14:59 by admin.

Messages (5)
msg355827 - (view)	Author: Edward K Ream (edreamleo) *	Date: 2019-11-01 17:31
Tested on 3.6.

tokenize.untokenize does not round-trip whitespace before backslash-newlines outside of strings:

    from io import BytesIO
    import tokenize

    # Round tripping fails on the second string.
    table = (
    r'''
    print\
        ("abc")
    ''',
    r'''
    print \
        ("abc")
    ''',
    )
    for s in table:
        tokens = list(tokenize.tokenize(
            BytesIO(s.encode('utf-8')).readline))
        result = g.toUnicode(tokenize.untokenize(tokens))
        print(result==s)

I have an important use case that would benefit from a proper untokenize. After considerable study, I have not found a proper fix for tokenize.add_whitespace.

I would be happy to work with anyone to rewrite tokenize.untokenize so that unit tests pass without fudges in TestRoundtrip.check_roundtrip.
msg355898 - (view)	Author: Edward K Ream (edreamleo) *	Date: 2019-11-03 13:17
The original bug report used a Leo-only function, g.toUnicode. To fix this, replace:

    result = g.toUnicode(tokenize.untokenize(tokens))

by:

    result_b = tokenize.untokenize(tokens)
    result = result_b.decode('utf-8', 'strict')
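With that substitution applied, the reproduction needs only the standard library. The sketch below is one way to package it; the helper name round_trips and the constant TABLE are hypothetical, and the 4-space indent on the continuation lines is an assumption, since the exact whitespace is not recoverable from the report. The report describes the second case failing on 3.6; the outcome may differ on other Python versions, so no expected output is shown.

```python
import tokenize
from io import BytesIO

# The two cases from the report. They differ only in the space before the
# backslash. The 4-space continuation indent is an assumption of this sketch.
TABLE = (
r'''
print\
    ("abc")
''',
r'''
print \
    ("abc")
''',
)


def round_trips(source):
    """Return True if untokenize(tokenize(source)) reproduces source exactly."""
    tokens = list(tokenize.tokenize(BytesIO(source.encode('utf-8')).readline))
    # tokenize.tokenize emits an ENCODING token first, so untokenize
    # returns bytes here and the result has to be decoded.
    result_b = tokenize.untokenize(tokens)
    return result_b.decode('utf-8', 'strict') == source


for s in TABLE:
    print(repr(s), round_trips(s))
```

Comparing repr(s) against the boolean makes it easy to see which input, if any, loses its whitespace on the interpreter being tested.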
msg355899 - (view)	Author: Edward K Ream (edreamleo) *	Date: 2019-11-03 13:23
This post https://groups.google.com/d/msg/leo-editor/DpZ2cMS03WE/VPqtB9lTEAAJ discusses a complete rewrite of tokenize.untokenize.

To quote from the post:

I have "discovered" a spectacular replacement for Untokenizer.untokenize in python's tokenize library module. The wretched, buggy, and impossible-to-fix add_whitespace method is gone. The new code has no significant 'if' statements, and knows almost nothing about tokens! This is the way untokenize is written in The Book.

The new code should put an end to a long series of issues against untokenize code in python's tokenize library module. Some closed issues were blunders arising from dumbing-down the TestRoundtrip.check_roundtrip method in test_tokenize.py.

Imo, the way is now clear for proper unit testing of python's Untokenize class.
msg355900 - (view)	Author: Edward K Ream (edreamleo) *	Date: 2019-11-03 13:27
This post: https://groups.google.com/d/msg/leo-editor/DpZ2cMS03WE/5X8IDzpgEAAJ discusses unit testing. The summary states:

"I've done the heavy lifting on issue 38663. Python devs should handle the details of testing and packaging."

I'll leave it at that. In some ways this issue is very minor, and of almost no interest to anyone :-) Do with it as you will. The ball is in python's court.
msg355910 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2019-11-03 22:08
Since these posts were more or less copied to the pydev list, I am copying my response on the list here.

---

> tl;dr: Various posts, linked below, discuss a much better replacement for untokenize.

If that were true, I would be interested. But as explained below, I don't believe it. Even if I did, https://bugs.python.org/issue38663 gives no evidence that you have signed the PSF contributor agreement. In any case, it has no PR. We only use code that is actually contributed on the issue or in a PR under that agreement.

To continue, the first two lines of tokenize.untokenize() are

    ut = Untokenizer()
    out = ut.untokenize(iterable)

Your leoBeautify.Untokenize class appears to be completely unsuited as a replacement for tokenize.Untokenizer, as the API for the class and method are incompatible with the above.

1. tokenize.Untokenizer takes no argument. leoBeautify.Untokenize() requires a 'contents' argument, a (unicode) string, that is otherwise undocumented. At first glance, it appears that 'contents' needs to be something like the desired output. (I could read the code where you call Untokenizer to improve my guess, but not now.) Since our existing tests do not pass 'contents', they should all fail.

2. tokenize.Untokenizer.untokenize(iterable) requires an iterable that returns "sequences with at least two elements, the token type and the token string."
https://docs.python.org/3/library/tokenize.html#tokenize.untokenize
One can generate python code from a sequence of pairs with a guarantee that the resulting code will be tokenized by the python.exe parser into the same sequence.

The doc continues "Any additional sequence elements are ignored." The intent is that a tool can tokenize a file, modify the file (and thereby possibly invalidate the begin, end, and line elements of the original token stream) and generate a modified file.

[Note that the end index (4th element), when present, is not ignored but is used to improve white space insertion. I believe that this should be documented. What if the end index is no longer valid? Should we also use the start index?]

leoBeautify.Untokenize.untokenize() requires an iterable of 5-tuples. It makes use of both the start and end elements, as well as the mysterious required 'contents' string.

> I have "discovered" a spectacular replacement for Untokenizer.untokenize in python's tokenize library module:

To pass 'code == untokenize(tokenize(code))' (ignoring api details), there is an even more spectacular replacement: rebuild the code from the 'line' elements. But while the above is an essential test, it is a toy example with respect to applications. The challenge is to create a correct and valid file from less information, possibly with only token type and string. (The latter is 'compatibility mode'.)

> In particular, it is, imo, time to remove compatibility mode.

And break all usage that requires it? Before doing much more with tokenize, I would want to understand more about how it is actually used.

> Imo, python devs are biased in favor of parse trees in programs involving text manipulations. [snip]

So why have 46 of us contributed to this one module? This sort of polemic is a net negative here. We are multiple individuals with differing opinions.
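
The 'compatibility mode' guarantee described in point 2 can be illustrated with a short sketch. It feeds untokenize only (type, string) pairs and checks that the rebuilt source tokenizes back to the same pairs, which is the documented guarantee; exact spacing is not promised. The helper name type_string_pairs and the sample source are hypothetical and chosen for illustration only.

```python
import io
import tokenize


def type_string_pairs(source):
    """Tokenize source text and keep only (type, string) from each token."""
    readline = io.StringIO(source).readline
    return [(tok.type, tok.string) for tok in tokenize.generate_tokens(readline)]


source = "def f(a):\n    return a  +  1\n"
pairs = type_string_pairs(source)

# 2-tuples carry no positions, so untokenize has to choose the spacing
# between tokens itself. With no ENCODING token present it returns str.
rebuilt = tokenize.untokenize(pairs)

print(repr(rebuilt))
# The rebuilt text is only guaranteed to yield the same (type, string)
# sequence when tokenized again, not to match the original text.
print(type_string_pairs(rebuilt) == pairs)
```

This is the weaker contract that a tool relies on after it has edited the token stream and the original start, end, and line fields no longer apply.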

History
Date	User	Action	Args
2022-04-11 14:59:22	admin	set	github: 82844
2019-11-03 22:08:12	terry.reedy	set	versions: + Python 3.9, - Python 3.6 nosy: + terry.reedy messages: + msg355910 stage: test needed
2019-11-03 13:27:10	edreamleo	set	messages: + msg355900
2019-11-03 13:23:54	edreamleo	set	messages: + msg355899
2019-11-03 13:17:57	edreamleo	set	messages: + msg355898
2019-11-01 17:31:35	edreamleo	create