classification
Title: Handle bytes comparisons in difflib.Differ
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gward Nosy List: barry, berker.peksag, durin42, gward, ncoghlan, pitrou, python-dev, r.david.murray, terry.reedy
Priority: normal Keywords: patch

Created on 2013-03-17 14:42 by barry, last changed 2015-04-21 03:06 by berker.peksag. This issue is now closed.

Files
File name Uploaded Description Edit
bytes_diff.py terry.reedy, 2013-03-19 22:51 Wrap difflib.context/unified_diff
13161c1d9c5f.diff terry.reedy, 2015-04-15 21:28 review
13161c1d9c5-difflibf.diff terry.reedy, 2015-04-15 22:22 review
fa4c6160c518.diff gward, 2015-04-17 01:23 review
Repositories containing patches
http://hg.gerg.ca/cpython
Messages (29)
msg184379 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2013-03-17 14:42
This came up at the Pycon 2013 Python 3 porting clinic.  There are many cases in the stdlib that claim (either explicitly or implicitly) to accept bytes or strings, but that don't return the type of the arguments they accept.  An example is urllib.parse.quote() which accepts bytes or str but always returns str.  A similar example brought up at the clinic was difflib, which accepts both types, and works internally on both, but crashes when joining the results for return.

It should be policy for the stdlib (i.e. codified in an informational PEP and including bug reports, because they *are* bugs, not features or baked-in API) where bytes or str are accepted but the right things are not done (i.e. return the type you accept).

This bug captures the principle, and probably should be closed once such a PEP is accepted, with individual bugs opened for each individual case.
msg184381 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-03-17 15:10
There was a long thread about this on python-dev that might be worth going back over, where I had the same misconception (that functions should always return the same type as their arguments).  While I think that should be the default design, it isn't always the best API.  (The real rule, if I recall correctly, is that functions should never accept *mixed* argument types for input data.)
msg184382 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2013-03-17 15:19
On Mar 17, 2013, at 03:10 PM, R. David Murray wrote:

>There was a long thread about this on python-dev that might be worth going
>back over, where I had the same misconception (that functions should always
>return the same type as their arguments).  While I think that should be the
>default design, it isn't always the best API.  (The real rule, if I recall
>correctly, is that functions should never accept *mixed* argument types for
>input data.)

Totally agree about the mixed type rule.

But this is something different, and I know it's been discussed, and is a
difficult problem.  It's causing real-world pain for people though, so it's
worth thinking about again.
msg184409 - (view) Author: Greg Ward (gward) (Python committer) Date: 2013-03-18 00:26
The particular use case that triggered this: Mercurial's test suite. It runs "hg blah blah" and compares the output against known good output. But Mercurial's output is just bytes, because pretty much everything in a Mercurial repo is just bytes (file data of course, but also filenames and even changeset metadata like usernames).

So attempting to run the Mercurial test suite under 3.x immediately fails hard. The boiled-down essence of the bug is this:

>>> import difflib
>>> a = b"hello world"
>>> b = b"goodbye world"
>>> [line for line in difflib.unified_diff(a, b)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/home/greg/src/cpython/3.2/Lib/difflib.py", line 1224, in unified_diff
    yield '-' + line
TypeError: Can't convert 'int' object to str implicitly
msg184424 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-03-18 05:35
That looks like a bug in difflib (regardless of what type it returns).  But note that I'm agreeing that returning the same type you are given is generally preferrable.
msg184428 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-03-18 06:17
Changing behavior that already matches the docs is an enhancement, not a bugfix, and one that will almost certainly break code. It is therefore one that would normally require a deprecation period. I think the most you should ask for is to skip the deprecation period.

I believe the urllib and difflib problems are quite different. I am going to presume that urllib simply converts bytes input to str and goes on from there, returning the result as str rather than (possibly) converting back to bytes. That is an example for this issue.

Difflib.unified_diff, on the other hand, raises rather than returning an unexpected or undesired type. The 3 sections like the below have two problems given the toy input of two bytes objects.

            if tag in {'replace', 'delete'}:
                for line in a[i1:i2]:
                    yield '-' + line

First, iterating bytes or a slice of bytes returns ints, not 1-byte bytes. Hence the exception. Even if that were worked around, the mixed string constant + bytes expression would raise a TypeError. One fix for both problems would be to change the expression to '-' + str(line).

Neither of these problems are bugs. The doc says "Compare a and b (lists of strings)". Actually, 'sequence of strings' is sufficient. For the operations of unified_diff, a string looks like a sequence of 1-char strings, which is why

>>> for l in difflib.unified_diff('ab', 'c'): print(l)

--- 

+++ 

@@ -1,2 +1 @@

-a
-b
+c

works.

The other lines yielded by unified_diff are produced with str.format, and % formatting does not seem to work with bytes either. So a dual string/bytes function would not be completely trivial.

Greg, can you convert bytes to strings, or strings to bytes, for your tests, or do you have non-ascii codes in your bytes? Otherwise, I think it might be better to write a new function 'unified_diff_bytes' that did exactly what you want than to try to make unified_diff accept sequences of bytes.
msg184438 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-03-18 08:39
At a glance, this just looks like a bug in difflib - it should use
different literals when handling bytes. (Given that difflib newline
processing assumes ASCII compatibility, a latin-1 based decode/encode
solution may also be viable).
msg184487 - (view) Author: Greg Ward (gward) (Python committer) Date: 2013-03-18 18:43
The original reproduction I posted was incorrect -- it makes difflib look worse than it should. (I passed strings rather than lists of strings.) Here is a more accurate version:

>>> import difflib
>>> a = [b'hello']
>>> b = [b'hello!']
>>> '\n'.join(line for line in difflib.unified_diff(a, b))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
  File "/home/greg/src/cpython/3.3/Lib/difflib.py", line 1223, in unified_diff
    yield '-' + line
TypeError: Can't convert 'bytes' object to str implicitly

So it still crashes, but the exception makes it pretty clear what the problem is.
msg184490 - (view) Author: Greg Ward (gward) (Python committer) Date: 2013-03-18 18:50
Replying to Terry Reedy:
> So a dual string/bytes function would not be completely trivial.

Correct. I have one working, but it makes my eyes bleed. I fail ashamed to have written it.

> Greg, can you convert bytes to strings, or strings to bytes

Nope. Here is the hypothetical use case: I have a text file written in Polish encoded in ISO-8859-1 committed to a Mercurial repository. (Or saved in a filesystem somewhere: doesn't really matter, except that Mercurial repositories are immutable, long-term, and *must* *not* *lose* *data*.) Then I decide I should play nicely with the rest of the world and transcode to UTF-8, so commit a new rev in UTF-8.

Years later, I need to look at the diff between those two old revisions. Rev 1 is a pile of ISO-8859-2 bytes, and rev 2 is a pile of UTF-8 bytes. The output of diff looks like

  - blah blah [iso-8859-2 bytes] blah
  + blah blah [utf-8 bytes] blah

Note this: the output of diff has some lines that are iso-8859-2 bytes and some that are utf-8 bytes. *There is no single encoding* that applies.

Note also that diff output must contain the exact original bytes, so that it can be consumed by patch. Diffs are read both by humans and by machines.

> Otherwise, I think it might be better to write a new function 
> 'unified_diff_bytes' that did exactly what you want than to try to 
> make unified_diff accept sequences of bytes.

Good idea. That might be much less revolting than what I have now. I'll give it a shot.
msg184534 - (view) Author: Greg Ward (gward) (Python committer) Date: 2013-03-18 21:48
OK I now have two competing patches. Both are disgusting, but in different ways.

1) http://hg.gerg.ca/cpython/rev/fcf3d27f20d9
   - factor out all the string constants
   - always concatenate, do not .format()

2) http://hg.gerg.ca/cpython/rev/cebefce2cfd4
   - copy {unified,context}_diff() to {unified,context}_diff_bytes()
   - this is a future maintenance headache, guaranteed!

Feedback welcome. If anyone can see a way to unify these two approaches, or a third way that sucks less, I'm all ears.
msg184535 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-03-18 22:02
> Necessary when comparing files with unknown or inconsistent encodings: 
> there's no way to convert everything to Unicode, so just leave it all
> bytes.

You could simply use the surrogateescape error handler.
msg184547 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-03-18 23:04
Since we don't need to worry about ASCII incompatible encodings (difflib will already have issues with such files due to the assumptions about newlines), it should be possible to use the same approach as that used in urllib.parse, but based on latin-1 rather than ascii.

It's the least bad option for this kind of use case (surrogateescape can be good too, but it doesn't work properly in this case where the two encodings may be different and we want to compare the raw bytes directly).

(changed scope of issue to reflect the subsequent discussion)
msg184566 - (view) Author: Greg Ward (gward) (Python committer) Date: 2013-03-19 00:24
Take 3: http://hg.gerg.ca/cpython/rev/78bdb10551ee
- uses surrogateescape as suggested by Antoine
- seems to work
msg184580 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-03-19 02:07
I was about to suggested a simplified version of the original one-function version but the new one is better. One change: name = lambda... is discouraged in the stdlib (Guido, pydev, a few years ago). Def statements require only 3 more chars and produce properly named objects for tracebacks.
  encode = lambda s: s
  def encode(s): return s

Under current rules, this is a 3.4 enhancement. For the context_diff and unified_diff doc change:
-  Compare a and b (lists of strings);
+  Compare string or bytes sequences a and b;   # (or)
+  Compare a and b (both sequences of strings or sequences of bytes);

Neither entry says anything at present about the type of from/tofile. Based on your patch, the following could go after the first sentence:
+"Arguments *fromfile* and *tofile* are normally strings but may be bytes if the items of *a* and *b* are."
msg184596 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-03-19 05:29
Ah, I forgot we didn't do within-line diffs. If we did those, then latin-1 would be less bad choice than ascii+surrogateescape. As it is, either should work in this case (since they just need to tunnel the raw bytes, and aren't being sliced at all).

I agree with most of Terry's comments, but think the case can be made this is a bug worth fixing in 3.3 (it's definitely borderline, but making it feasible to port Mercurial is a pretty big gain for a relatively tiny risk).

3.2 is about to enter security fix only mode though, so dropping that from the list of affected versions - while the fix should apply just fine, it's definitely not appropriate to make a change like this in the final planned maintenance release.
msg184597 - (view) Author: Greg Ward (gward) (Python committer) Date: 2013-03-19 05:31
Thanks for the review, Terry! Here is a revised patch, now on trunk:

  http://hg.gerg.ca/cpython/rev/6dedcdbe7cd5

I believe I have addressed all of your concerns.

Note also that the tests now highlight some dubious behaviour. Further feedback is welcome!

I'm happy to rebase onto 3.3 if folks generally agree it's safe. (It seems fine to me; IMHO being unable to handle bytes is a regression relative to Python 2.)
msg184699 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-03-19 22:51
The surrogate escape approach embodies the 3.x recommendation:
decode bytes to strings, manipulate strings, encode strings to bytes.
It also makes it possible to wrap the existing context/unified_diff functions, without touching them, with a simple 12 line function. Function bytes_diff avoids the complexities of mixing and unmixing strings and bytes that remain in Greg's latest patch.

I recommend the following: replace the simple test in the attached bytes_diff.py with Greg's unittest-based tests and adjust the __name__ == '__main__' incantation accordingly. Next upload to pypi to make it available to all 3.1-3.3 users. Then, after some minimal field testing, add the utility wrapper function to 3.4 difflib.  These steps would make moot for difflib the sub-issue of whether the 3.x design is a bug fixable in bugfix releases. We could even add a reference to the pypi module in the 3.2 and 3.3 docs.
msg185157 - (view) Author: Greg Ward (gward) (Python committer) Date: 2013-03-24 21:04
> I recommend the following: replace the simple test in the attached bytes_diff.py with 
> Greg's unittest-based tests and adjust the __name__ == '__main__' incantation 
> accordingly.

Latest patch, following Terry's suggestion: http://hg.gerg.ca/cpython/rev/6718d54cf9eb

Pro:
- does not touch unified_diff() or context_diff(), just adds a new function
- no question that this is a new feature, so no debate about what branch to commit on
- forces caller to know exactly what they are dealing with, strings or bytes

Con:
- not a bug fix, so 3.3 users won't get it ... but they can just copy the 
  implementation of diff_bytes() (or, as Terry suggests, it could live on PyPI)
- has explicit isinstance() checks (for a good reason: see the comments)
- also has more Pythonic "raise TypeError if s.decode raised AttributeError",
  which is friendlier to duck typing but inconsistent with the isinstance() checks
  used elsewhere in the patch

Overall I'm fairly happy with this. Not too thrilled with the explicit type checks; suggestions welcome.

Regarding Terry's suggestion of putting diff_bytes() on PyPI: meh. If the only project that needs this is Mercurial, that would be pointless. Mercurial doesn't have PyPI dependencies, and that is unlikely to change. This might be one of those rare cases where copying the code is easier than depending on it.
msg240717 - (view) Author: Augie Fackler (durin42) * Date: 2015-04-13 19:53
Friendly ping. With bytes formatting in Python 3.5a3, this is now the biggest pain port for getting our test runner working cleanly on Python 3.
msg240718 - (view) Author: Augie Fackler (durin42) * Date: 2015-04-13 19:54
(For values of "our" == "Mercurial".)
msg241098 - (view) Author: Greg Ward (gward) (Python committer) Date: 2015-04-15 13:32
OK I've revived my patch and rebased on latest trunk.

http://hg.gerg.ca/cpython/rev/13161c1d9c5f

Comments welcome. I'll push this in a couple of days if nobody objects.
msg241107 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2015-04-15 14:36
Some small comments:

* diff_bytes needs to be documented in the difflib docs and in Doc/whatsnew/3.5.rst.

* diff_bytes needs to be added to difflib.__all__

* This looks like a new feature to me, so it would be better to just commit it to the default branch.


+        except AttributeError:
+            raise TypeError('all arguments must be bytes, not %r' % s)

This could be changed to raise TypeError(...) from None


+            self.assertTrue(
+                isinstance(line, bytes),

assertIsInstance


+        try:
+            list(difflib.unified_diff(a, b, fna, fnb))
+            self.fail('expected TypeError')
+        except TypeError:
+            pass

with self.assertRaises(TypeError):
    list(difflib.unified_diff(a, b, fna, fnb))

looks more readable to me.
msg241172 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2015-04-15 22:22
I tried the [Create Patch] button.  Two problems: the result is about 90% concerned with other issues; it is not reviewable on Rietveld.  I will unlink it and upload a cut-down version.

Wtiht the test changes suggested by Berker, I agree that it is time to apply this, with whatever decision we make about 3.4.

I am sympathetic to the notion that there is a regression from 2.x.  There is precedent for adding a feature to fix a bug (in difflib, a new parameter for SequenceMatcher, for 2.7 3 (or thereabouts)).  However, doing so was contentious (discussed on pydev) and not meant to be routine.  The bug being fixed had been reported (as I remember) on four separate issues by four people and seconded by other people, so we really wanted the fix in 2.7.

Would the following compromise work for Mercurial?  The patch already adds a new private function _check_types.  For 3.4, also add _diff_bytes as a private function.  Merge both into 3.5.  Create a 3.5 patch that makes _diff_bytes public by renaming it to diff_bytes, adds the new tests, and documents the new feature.  The What's New entry could mention that the function was added privately in 3.4.4.
msg241174 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2015-04-15 22:24
Now the review button appears for the big patch.  Lets see if another submission makes it appear for the smaller version.
msg241175 - (view) Author: Augie Fackler (durin42) * Date: 2015-04-15 22:24
Changes to 3.4 aren't going to help Mercurial. Given that bytes formatting is new in 3.5, I won't be attempting a port that supports a version of Python 3 any older than 3.5. At this point, just a difflib.diff_bytes method in 3.5 would be sufficient to satisfy my needs.
msg241239 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-04-16 17:10
I think the convert to str -> process as str -> convert back to bytes approach is a good one - it's the same one we use in urllib.parse.

In this case, since we explicit need to handle mixed encodings, I also agree with the idea of using surrogate escape to make it possible to tunnel arbitrary bytes through the process, and expose that as a new module level API for Python 3.5.
msg241302 - (view) Author: Greg Ward (gward) (Python committer) Date: 2015-04-17 01:27
Just uploaded https://bugs.python.org/file39083/fa4c6160c518.diff. Pretty sure I've addressed all of @berker.peksag's review comments: thanks for that! 

I also fixed a number of subtle bugs in the tests. Pro tip: when asserting that something raises TypeError, inspect the exception message. There are many many ways that Python code can raise TypeError *other than* the condition you thought you were testing for. ;-)

I'm happy with this patch now unless anyone else spots problems.

@durin42: have you been trying this patch with your Mercurial-on-Python-3.5 patches? This would be a good time to re-update your difflib.py.
msg241304 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2015-04-17 01:38
Thanks, looks great. Two trivial comments about the documentation:

* Needs ``.. versionadded:: 3.5``
* *dfunc(a, b, fromfile, tofile, fromfiledate, tofiledate, n, lineterm)* -> ``dfunc(a, b, fromfile, tofile, fromfiledate, tofiledate, n, lineterm)``
msg241691 - (view) Author: Roundup Robot (python-dev) Date: 2015-04-21 01:20
New changeset 1764d42b340d by Greg Ward in branch 'default':
#17445: difflib: add diff_bytes(), to compare bytes rather than str
https://hg.python.org/cpython/rev/1764d42b340d
History
Date User Action Args
2015-04-21 03:06:42berker.peksagsetstage: commit review -> resolved
2015-04-21 01:22:35gwardsetstatus: open -> closed
resolution: fixed
2015-04-21 01:20:18python-devsetnosy: + python-dev
messages: + msg241691
2015-04-17 01:38:55berker.peksagsetmessages: + msg241304
versions: - Python 3.4
2015-04-17 01:27:49gwardsetmessages: + msg241302
2015-04-17 01:23:57gwardsetfiles: + fa4c6160c518.diff
2015-04-16 17:10:14ncoghlansetmessages: + msg241239
2015-04-15 22:24:02durin42setmessages: + msg241175
2015-04-15 22:24:01terry.reedysetmessages: + msg241174
2015-04-15 22:22:07terry.reedysetfiles: + 13161c1d9c5-difflibf.diff

messages: + msg241172
stage: patch review -> commit review
2015-04-15 21:28:08terry.reedysetfiles: + 13161c1d9c5f.diff
2015-04-15 14:36:18berker.peksagsetnosy: + berker.peksag
messages: + msg241107
2015-04-15 13:32:33gwardsetmessages: + msg241098
2015-04-13 23:31:27terry.reedysetversions: + Python 3.5, - Python 3.3
2015-04-13 19:54:03durin42setmessages: + msg240718
2015-04-13 19:53:49durin42setmessages: + msg240717
2013-03-24 21:04:20gwardsetassignee: gward
messages: + msg185157
2013-03-19 22:51:21terry.reedysetfiles: + bytes_diff.py

messages: + msg184699
2013-03-19 20:50:58r.david.murraysetfiles: - 6dedcdbe7cd5.diff
2013-03-19 20:32:23r.david.murraysetfiles: + 6dedcdbe7cd5.diff
keywords: + patch
2013-03-19 05:31:34gwardsetmessages: + msg184597
2013-03-19 05:29:30ncoghlansettype: behavior
stage: patch review
messages: + msg184596
versions: - Python 3.2
2013-03-19 02:07:39terry.reedysetmessages: + msg184580
2013-03-19 00:24:38gwardsetmessages: + msg184566
2013-03-18 23:04:44ncoghlansetmessages: + msg184547
title: Return the type you accept -> Handle bytes comparisons in difflib.Differ
2013-03-18 22:02:31pitrousetnosy: + pitrou
messages: + msg184535
2013-03-18 21:48:48gwardsethgrepos: + hgrepo179
messages: + msg184534
2013-03-18 18:50:27gwardsetmessages: + msg184490
2013-03-18 18:43:01gwardsetmessages: + msg184487
2013-03-18 08:39:11ncoghlansetmessages: + msg184438
2013-03-18 06:17:58terry.reedysetnosy: + terry.reedy
messages: + msg184428
2013-03-18 05:35:42r.david.murraysetmessages: + msg184424
2013-03-18 00:26:23gwardsetnosy: + durin42, gward
messages: + msg184409
2013-03-17 15:19:32barrysetmessages: + msg184382
2013-03-17 15:10:40r.david.murraysetnosy: + r.david.murray, ncoghlan
messages: + msg184381
2013-03-17 14:42:58barrycreate