Message 184490 - Python tracker

Author	gward
Recipients	barry, durin42, gward, ncoghlan, r.david.murray, terry.reedy
Date	2013-03-18.18:50:26
Message-id	<1363632627.1.0.872503303943.issue17445@psf.upfronthosting.co.za>
In-reply-to

Content
Replying to Terry Reedy:
> So a dual string/bytes function would not be completely trivial.

Correct. I have one working, but it makes my eyes bleed. I feel ashamed to have written it.

> Greg, can you convert bytes to strings, or strings to bytes

Nope. Here is the hypothetical use case: I have a text file written in Polish, encoded in ISO-8859-2, committed to a Mercurial repository. (Or saved in a filesystem somewhere: doesn't really matter, except that Mercurial repositories are immutable, long-term, and *must* *not* *lose* *data*.) Then I decide I should play nicely with the rest of the world and transcode to UTF-8, so I commit a new rev in UTF-8.

Years later, I need to look at the diff between those two old revisions. Rev 1 is a pile of ISO-8859-2 bytes, and rev 2 is a pile of UTF-8 bytes. The output of diff looks like

  - blah blah [iso-8859-2 bytes] blah
  + blah blah [utf-8 bytes] blah

Note this: the output of diff has some lines that are iso-8859-2 bytes and some that are utf-8 bytes. *There is no single encoding* that applies.

Note also that diff output must contain the exact original bytes, so that it can be consumed by patch. Diffs are read both by humans and by machines.
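
To make the mixed-encoding point concrete, here is a small illustration. The Polish sample sentence and the variable names are invented for the example; they do not come from any real repository.

```python
# Illustrative sample text only; any Polish text with diacritics would do.
text = "zażółć gęślą jaźń\n"

rev1 = text.encode("iso-8859-2")   # the old revision's bytes
rev2 = text.encode("utf-8")        # the transcoded revision's bytes

# The body of a diff between the two revisions mixes both encodings.
diff_body = b"-" + rev1 + b"+" + rev2

# Decoding the whole diff as UTF-8 fails on the ISO-8859-2 line...
try:
    diff_body.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# ...and decoding it as ISO-8859-2 "succeeds" but garbles the UTF-8 line.
as_latin2 = diff_body.decode("iso-8859-2")
garbled = as_latin2 != "-" + text + "+" + text
```

Either choice of a single encoding loses or corrupts one side of the diff, which is why the lines have to stay as bytes.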

> Otherwise, I think it might be better to write a new function 
> 'unified_diff_bytes' that did exactly what you want than to try to 
> make unified_diff accept sequences of bytes.

Good idea. That might be much less revolting than what I have now. I'll give it a shot.
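
For illustration only, one way such a function might be sketched is to smuggle the bytes through str using the "surrogateescape" error handler, run the existing difflib.unified_diff, and turn the result back into bytes. This is an assumption about one possible approach, not the code described above; the signature of unified_diff_bytes and the sample data are hypothetical.

```python
import difflib

def unified_diff_bytes(a, b, fromfile=b"", tofile=b"", n=3, lineterm=b"\n"):
    """Hypothetical sketch: diff two lists of bytes lines, yielding bytes lines.

    Each byte >= 0x80 is mapped to a lone surrogate on the way in and back
    to the same byte on the way out, so no encoding is ever assumed.
    """
    def decode(s):
        return s.decode("ascii", "surrogateescape")

    def encode(s):
        return s.encode("ascii", "surrogateescape")

    lines = difflib.unified_diff(
        [decode(line) for line in a],
        [decode(line) for line in b],
        decode(fromfile),
        decode(tofile),
        n=n,
        lineterm=decode(lineterm),
    )
    for line in lines:
        yield encode(line)

# Sample data invented for the example: same text, two encodings.
old = [b"blah\n", "zażółć\n".encode("iso-8859-2")]
new = [b"blah\n", "zażółć\n".encode("utf-8")]
patch = list(unified_diff_bytes(old, new, b"a/f.txt", b"b/f.txt"))
```

In this sketch the "-" line carries the exact ISO-8859-2 bytes and the "+" line the exact UTF-8 bytes, so the output could still be fed to patch.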

History
Date	User	Action	Args
2013-03-18 18:50:27	gward	set	recipients: + gward, barry, terry.reedy, ncoghlan, durin42, r.david.murray
2013-03-18 18:50:27	gward	set	messageid: <1363632627.1.0.872503303943.issue17445@psf.upfronthosting.co.za>
2013-03-18 18:50:27	gward	link	issue17445 messages
2013-03-18 18:50:26	gward	create