Message 187191 - Python tracker


Author	isoschiz
Recipients	christian.heimes, flox, isoschiz, jcea, pitrou, r.david.murray, serhiy.storchaka, sijinjoseph
Date	2013-04-17.18:39:39
Message-id	<516EEC5C.3040601@ensoft.co.uk>
In-reply-to	<1366222906.2556.7.camel@fsol>

Content

> Using a trick with struct.unpack() has a very unpleasant side effect.
> It might speed up encoding a little, but it creates a Struct object
> whose size is many times larger than the size of the processed
> data. Worse, this object is cached and continues to consume memory.
> Since the size of the data will most likely be unique, almost every
> call of b85encode creates a new object. This will lead to memory
> leaks.

Can you elaborate on this? What leakage is there? I assume this is some 
implementation quirk of the struct module that I'm not aware of.
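
To make the question concrete, here is a rough sketch of the pattern being discussed. It assumes the encoder unpacks the whole input as big-endian 32-bit words using a format string built from the input length. The helper names are hypothetical and not taken from the patch. If the struct module keeps compiled formats around, the first variant would need one per distinct input length, while the second always reuses the same one.

```python
import struct

def words_single_format(data):
    # Hypothetical helper: one format string per input length,
    # e.g. '!3I' for 12 bytes of input.
    assert len(data) % 4 == 0
    return struct.unpack('!%dI' % (len(data) // 4), data)

# A fixed format, compiled once and reused for every call.
_WORD = struct.Struct('!I')

def words_fixed_format(data):
    # Hypothetical helper: same result, but the format does not
    # depend on the input length.
    assert len(data) % 4 == 0
    return tuple(_WORD.unpack_from(data, i)[0]
                 for i in range(0, len(data), 4))
```

Both helpers return the same tuple of integers. They differ only in how many distinct format strings they hand to the struct module.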

> On Wednesday 17 April 2013 at 18:14 +0000, Serhiy Storchaka wrote:
>> I think we can provide a universal solution compatible (with some
>> pre/postprocessing) with both variants: whether or not to enclose the
>> encoded data in <~ and ~>, and at which column to wrap the encoded
>> data. Padding can easily be implemented as preprocessing
>> (data + (-len(data)) % 4 * b'\0').
>
> That's ok with me. It's just more work for whoever does it :-)
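
For reference, the padding expression quoted above can be wrapped in a small helper. This is only an illustration of that one-liner; the function name is made up here.

```python
def pad_to_multiple_of_4(data):
    # (-len(data)) % 4 is the number of bytes missing to reach the
    # next multiple of 4, and 0 if the length is already aligned.
    return data + (-len(data)) % 4 * b'\0'
```

For a 5-byte input this appends three NUL bytes. For an input whose length is already a multiple of 4, including the empty string, it appends nothing.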

As I mentioned in one of my previous comments, I was trying very hard 
not to touch the Mercurial solution (b85(en|de)code in the latest 
patch), and just copy it wholesale. Mostly, I don't really like the way 
the solution reads (unpythonic in my eyes), but can understand that for 
this kind of thing that might be the best way.

In my solution (a85(en|de)code) I wrote it from scratch in what I felt 
was a readable way. I can quite easily extend my version to support your 
description of the btoa/atob version (i.e. no bracketing, always pad, 
always wrap output).
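
As a rough illustration of the postprocessing side (bracketing and wrapping), something along these lines would do. The function name and keyword arguments are hypothetical and only meant to show the idea, not what the patch does.

```python
def frame(encoded, *, bracket=False, wrapcol=0):
    # Optionally enclose the already-encoded bytes in <~ and ~>.
    if bracket:
        encoded = b'<~' + encoded + b'~>'
    # Optionally break the output into lines of at most wrapcol bytes.
    if wrapcol:
        encoded = b'\n'.join(encoded[i:i + wrapcol]
                             for i in range(0, len(encoded), wrapcol))
    return encoded
```

With both options left at their defaults the encoded data passes through unchanged, so the same encoder core could serve the bracketed and unbracketed variants.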

I'm less convinced it's sensible to merge the Ascii85 implementations 
and the Mercurial b85 one. If you really want that though, I would be in 
favour of using my a85 implementation and just changing the encode inner 
function to use the lookup table.

(We can do all this independently of the function names, which I think 
Antoine and I are agreed should be separate for the different 
implementations.)

>> As for Git/Mercurial's base85, what other applications use this
>> encoding?
>
> I don't know, but they use it to produce binary diffs ("diff" chunks
> of binary files), so any application wanting to parse Mercurial/Git
> diffs would have to recognize base85 data.
>
> (and I also like that the Mercurial/Git variant is the simpler of
> all 3 :-))

I actually prefer the Ascii85 one for the simplicity of the encoding 
(shift the base-85 digits of each input chunk by 33 to get into the 
printable ASCII range) rather than the clunky lookup table approach. 
À chacun son goût. :-)

History
Date	User	Action	Args
2013-04-17 18:39:40	isoschiz	set	recipients: + isoschiz, jcea, pitrou, christian.heimes, r.david.murray, flox, sijinjoseph, serhiy.storchaka
2013-04-17 18:39:40	isoschiz	link	issue17618 messages
2013-04-17 18:39:39	isoschiz	create