Message 185799 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	pje
Recipients	Arfrever, eric.smith, eric.snow, jerub, jpaugh, ncoghlan, pje, ronaldoussoren, serhiy.storchaka
Date	2013-04-02.03:25:14
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1364873116.63.0.480916025149.issue14905@psf.upfronthosting.co.za>
In-reply-to

Content
Just a note: the zip files produced by the distutils and friends (sdist, bdist_dumb, eggs) do not include entries for plain directories. I would guess that this is also true for wheels at the moment, unless something was specifically done to work around this property of distutils-generated zip files. So ISTM the right thing to do is to synthesize the entries at directory read time, when they're being looped over anyway. Reviewing the patch, there is a performance optimization possible by making a slight change to the algorithm. Currently the patch loops from the start of the string to the end, looking for path prefixes. This means that the total overall performance is determined by the length of the strings and especially the average directory depth. However, there is a significant shortcut possible: looping from the end of each string to the beginning, it's possible to break out of the loop if the prefix has already been seen -- thus saving (depth-1) dictionary lookups in the average case, and only looking at the characters in the base filename, unless a new directory is encountered... for a typical overhead of one unicode substring, dictionary lookup, and strrchr per zipfile directory entry. (Which is very small compared to what else is going on at that point in the process.) To elaborate, if you have paths of the form: x/y/a x/y/b x/y/c/d Then when processing 'x/y/a', you would first process x/y -- it's not in the dict, add it. Then x -- not in the dict, add it. Then you go to x/y/b, your first parent is x/y again -- but since it's in the dict you skip it, and don't even bother with the x. Next you see x/y/c, which is not in the dict, so you add it, then x/y, which is, so you break out of the loop for that item. Basically, about all that would change would be the for() loop starting at the end of the string and going to the beginning, with the loop position still representing the end of the prefix to be extracted. And the PyDict_Contains check would result in a break rather than a continue. So, if the only concern keeping the patch from being accepted is that it adds to startup time, this approach would cut down quite a bit on the overhead for generating the path information, in cases of repeated prefixes. (And in the common cases for zipfile use on sys.path, one would expect to see a lot of common prefixes, if only for package names.)

Just a note: the zip files produced by the distutils and friends (sdist, bdist_dumb, eggs) do not include entries for plain directories.  I would guess that this is also true for wheels at the moment, unless something was specifically done to work around this property of distutils-generated zip files.  So ISTM the right thing to do is to synthesize the entries at directory read time, when they're being looped over anyway.

Reviewing the patch, there is a performance optimization possible by making a slight change to the algorithm.  Currently the patch loops from the start of the string to the end, looking for path prefixes.  This means that the total overall performance is determined by the length of the strings and especially the average directory depth.

However, there is a significant shortcut possible: looping from the *end* of each string to the beginning, it's possible to break out of the loop if the prefix has already been seen -- thus saving (depth-1) dictionary lookups in the average case, and only looking at the characters in the base filename, unless a new directory is encountered... for a typical overhead of one unicode substring, dictionary lookup, and strrchr per zipfile directory entry.  (Which is very small compared to what else is going on at that point in the process.)

To elaborate, if you have paths of the form:

x/y/a
x/y/b
x/y/c/d

Then when processing 'x/y/a', you would first process x/y -- it's not in the dict, add it.  Then x -- not in the dict, add it.  Then you go to x/y/b, your first parent is x/y again -- but since it's in the dict you skip it, and don't even bother with the x.  Next you see x/y/c, which is not in the dict, so you add it, then x/y, which is, so you break out of the loop for that item.

Basically, about all that would change would be the for() loop starting at the end of the string and going to the beginning, with the loop position still representing the end of the prefix to be extracted.  And the PyDict_Contains check would result in a break rather than a continue.

So, if the only concern keeping the patch from being accepted is that it adds to startup time, this approach would cut down quite a bit on the overhead for generating the path information, in cases of repeated prefixes.  (And in the common cases for zipfile use on sys.path, one would expect to see a lot of common prefixes, if only for package names.)

History
Date	User	Action	Args
2013-04-02 03:25:16	pje	set	recipients: + pje, ronaldoussoren, ncoghlan, jerub, eric.smith, Arfrever, eric.snow, serhiy.storchaka, jpaugh
2013-04-02 03:25:16	pje	set	messageid: <1364873116.63.0.480916025149.issue14905@psf.upfronthosting.co.za>
2013-04-02 03:25:16	pje	link	issue14905 messages
2013-04-02 03:25:14	pje	create