msg152843 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-08 11:22 |
This is a feature I've wanted to use in too many times to remember. I've made a patch with an implementation, docs and a test. I've named the function rglob and tried to stay within the conventions of the glob package.
|
msg152846 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2012-02-08 12:10 |
I'm inclined to close this as a functional duplicate of http://bugs.python.org/issue13229
|
msg152847 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-08 12:47 |
I'd say it's very close to a duplicate but maybe isn't so. If walkdir is added then rglob can be implemented using it.
I'd say "rglob" to "walkdir" is like "urlopen" to "http.client". One is the stupid and simple function (that still has a bazillion use cases) and the other is the heavy lifting swiss army knife.
"file_paths(filtered_walk('.', included_files=['*.py']))" is a lot longer than "rglob('*.py')".
|
msg152849 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-08 12:49 |
> "file_paths(filtered_walk('.', included_files=['*.py']))" is a lot
> longer than "rglob('*.py')".
Agreed.
|
msg152852 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2012-02-08 12:55 |
A fair point indeed.
To follow the shutil naming convention (rmtree, copytree, and likely chmodtree, chowntree), a more appropriate name might be "globtree". (Thanks to string methods, the 'r' prefix doesn't read correctly to me: what does "globbing from the right" mean?)
|
msg152853 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-08 12:57 |
> To follow the shutil naming convention (rmtree, copytree, and likely
> chmodtree, chowntree), a more appropriate name might be "globtree".
> (Thanks to string methods, the 'r' prefix doesn't read correctly to
> me: what does "globbing from the right" mean?)
Well, if you put it in the glob module, it doesn't have to follow the
shutil naming convention :-)
(I prefer "rglob" myself)
|
msg152856 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2012-02-08 13:40 |
I can live with it either way - I just wanted to point out that our current examples of this kind of recursive filesystem access use a 'tree' suffix rather than an 'r' prefix.
|
msg152857 - (view) |
Author: Eli Bendersky (eli.bendersky) * |
Date: 2012-02-08 14:05 |
"file_paths(filtered_walk('.', included_files=['*.py']))" is a lot longer than "rglob('*.py')".
It is, but is that a good enough reason to have both? It can also be achieved with just a bit more code using the simple `os.walk`. I suppose there are a lot of instances of stdlib tools where we could add new tools that would make the code slightly shorter. However, this is not really faithful to the Python spirit, since it adds too many ways to do achieve the same effect, and ultimately confuses users.
That it adds additional maintenance burden on the coredevs goes without saying :-) Each such new burden should have a very good reason.
To conclude, personally I'm -1 on this, especially if `walkdir` eventually makes it into the stdlib.
|
msg152858 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-08 14:14 |
> "file_paths(filtered_walk('.', included_files=['*.py']))" is a lot
> longer than "rglob('*.py')".
>
>
> It is, but is that a good enough reason to have both?
It is. globbing is a well-known operation that many people expect to be
easily done.
> However, this is not really faithful to the Python spirit, since it
> adds too many ways to do achieve the same effect, and ultimately
> confuses users.
Which "Python spirit" are you talking about? We have many high-level
tools in the stdlib.
|
msg152868 - (view) |
Author: Éric Araujo (eric.araujo) * |
Date: 2012-02-08 15:21 |
There is an alternative: supporting ** syntax, e.g. '**/*.py', which should find all *.py files in the current directory and all descendents. At present glob('**/*.py') is equivalent to glob('*/*.py'), but we would say this behavior was undefined and the new behavior would be a new feature.
|
msg152871 - (view) |
Author: Eli Bendersky (eli.bendersky) * |
Date: 2012-02-08 15:43 |
>> It is. globbing is a well-known operation that many people expect to be easily done.
According to Wikipedia (http://en.wikipedia.org/wiki/Glob_%28programming%29) - "The noun "glob" is used to refer to a particular pattern, e.g. "use the glob *.log to match all those log files"".
IOW, globbing is usually understood as the act of expanding a pattern to the files it matches. Nothing in that implies recursive traversal of a directory tree. On the other hand, os.walk and/or walkdir suggest that in their name.
>> Which "Python spirit" are you talking about? We have many high-level
tools in the stdlib.
There should be one -- and preferably only one -- obvious way to do it.
Admittedly, we already have more than one, and a high-level tool is proposed with Nick's walkdir. Why add *yet another* high-level tool?
|
msg152873 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-08 15:52 |
> IOW, globbing is usually understood as the act of expanding a pattern
> to the files it matches. Nothing in that implies recursive traversal
> of a directory tree.
Still, that's a common need. "I want all Python files in a subtree".
> On the other hand, os.walk and/or walkdir suggest that in their name.
I don't know why "walk" is supposedly more recursive than "glob".
> Admittedly, we already have more than one, and a high-level tool is
> proposed with Nick's walkdir. Why add *yet another* high-level tool?
Because the walkdir spelling (IIUC) is longish, tedious and awkward.
I could see myself typing "rglob('*.py')" in a short script or an
interpreter session, without having to look up any kind of docs.
Certainly not the walkdir alternative (I've already forgotten what it
is).
|
msg152876 - (view) |
Author: Eli Bendersky (eli.bendersky) * |
Date: 2012-02-08 15:59 |
>> IOW, globbing is usually understood as the act of expanding a pattern
>> to the files it matches. Nothing in that implies recursive traversal
>> of a directory tree.
>
> Still, that's a common need. "I want all Python files in a subtree".
>
>> On the other hand, os.walk and/or walkdir suggest that in their name.
>
> I don't know why "walk" is supposedly more recursive than "glob".
Google "walk directory". First hit is a Rosetta code page with
*recursive* walking implemented in various languages. So I guess it
does have this connotation. Regardless, os.walk has been in Python for
ages, and it's always been the go-to tool for recursive traversal.
walkdir's name suggests the same.
>
>> Admittedly, we already have more than one, and a high-level tool is
>> proposed with Nick's walkdir. Why add *yet another* high-level tool?
>
> Because the walkdir spelling (IIUC) is longish, tedious and awkward.
> I could see myself typing "rglob('*.py')" in a short script or an
> interpreter session, without having to look up any kind of docs.
> Certainly not the walkdir alternative (I've already forgotten what it
> is).
walkdir is a new module proposal. If its API is tedious and awkward,
it should probably be improved *now* while it's in development. Adding
yet another tool that implements part of its functionality, winning a
golf tournament along the way, isn't the solution, IMHO.
|
msg152877 - (view) |
Author: Éric Araujo (eric.araujo) * |
Date: 2012-02-08 16:01 |
Feedback from Antoine on IRC about my syntax proposal: “The "**" meaning is not really universal like other quantifiers are. [...] (also, it would be quite harder to implement, I think)”
That and the compat issue makes me go in favor of a new function.
I’m not sure glob is the right place: when you use glob.glob, the search is rooted in the current directory, and you may have sub-directories in your pattern, e.g. 'Lib/*/__main__.py'. A function meaning “look for this file pattern recursively” would be IMO more at home in fnmatch.
|
msg152879 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-08 16:09 |
> Google "walk directory". First hit is a Rosetta code page with
> *recursive* walking implemented in various languages. So I guess it
> does have this connotation. Regardless, os.walk has been in Python for
> ages, and it's always been the go-to tool for recursive traversal.
> walkdir's name suggests the same.
You still haven't explained what your problem is with the idea of an
explicitly recursive glob (as both "rglob" and "globtree" suggest).
> walkdir is a new module proposal. If its API is tedious and awkward,
> it should probably be improved *now* while it's in development.
walkdir is not yet a module proposal, there's not even a PEP for it, and
it's in a very young state.
This issue has a working patch for rglob(), which is a single, obvious,
incremental addition to the existing glob module. If you want to discuss
walkdir, I suggest you do it in a separate issue.
(and, yes, rglob() can be reimplemented using walkdir later, if there is
a point in doing so)
|
msg152880 - (view) |
Author: Eli Bendersky (eli.bendersky) * |
Date: 2012-02-08 16:15 |
>> Google "walk directory". First hit is a Rosetta code page with
>> *recursive* walking implemented in various languages. So I guess it
>> does have this connotation. Regardless, os.walk has been in Python for
>> ages, and it's always been the go-to tool for recursive traversal.
>> walkdir's name suggests the same.
>
> You still haven't explained what your problem is with the idea of an
> explicitly recursive glob (as both "rglob" and "globtree" suggest).
>
The problem is that I prefer the walkdir approach, because it solves a
more general problem and overall more useful. This is also why I don't
see how it makes sense to stop discussing it here and focus on rglob.
They are related, after all!
Anyway, I'm not sure what else I can add to the discussion. I'm
starting to repeat myself, which means that I should just shut up :)
I've stated my preference, and I understand and respect yours. So
let's just see what others think.
|
msg152882 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-08 16:27 |
I'm trying the patch and its behaviour is strange:
>>> list(glob.rglob('setup.py'))
['setup.py']
>>> list(glob.rglob('setu*.py'))
[]
>>> list(glob.rglob('*/setu*.py'))
['./setup.py', './Mac/Tools/Doc/setup.py', './Tools/test2to3/setup.py', './Doc/includes/setup.py', './PC/example_nt/setup.py']
I can understand the first example (although that makes the documentation slightly incorrect, since you need an explicit "*" path component for the search to be recursive), but the second one looks straight wrong.
|
msg152894 - (view) |
Author: Éric Araujo (eric.araujo) * |
Date: 2012-02-08 17:18 |
>>> list(glob.rglob('*/setu*.py'))
It looks quite strange to me that '/' should be allowed in a function that recurses down directories (see my messages above). OTOH fnmatch is not really appropriate, contrary to my earlier feeling.
(Restoring my title change: as my messages were apparently overlooked, I assume that Eli did not revert my change on purpose but by replying to older email)
|
msg152898 - (view) |
Author: Eli Bendersky (eli.bendersky) * |
Date: 2012-02-08 18:59 |
Oops, Éric, sorry about the title. I didn't even notice :)
|
msg152917 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2012-02-08 23:57 |
I think it's important to be clear on what the walkdir API aims to be: a composable toolkit of utilities for directory tree processing. It's overall design is inspired directly by the itertools module.
Yes, it started life as a simple proposal to add shutil.filtered_walk (http://bugs.python.org/issue13229), but I soon realised that implementing this solely as a monolothic function would be foolish, since that approach isn't composable. What if you just wanted file filtering? Or depth limiting? Having it as a filtering toolkit lets you choose the exact filters you need for a given use case. walkdir.filtered_walk() is just an API for composing filtering pipelines without needing to pass the same information to multiple pipeline stages.
However, along with that itertools inspired iterator pipeline based design, I've also inherited Raymond's preference that particular *use cases* start life as recipes in the documentation.
A recursive glob is just a basic walkdir pipeline composition:
>>> from walkdir import file_paths, include_files
>>> def globtree(pattern, path='.'):
... return file_paths(include_files(os.walk(path), pattern))
Since filtered_walk() is just a pipeline builder, the composition can also be written:
>>> from walkdir import file_paths, filtered_walk
>>> def globtree(pattern, path='.'):
... return file_paths(filtered_walk(path, included_files=[pattern]))
That latter approach then suggests an alternative signature for globtree:
def globtree(*patterns, **kwds):
kwds.setdefault("top", ".")
return file_paths(filtered_walk(included_files=patterns, **kwds))
>>> print '\n'.join(sorted(globtree('*.rst')))
./index.rst
./py3k_binary_protocols.rst
./venv_bootstrap.rst
>>> print '\n'.join(sorted(globtree('*.rst', '*.py')))
./conf.py
./index.rst
./py3k_binary_protocols.rst
./venv_bootstrap.rst
On a somewhat related note, I'd also like to see us start concentrating higher level shell utilities in the shutil namespace so users don't have to check multiple locations for shell-related functionality quite so often (that is, I'd prefer shutil.globtree over glob.rglob).
|
msg152918 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-09 00:10 |
> However, along with that itertools inspired iterator pipeline based
> design, I've also inherited Raymond's preference that particular *use
> cases* start life as recipes in the documentation.
I think it's important to remember where we are coming from. Many people
complain that using os.walk is too cumbersome. Proposing another
cumbersome solution doesn't really help.
So I'm not against walkdir *per se*, but I'm -1 on the idea that walkdir
can eliminate the need for practical functions that anybody can use
*easily*.
> >>> print '\n'.join(sorted(globtree('*.rst', '*.py')))
> ./conf.py
> ./index.rst
> ./py3k_binary_protocols.rst
> ./venv_bootstrap.rst
I think it's rather nice, but it should be available as a stdlib
function rather than a "recipe" in the documentation.
Recipes are really overrated: they aren't tested, they aren't
maintained, they aren't part of a module's docstrings or
(pydoc-generated) contents, it's not obvious what kind of quality you
can expect from them (do they handle all cases correctly), it's not
obvious which Python versions they support. Raymond may like the idea,
but that doesn't make it a "good practice" Python should follow for its
batteries.
> On a somewhat related note, I'd also like to see us start
> concentrating higher level shell utilities in the shutil namespace so
> users don't have to check multiple locations for shell-related
> functionality quite so often (that is, I'd prefer shutil.globtree over
> glob.rglob).
Well, if glob() already lived in shutil, this decision would be a
no-brainer :) Having glob() in the glob module and globtree() in the
shutil module, though, looks a bit weird.
(I agree having a separate module for glob isn't ideal)
|
msg152922 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2012-02-09 01:58 |
We do have the option of aliasing glob.iglob as shutil.glob...
|
msg152925 - (view) |
Author: Eli Bendersky (eli.bendersky) * |
Date: 2012-02-09 04:15 |
>> Well, if glob() already lived in shutil, this decision would be a
no-brainer :) Having glob() in the glob module and globtree() in the
shutil module, though, looks a bit weird.
(I agree having a separate module for glob isn't ideal)
Would it be feasible to deprecate the 'glob' module, moving its functionality to shutil? In some future Python version, then, the module can be removed.
The same fate would go for fnmatch, I guess. There are too many modules lying around dealing with the same problems.
On a related note, the doc of glob explicitly mentions that it is implemented with os.listdir and fnmatch. Similarly, *if* the recursive glob gets accepted it should be implemented with walkdir (once that's in).
|
msg152943 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2012-02-09 12:07 |
This discussion (particularly my final globtree recipe) made me realise that the exact same approach would greatly improve the usability of the all_paths, file_paths and dir_paths iterators in walkdir [1]. Accordingly, walkdir 0.4 will let you write a recursive grep for ReST and Python source files as:
file_paths(top, included_files="*.py *.rst".split())
Scanning multiple directories will be as simple as:
file_paths(dir1, dir2, included_files="*.py *.rst".split())
[1] https://bitbucket.org/ncoghlan/walkdir/issue/15
|
msg152945 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-09 14:08 |
Thanks for the bug find Antoine, I worked surprisingly hard trying to make this right in more edge cases and while fixing it I noticed rglob/globtree has 3 options:
* Behave like a glob for every subdirectory. Meaning that every relative path gets a '*/' prepended to it. Eg rglob('c/d') started from the directory 'a' will yield 'a/b/c/d'.
* Behave like a glob for every subdirectory of the directory in the filter string. Meaning rglob('c/d') from dir 'a' won't yield 'a/b/c/d'. It would try to walk from 'a/c' and yield nothing if the directory 'c' doesn't exist in 'a'. Note that if the directory 'c' does exist then '/a/c/f/d' would be yielded. That seems kind of quirky to me.
* Behave like a filtered walk. Meaning that in order to yield files nested in subdirectories a wildcard must be introduced. Eg rglob('c/d') started from the directory 'a' won't yield 'a/b/c/d'. For that to occur you would need to use rglob('*c/d') or rglob('*/c/d'). What's more unfortunate is that rglob('d') doesn't yield 'a/b/c/d' which seems wrong. So I think for this we should special case paths that don't have path separators and prepend the "*/". Though some may argue it's wrong that rglob('d') yields 'a/b/c/d' even though rglob('c/d') won't yield it, I think that's the correct choice for this route.
Note that absolute paths with/without wildcards don't have this ambiguity. In both rel/abs wildcards should match directories and files alike.
Which option do you guys think would be best? I already have a fixed patch for option 1 and 3 but I'd rather hear your thoughts before I introduce either.
P.s. another slight issue I ran into is the fact that fnmatch doesn't ignore os.curdir:
>>> fnmatch.fnmatch('./a', 'a')
False
|
msg152946 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2012-02-09 14:49 |
I have to say that the non-obvious subtleties you point out in your rglob make me think I personally would probably opt to use Nick's module directly instead, so that I was sure what I was getting.
|
msg152949 - (view) |
Author: Eli Bendersky (eli.bendersky) * |
Date: 2012-02-09 15:18 |
> I have to say that the non-obvious subtleties you point out in your rglob make me think I personally would probably opt to use Nick's module directly instead, so that I was sure what I was getting.
>
> ----------
+1
|
msg152950 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-09 15:31 |
> I have to say that the non-obvious subtleties you point out in your rglob make me think I personally would probably opt to use Nick's module directly instead, so that I was sure what I was getting.
I didn't notice these corner cases initially because of their distance from the main use case. The main use-case is to glob the current directory or an absolute path tree using a wildcard, esp for finding all files of a given type. This leaves no ambiguity.
I'd like the edge cases of relative paths to behave in a documented and sensible way, but they pale in comparison to the usefulness of the proposal imho.
I believe we should decide what's the most useful and sensible behavior and have a shortcut for that.
|
msg152951 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2012-02-09 15:40 |
Hmm. Just to make it clear where I'm coming from, though, I should also point out that I use rdiff-backup, which uses the **/yadayada syntax, and I hate it any time I have to try to figure out what such a glob is going to actually match.
|
msg152952 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-09 15:46 |
> * Behave like a glob for every subdirectory. Meaning that every
> relative path gets a '*/' prepended to it. Eg rglob('c/d') started
> from the directory 'a' will yield 'a/b/c/d'.
That's what I would expect. That way, rglob('__init__.py') would find
all files named __init__.py beneath the current directory.
> P.s. another slight issue I ran into is the fact that fnmatch doesn't
> ignore os.curdir:
>
> >>> fnmatch.fnmatch('./a', 'a')
> False
Sounds ok. fnmatch is a low-level lexical thing.
|
msg152953 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-09 16:03 |
>> * Behave like a glob for every subdirectory. Meaning that every
>> relative path gets a '*/' prepended to it. Eg rglob('c/d') started
>> from the directory 'a' will yield 'a/b/c/d'.
> That's what I would expect. That way, rglob('__init__.py') would
> find all files named __init__.py beneath the current directory.
Perhaps we should make a single exemption for double dots eg rglob('../../__init__.py') starts the walk 2 folders out of the curdir and looks for '*/__init__.py'.
|
msg152955 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-09 16:16 |
> >> * Behave like a glob for every subdirectory. Meaning that every
> >> relative path gets a '*/' prepended to it. Eg rglob('c/d') started
> >> from the directory 'a' will yield 'a/b/c/d'.
>
> > That's what I would expect. That way, rglob('__init__.py') would
> > find all files named __init__.py beneath the current directory.
>
> Perhaps we should make a single exemption for double dots eg
> rglob('../../__init__.py') starts the walk 2 folders out of the curdir
> and looks for '*/__init__.py'.
This would be quirky. I don't think '..' should be treated specially.
(there's also the symlinks problem)
|
msg152957 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-09 16:27 |
> This would be quirky. I don't think '..' should be treated specially.
> (there's also the symlinks problem)
Again with 'a/b/c/d' and let's add a file 'a/b/png'.
If the curdir is 'c' and you use rglob('../pn*') you won won't find '../png' as you would be walking only in the curdir. I think that route would mean we should document double dots aren't supported.
|
msg152958 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-09 16:30 |
> Again with 'a/b/c/d' and let's add a file 'a/b/png'.
>
> If the curdir is 'c' and you use rglob('../pn*') you won won't find
> '../png' as you would be walking only in the curdir.
That depends how you implement it. If you detect that ".." exists and
glob for "pn*" inside it, you will probably find "../png".
|
msg152959 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-09 16:35 |
> That depends how you implement it. If you detect that ".." exists and
glob for "pn*" inside it, you will probably find "../png".
Yes, that's what I meant by a "single exemption for double dots". The solution should start the walk from wherever the double dots lead it to.
I believe we agree.
|
msg152960 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-09 16:36 |
> Yes, that's what I meant by a "single exemption for double dots". The
> solution should start the walk from wherever the double dots lead it
> to.
I don't think that's a single exemption; you should use the same
algorithm for all globs (e.g. "a/*.py"), shouldn't you?
|
msg152962 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-09 16:46 |
> you should use the same algorithm for all globs (e.g. "a/*.py"), shouldn't you?
That specific string would start the walk from the current directory IIUC.
|
msg152965 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-09 16:53 |
> > you should use the same algorithm for all globs (e.g. "a/*.py"), shouldn't you?
>
> That specific string would start the walk from the current directory IIUC.
Yes but would it match b/a/setup.py?
|
msg152966 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2012-02-09 17:39 |
How about having separate starting path and glob arguments, where the glob cannot contain any directory?
I'm -1 on this function as it stands. My vote could change if the final semantics are intuitive and unambiguous. (It's OK if getting the correct intuition requires understanding a brief explanation.)
|
msg152968 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-09 17:44 |
> My vote could change if the final semantics are intuitive and unambiguous.
How about "match the given glob from this directory and any
sub-directory"?
That sounds quite intuitive and unambiguous to me.
|
msg152971 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2012-02-09 18:03 |
So given
/home/a
/home/a/k.py
/home/a/c/j.py
/home/b/z.py
/home/b/c/f.py
and a current directory of /home/a, we'd have:
pattern matches
------- -------
*.py k.py, c/j.py
c/*.py c/j.py
c* c [?]
../*.py ?
../c/*.py ?
Thinking about those .. cases makes my brain hurt :) What does it mean to match '../*.py' when I'm recursing into the c subdirectory? What does it mean in the current directory, for that matter?
As you can see, your short explanation has left me with one case where I have a question, and one set of cases where my intuition completely fails.
--David
|
msg152982 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-09 20:36 |
Given
/home/a
/home/a/k.py
/home/a/c/j.py
/home/b/z.py
/home/b/c/f.py
and a current directory of /home/a, we'd have:
pattern matches
------- -------
*.py k.py, c/j.py
c/*.py c/j.py
c* c
../*.py ../a/k.py, ../a/c/j.py, ../b/z.py, ../b/c/f.py
../c/*.py ../b/c/f.py, ../a/c/j.py
For relative paths the double dots decide where to start walking and from then on you can imagine a glob on every subdir. In the last 2 examples the glob would have been '*.py' and 'c/*.py' respectively.
|
msg152986 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2012-02-09 21:01 |
Well, in that case I would expect that the argument 'c/*.py' would start walking in the c directory, but I definitely did not get that impression from Antoine's explanation of how the function works.
I again advocate separating the starting path specification from the glob specification, with the latter not allowing directories. A two argument function.
|
msg152990 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2012-02-09 22:12 |
Given the complexities proposed for the dir matching, I'm shifting back to a solid -1 on this. Trying to match multi-part directories with globs is a nightmare and I currently don't allow it at all in walkdir. Instead, dir filtering and file filtering are expressed separately.
In 0.4, walkdir will allow the following to search the current directory for all Python source files in subdirectories named 'c':
walk_files(included_files=["*.py"], included_dirs="c")
This discussion does mean I plan to add path filtering though, that works on the "dirpath" value directly (and *will* allow multi-part matches). So the "Python source files in subdirectories named 'd' of directories named 'c'" will be written something like:
walk_files(included_files=["*.py"], included_paths="*/c/d")
I also plan to explore options to make the pattern matching function configurable (i.e. so you can pass in something like "match=re.search")
|
msg152994 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-09 22:37 |
> This discussion does mean I plan to add path filtering though, that
> works on the "dirpath" value directly (and *will* allow multi-part
> matches). So the "Python source files in subdirectories named 'd' of
> directories named 'c'" will be written something like:
>
> walk_files(included_files=["*.py"], included_paths="*/c/d")
That's not different from rglobbing "*/c/d/*.py", is it? :)
|
msg152998 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2012-02-09 22:48 |
I don't know, is it? From what has been said so far I'd expect */c/d/*.py to look for *.py files in all c/d subdirectories of direct subdirectories of the current directory, and subdirectories of those c/d directories. But I wouldn't expect the c/d matching to go any deeper than that one level. Whereas with Nick's formulation, I would. And I know at a glance, without reading any documentation, what Nick's formulation does, whereas I very much don't with rglob.
|
msg152999 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-09 22:50 |
> I don't know, is it? From what has been said so far I'd expect
> */c/d/*.py to look for *.py files in all c/d subdirectories of direct
> subdirectories of the current directory, and subdirectories of those
> c/d directories. But I wouldn't expect the c/d matching to go any
> deeper than that one level.
That would be a normal glob, not a recursive glob. We already have
glob.glob().
|
msg153001 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-09 23:01 |
As R. David Murray has suggested I think there may be a middle ground.
def rglob(fn_filter, root='.'):
That would mean the default use case is still easy to remember as rglob('*.py') and also there aren't any explanations needed for how this function works.
Though I'm not sure which of these API's I like better.
|
msg153002 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2012-02-09 23:08 |
@antoine: no, my description involves recursion. It assumes that the path portion of the glob specifies the directories to *start* from, but that the filename glob portion then applies recursively to any of those start directories.
The alternative interpretation of the pattern, that it is fully matched against the CWD, takes you back to having to explain what '..' means when matched against a given location in the recursion, starting with the CWD. I clearly don't have a good intuition about that.
|
msg153003 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2012-02-09 23:09 |
Oh, yeah, and there's still the question of whether or not directories are matched by the terminal glob pattern, which I would naively expect they would be, in either interpretation, but I wouldn't be sure.
|
msg153004 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-09 23:10 |
> The alternative interpretation of the pattern, that it is fully
> matched against the CWD, takes you back to having to explain what '..'
> means when matched against a given location in the recursion, starting
> with the CWD. I clearly don't have a good intuition about that.
My alternative interpretation is that the pattern is being matched
against whatever subdir of the current dir (and the current dir itself).
The matching is not different from glob()'s.
|
msg153005 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2012-02-09 23:47 |
Ah, OK, so what you are saying is that rglob returns the concatenation of the results of running ls with the argument glob in each subdirectory of a walk starting with the current directory, except that the returned names have paths anchored in the current directory?
|
msg153006 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-10 00:42 |
> Ah, OK, so what you are saying is that rglob returns the concatenation
> of the results of running ls with the argument glob in each
> subdirectory of a walk starting with the current directory, except
> that the returned names have paths anchored in the current directory?
Yup.
|
msg153013 - (view) |
Author: Éric Araujo (eric.araujo) * |
Date: 2012-02-10 04:03 |
FTR, packaging has an extended glob function (not my code) which supports sets (with { , }) and recursivity (**).
|
msg153085 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2012-02-11 01:00 |
I would like to see this one API remain simple and leave more complex tasks to os.walk et al.
|
msg153192 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-12 10:07 |
Raymond Hettinger, by simple do you mean a single argument rglob function? Or do you mean you prefer glob doesn't get a new function?
|
msg153208 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2012-02-12 17:14 |
The last forumulation of what rglob does "apply the glob to the current directory and all subdirectories recursively, returning the joined list with filenames anchored in the current directory" is simple and intuitive enough for me. (I'm not sure it can be implemented using Nick's module.)
The discussions about starting directory and the meaning of '..' and implicitly prepending '*' totally confused me. If the formulation in my first paragraph is the intended semantics and the function actually implements those semantics (however it does it under the hood), then I'm fine with it.
|
msg153281 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-02-13 17:20 |
I noticed this implementation on PyPI http://packages.python.org/rglob/ which sort of has rglob defined as
def rglob(pattern, base='.'):
Which seems like the most comprehensible way of doing this, though not the most compact.
The code itself isn't in the best shape http://cpiekarski.com/2011/09/23/python-recursive-glob/ eg it uses list comprehensions instead of generators. But I think I like this API better as it really is easier to explain.
|
msg153300 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-02-13 20:48 |
> But I think I like this API better as it really is easier to explain.
I think you got it wrong. Looking at the implementation, it is no
different from the API proposed above, except that it also lets you
choose the start dir.
|
msg154600 - (view) |
Author: Michael.Elsdörfer (elsdoerfer) |
Date: 2012-02-29 07:56 |
This should absolutely be implemented as **:
- It is pretty much standard. Recursive globbing is not supported everywhere, but when it is, ** is used.
- A separate function will not allow me to let the *user* to decide when recursion should be used. I find this most important. When I need to find files internally, I always do so using os.walk etc. When I use glob, it is because I want to provide an interface to my users.
- The change to support ** is actually pretty trivial. I have implemented this as a module here: https://github.com/miracle2k/python-glob2/
- It's backwards-compatible - or close enough anyway. ** is currently perfectly nonsensical, making it a meaningful syntax element is acceptable, I think.
|
msg154603 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2012-02-29 08:30 |
FWIW, I've also come around to the point of view that it's worthwhile to provide stdlib support for the "**" convention (specifically due to the UI aspect that Michael mentions).
|
msg157278 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-04-01 12:14 |
I found this comprehensive description of the '**' convention at http://www.codeproject.com/Articles/2809/Recursive-patterned-File-Globbing that can translate directly to unittests.
I'd like to fix the patch for these specs but should it be in a new rglob function or in the existing glob.glob()? I think it should be a new one to avoid any edge-case compatibility concerns even though on face value there shouldn't be any.
|
msg157280 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-04-01 12:20 |
> I found this comprehensive description of the '**' convention at
> http://www.codeproject.com/Articles/2809/Recursive-patterned-File-Globbing that can translate directly to unittests.
>
> I'd like to fix the patch for these specs but should it be in a new
> rglob function or in the existing glob.glob()? I think it should be a
> new one to avoid any edge-case compatibility concerns even though on
> face value there shouldn't be any.
I think it should be the existing glob.glob(). We won't introduce a new
function any time we add a new syntactic feature in the glob
mini-language.
|
msg157281 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-04-01 12:25 |
I don't have a strong opinion on "rglob vs glob" so whichever way the majority here thinks is fine by me.
|
msg157287 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-04-01 13:42 |
For "**" globbing see http://ant.apache.org/manual/dirtasks.html#patterns .
If we extend pattern syntax of templates, why not implement Perl, Tcl or Bash extensions?
|
msg157290 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-04-01 14:04 |
On Sun, Apr 1, 2012 at 4:42 PM, Serhiy Storchaka <report@bugs.python.org> wrote:
> For "**" globbing see http://ant.apache.org/manual/dirtasks.html#patterns
They mention that "mypackage/test/ is interpreted as if it were mypackage/test/**" so that's not really an option. I'm pretty sure we should only recurse if "**" appears explicitly.
> If we extend pattern syntax of templates, why not implement Perl, Tcl or
> Bash extensions?
>
I'm not sure what you mean here but if it's that ##{} stuff then it should probably be discussed in a separate issue as it's not related to recursive globs.
|
msg159265 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-04-25 10:12 |
I added the doublestar functionality to iglob and updated the docs and tests.
Also, a few readability renames in that module were a long time coming.
I'd love to hear your feedback.
|
msg160057 - (view) |
Author: Yuval Greenfield (ubershmekel) * |
Date: 2012-05-06 06:21 |
So, anybody for or against this patch? I'd really like to see this feature make its way in...
|
msg160079 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-05-06 13:08 |
> So, anybody for or against this patch? I'd really like to see this
> feature make its way in...
I think the feature is useful, but someone needs to review the patch.
Sorry if it takes some time.
|
msg160743 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-05-15 16:42 |
I'm looking at the tests and I don't understand why '**/bcd/*' should match 'a/bcd/efg/ha'. Am I missing something?
|
msg176996 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-05 17:51 |
Here is a patch which implements recursive globbing which conforms to Bash globbing with "globstar" option.
For backward compatibility recursive globbing off by default and works only if new argument "recursive" is true (default is False). I am not sure this is a better variant. Possible the default should be True. '**' pattern is very unlikely in old code. However recursive globbing on arbitrary pattern and arbitrary tree is not safe, it can hang on recursive symlinks.
The patch contains changes from issue16618.
|
msg178810 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-01-02 12:32 |
Patch updated for current tip.
|
msg179947 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-01-14 16:02 |
I should add a symlink loop detecting to _rlistdir() as Antoine advised me on IRC.
|
msg179979 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-01-14 21:50 |
In fact glob() is already protected against an endless recursion (in the same way as Bash). The level of recursion is simply limited by the maximum length of the path. So I did not change the implementation, I have just added a test for symlink loop. I also corrected the other new tests so that they should not fail on the platform without symlinks.
|
msg203177 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-11-17 14:35 |
In updated patch fixed warning/errors when ran with -b or -bb options.
|
msg212878 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2014-03-07 14:06 |
Oops, Python 3.4 has ** support in pathlib, but we missed Serhiy's patch for the glob module itself. We should resolve that discrepancy for 3.5 :)
|
msg226422 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2014-09-05 12:59 |
Could you make a review Nick?
|
msg226459 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2014-09-05 22:20 |
Mostly looks good to me, just two comments:
1. Is there a reason the helper function is "glob2" rather than either "_glob2" or else something more self-documenting?
2. "match a files" in the docs and docstrings doesn't read correctly. "match any files", perhaps?
|
msg226471 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2014-09-06 05:52 |
Thank you Nick.
> 1. Is there a reason the helper function is "glob2" rather than either
> "_glob2" or else something more self-documenting?
Only consistency with other helper functions (glob0, glob1).
> 2. "match a files" in the docs and docstrings doesn't read correctly. "match
> any files", perhaps?
Of course. This is just a typo.
|
msg226476 - (view) |
Author: Nick Coghlan (ncoghlan) * |
Date: 2014-09-06 07:56 |
Looks good to me!
|
msg226757 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2014-09-11 09:23 |
New changeset ff4b9d654691 by Serhiy Storchaka in branch 'default':
Issue #13968: The glob module now supports recursive search in
http://hg.python.org/cpython/rev/ff4b9d654691
|
msg226759 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2014-09-11 09:25 |
Thank you for your review Nick.
|
msg226761 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2014-09-11 10:58 |
The test failed on a buildbot, I reopen the issue.
http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/10607/steps/test/logs/stdio
======================================================================
FAIL: test_selflink (test.test_glob.SymlinkLoopGlobTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_glob.py", line 284, in test_selflink
self.assertIn(path, results)
AssertionError: '@test_23056_tmp_dir/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/file' not found in {'noodly2', '@test_23056_tmp-\udcff.py', '__pycache__'}
|
msg226768 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2014-09-11 12:34 |
New changeset 180f5bf7d1b9 by Serhiy Storchaka in branch 'default':
Issue #13968: Fixed newly added recursive glob test.
http://hg.python.org/cpython/rev/180f5bf7d1b9
|
msg226769 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2014-09-11 12:41 |
Thank you Victor. The test was failed also when run it directly, omitting the
test.regrtest module (which run a test inside temporary directory):
./python Lib/test/test_glob.py
Now it is fixed.
However perhaps we should consider as a bug if a test ran by regrtest doesn't
clean created files or directories ('noodly2', '@test_23056_tmp-\udcff.py', and
'__pycache__' are created by some previous tests).
|
msg226778 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2014-09-11 14:31 |
> However perhaps we should consider as a bug if a test ran by regrtest doesn't clean created files or directories
=> yes, I opened the issue #22390.
|
msg340696 - (view) |
Author: Marc (marc-h38) * |
Date: 2019-04-23 07:49 |
Please review one word documentation change at https://github.com/python/cpython/pull/12918 to clarify that recursive glob ** follows symbolic links to directories.
|