classification
Title: Support recursive globs
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.5
process
Status: open Resolution:
Dependencies: 16618 Superseder:
Assigned To: serhiy.storchaka Nosy List: cvrebert, elsdoerfer, eric.araujo, giampaolo.rodola, loewis, ncoghlan, pitrou, r.david.murray, rhettinger, serhiy.storchaka, ubershmekel
Priority: normal Keywords: needs review, patch

Created on 2012-02-08 11:22 by ubershmekel, last changed 2014-04-04 13:19 by serhiy.storchaka.

Files
File name Uploaded Description Edit
rglob.patch ubershmekel, 2012-02-08 11:22 review
glob.doublestars.patch ubershmekel, 2012-04-25 10:12 review
glob_recursive_2.patch serhiy.storchaka, 2013-01-02 12:32 review
glob_recursive_3.patch serhiy.storchaka, 2013-01-14 21:50 review
glob_recursive_4.patch serhiy.storchaka, 2013-11-17 14:35 review
Messages (76)
msg152843 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-08 11:22
This is a feature I've wanted to use in too many times to remember. I've made a patch with an implementation, docs and a test. I've named the function rglob and tried to stay within the conventions of the glob package.
msg152846 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-08 12:10
I'm inclined to close this as a functional duplicate of http://bugs.python.org/issue13229
msg152847 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-08 12:47
I'd say it's very close to a duplicate but maybe isn't so. If walkdir is added then rglob can be implemented using it.

I'd say "rglob" to "walkdir" is like "urlopen" to "http.client". One is the stupid and simple function (that still has a bazillion use cases) and the other is the heavy lifting swiss army knife.

"file_paths(filtered_walk('.', included_files=['*.py']))" is a lot longer than "rglob('*.py')".
msg152849 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-08 12:49
> "file_paths(filtered_walk('.', included_files=['*.py']))" is a lot
> longer than "rglob('*.py')".

Agreed.
msg152852 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-08 12:55
A fair point indeed.

To follow the shutil naming convention (rmtree, copytree, and likely chmodtree, chowntree), a more appropriate name might be "globtree". (Thanks to string methods, the 'r' prefix doesn't read correctly to me: what does "globbing from the right" mean?)
msg152853 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-08 12:57
> To follow the shutil naming convention (rmtree, copytree, and likely
> chmodtree, chowntree), a more appropriate name might be "globtree".
> (Thanks to string methods, the 'r' prefix doesn't read correctly to
> me: what does "globbing from the right" mean?)

Well, if you put it in the glob module, it doesn't have to follow the
shutil naming convention :-)
(I prefer "rglob" myself)
msg152856 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-08 13:40
I can live with it either way - I just wanted to point out that our current examples of this kind of recursive filesystem access use a 'tree' suffix rather than an 'r' prefix.
msg152857 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-02-08 14:05
"file_paths(filtered_walk('.', included_files=['*.py']))" is a lot longer than "rglob('*.py')".


It is, but is that a good enough reason to have both? It can also be achieved with just a bit more code using the simple `os.walk`. I suppose there are a lot of instances of stdlib tools where we could add new tools that would make the code slightly shorter. However, this is not really faithful to the Python spirit, since it adds too many ways to do achieve the same effect, and ultimately confuses users.

That it adds additional maintenance burden on the coredevs goes without saying :-) Each such new burden should have a very good reason.

To conclude, personally I'm -1 on this, especially if `walkdir` eventually makes it into the stdlib.
msg152858 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-08 14:14
> "file_paths(filtered_walk('.', included_files=['*.py']))" is a lot
> longer than "rglob('*.py')".
> 
> 
> It is, but is that a good enough reason to have both?

It is. globbing is a well-known operation that many people expect to be
easily done.

> However, this is not really faithful to the Python spirit, since it
> adds too many ways to do achieve the same effect, and ultimately
> confuses users.

Which "Python spirit" are you talking about? We have many high-level
tools in the stdlib.
msg152868 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-08 15:21
There is an alternative: supporting ** syntax, e.g. '**/*.py', which should find all *.py files in the current directory and all descendents.  At present glob('**/*.py') is equivalent to glob('*/*.py'), but we would say this behavior was undefined and the new behavior would be a new feature.
msg152871 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-02-08 15:43
>> It is. globbing is a well-known operation that many people expect to be easily done.

According to Wikipedia (http://en.wikipedia.org/wiki/Glob_%28programming%29) - "The noun "glob" is used to refer to a particular pattern, e.g. "use the glob *.log to match all those log files"". 

IOW, globbing is usually understood as the act of expanding a pattern to the files it matches. Nothing in that implies recursive traversal of a directory tree. On the other hand, os.walk and/or walkdir suggest that in their name.

>> Which "Python spirit" are you talking about? We have many high-level
tools in the stdlib.

There should be one -- and preferably only one -- obvious way to do it. 

Admittedly, we already have more than one, and a high-level tool is proposed with Nick's walkdir. Why add *yet another* high-level tool?
msg152873 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-08 15:52
> IOW, globbing is usually understood as the act of expanding a pattern
> to the files it matches. Nothing in that implies recursive traversal
> of a directory tree.

Still, that's a common need. "I want all Python files in a subtree".

> On the other hand, os.walk and/or walkdir suggest that in their name.

I don't know why "walk" is supposedly more recursive than "glob".

> Admittedly, we already have more than one, and a high-level tool is
> proposed with Nick's walkdir. Why add *yet another* high-level tool?

Because the walkdir spelling (IIUC) is longish, tedious and awkward.
I could see myself typing "rglob('*.py')" in a short script or an
interpreter session, without having to look up any kind of docs.
Certainly not the walkdir alternative (I've already forgotten what it
is).
msg152876 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-02-08 15:59
>> IOW, globbing is usually understood as the act of expanding a pattern
>> to the files it matches. Nothing in that implies recursive traversal
>> of a directory tree.
>
> Still, that's a common need. "I want all Python files in a subtree".
>
>> On the other hand, os.walk and/or walkdir suggest that in their name.
>
> I don't know why "walk" is supposedly more recursive than "glob".

Google "walk directory". First hit is a Rosetta code page with
*recursive* walking implemented in various languages. So I guess it
does have this connotation. Regardless, os.walk has been in Python for
ages, and it's always been the go-to tool for recursive traversal.
walkdir's name suggests the same.

>
>> Admittedly, we already have more than one, and a high-level tool is
>> proposed with Nick's walkdir. Why add *yet another* high-level tool?
>
> Because the walkdir spelling (IIUC) is longish, tedious and awkward.
> I could see myself typing "rglob('*.py')" in a short script or an
> interpreter session, without having to look up any kind of docs.
> Certainly not the walkdir alternative (I've already forgotten what it
> is).

walkdir is a new module proposal. If its API is tedious and awkward,
it should probably be improved *now* while it's in development. Adding
yet another tool that implements part of its functionality, winning a
golf tournament along the way, isn't the solution, IMHO.
msg152877 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-08 16:01
Feedback from Antoine on IRC about my syntax proposal: “The "**" meaning is not really universal like other quantifiers are. [...] (also, it would be quite harder to implement, I think)”

That and the compat issue makes me go in favor of a new function.

I’m not sure glob is the right place: when you use glob.glob, the search is rooted in the current directory, and you may have sub-directories in your pattern, e.g. 'Lib/*/__main__.py'.  A function meaning “look for this file pattern recursively” would be IMO more at home in fnmatch.
msg152879 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-08 16:09
> Google "walk directory". First hit is a Rosetta code page with
> *recursive* walking implemented in various languages. So I guess it
> does have this connotation. Regardless, os.walk has been in Python for
> ages, and it's always been the go-to tool for recursive traversal.
> walkdir's name suggests the same.

You still haven't explained what your problem is with the idea of an
explicitly recursive glob (as both "rglob" and "globtree" suggest).

> walkdir is a new module proposal. If its API is tedious and awkward,
> it should probably be improved *now* while it's in development.

walkdir is not yet a module proposal, there's not even a PEP for it, and
it's in a very young state.

This issue has a working patch for rglob(), which is a single, obvious,
incremental addition to the existing glob module. If you want to discuss
walkdir, I suggest you do it in a separate issue.

(and, yes, rglob() can be reimplemented using walkdir later, if there is
a point in doing so)
msg152880 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-02-08 16:15
>> Google "walk directory". First hit is a Rosetta code page with
>> *recursive* walking implemented in various languages. So I guess it
>> does have this connotation. Regardless, os.walk has been in Python for
>> ages, and it's always been the go-to tool for recursive traversal.
>> walkdir's name suggests the same.
>
> You still haven't explained what your problem is with the idea of an
> explicitly recursive glob (as both "rglob" and "globtree" suggest).
>

The problem is that I prefer the walkdir approach, because it solves a
more general problem and overall more useful. This is also why I don't
see how it makes sense to stop discussing it here and focus on rglob.
They are related, after all!

Anyway, I'm not sure what else I can add to the discussion. I'm
starting to repeat myself, which means that I should just shut up :)

I've stated my preference, and I understand and respect yours. So
let's just see what others think.
msg152882 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-08 16:27
I'm trying the patch and its behaviour is strange:

>>> list(glob.rglob('setup.py'))
['setup.py']
>>> list(glob.rglob('setu*.py'))
[]
>>> list(glob.rglob('*/setu*.py'))
['./setup.py', './Mac/Tools/Doc/setup.py', './Tools/test2to3/setup.py', './Doc/includes/setup.py', './PC/example_nt/setup.py']

I can understand the first example (although that makes the documentation slightly incorrect, since you need an explicit "*" path component for the search to be recursive), but the second one looks straight wrong.
msg152894 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-08 17:18
>>> list(glob.rglob('*/setu*.py'))

It looks quite strange to me that '/' should be allowed in a function that recurses down directories (see my messages above).  OTOH fnmatch is not really appropriate, contrary to my earlier feeling.

(Restoring my title change: as my messages were apparently overlooked, I assume that Eli did not revert my change on purpose but by replying to older email)
msg152898 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-02-08 18:59
Oops, Éric, sorry about the title. I didn't even notice :)
msg152917 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-08 23:57
I think it's important to be clear on what the walkdir API aims to be: a composable toolkit of utilities for directory tree processing. It's overall design is inspired directly by the itertools module.

Yes, it started life as a simple proposal to add shutil.filtered_walk (http://bugs.python.org/issue13229), but I soon realised that implementing this solely as a monolothic function would be foolish, since that approach isn't composable. What if you just wanted file filtering? Or depth limiting? Having it as a filtering toolkit lets you choose the exact filters you need for a given use case. walkdir.filtered_walk() is just an API for composing filtering pipelines without needing to pass the same information to multiple pipeline stages.

However, along with that itertools inspired iterator pipeline based design, I've also inherited Raymond's preference that particular *use cases* start life as recipes in the documentation.

A recursive glob is just a basic walkdir pipeline composition:

>>> from walkdir import file_paths, include_files
>>> def globtree(pattern, path='.'):
...     return file_paths(include_files(os.walk(path), pattern))
        
Since filtered_walk() is just a pipeline builder, the composition can also be written:

>>> from walkdir import file_paths, filtered_walk
>>> def globtree(pattern, path='.'):
...     return file_paths(filtered_walk(path, included_files=[pattern]))

That latter approach then suggests an alternative signature for globtree:

def globtree(*patterns, **kwds):
    kwds.setdefault("top", ".")
    return file_paths(filtered_walk(included_files=patterns, **kwds))

>>> print '\n'.join(sorted(globtree('*.rst')))
./index.rst
./py3k_binary_protocols.rst
./venv_bootstrap.rst

>>> print '\n'.join(sorted(globtree('*.rst', '*.py')))
./conf.py
./index.rst
./py3k_binary_protocols.rst
./venv_bootstrap.rst

On a somewhat related note, I'd also like to see us start concentrating higher level shell utilities in the shutil namespace so users don't have to check multiple locations for shell-related functionality quite so often (that is, I'd prefer shutil.globtree over glob.rglob).
msg152918 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-09 00:10
> However, along with that itertools inspired iterator pipeline based
> design, I've also inherited Raymond's preference that particular *use
> cases* start life as recipes in the documentation.

I think it's important to remember where we are coming from. Many people
complain that using os.walk is too cumbersome. Proposing another
cumbersome solution doesn't really help.

So I'm not against walkdir *per se*, but I'm -1 on the idea that walkdir
can eliminate the need for practical functions that anybody can use
*easily*.

> >>> print '\n'.join(sorted(globtree('*.rst', '*.py')))
> ./conf.py
> ./index.rst
> ./py3k_binary_protocols.rst
> ./venv_bootstrap.rst

I think it's rather nice, but it should be available as a stdlib
function rather than a "recipe" in the documentation.

Recipes are really overrated: they aren't tested, they aren't
maintained, they aren't part of a module's docstrings or
(pydoc-generated) contents, it's not obvious what kind of quality you
can expect from them (do they handle all cases correctly), it's not
obvious which Python versions they support. Raymond may like the idea,
but that doesn't make it a "good practice" Python should follow for its
batteries.

> On a somewhat related note, I'd also like to see us start
> concentrating higher level shell utilities in the shutil namespace so
> users don't have to check multiple locations for shell-related
> functionality quite so often (that is, I'd prefer shutil.globtree over
> glob.rglob).

Well, if glob() already lived in shutil, this decision would be a
no-brainer :) Having glob() in the glob module and globtree() in the
shutil module, though, looks a bit weird.
(I agree having a separate module for glob isn't ideal)
msg152922 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-09 01:58
We do have the option of aliasing glob.iglob as shutil.glob...
msg152925 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-02-09 04:15
>> Well, if glob() already lived in shutil, this decision would be a
no-brainer :) Having glob() in the glob module and globtree() in the
shutil module, though, looks a bit weird.
(I agree having a separate module for glob isn't ideal)


Would it be feasible to deprecate the 'glob' module, moving its functionality to shutil? In some future Python version, then, the module can be removed.

The same fate would go for fnmatch, I guess. There are too many modules lying around dealing with the same problems.

On a related note, the doc of glob explicitly mentions that it is implemented with os.listdir and fnmatch. Similarly, *if* the recursive glob gets accepted it should be implemented with walkdir (once that's in).
msg152943 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-09 12:07
This discussion (particularly my final globtree recipe) made me realise that the exact same approach would greatly improve the usability of the all_paths, file_paths and dir_paths iterators in walkdir [1]. Accordingly, walkdir 0.4 will let you write a recursive grep for ReST and Python source files as:

    file_paths(top, included_files="*.py *.rst".split())

Scanning multiple directories will be as simple as:

    file_paths(dir1, dir2, included_files="*.py *.rst".split())

[1] https://bitbucket.org/ncoghlan/walkdir/issue/15
msg152945 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-09 14:08
Thanks for the bug find Antoine, I worked surprisingly hard trying to make this right in more edge cases and while fixing it I noticed rglob/globtree has 3 options:

* Behave like a glob for every subdirectory. Meaning that every relative path gets a '*/' prepended to it. Eg rglob('c/d') started from the directory 'a' will yield 'a/b/c/d'.

* Behave like a glob for every subdirectory of the directory in the filter string. Meaning rglob('c/d') from dir 'a' won't yield 'a/b/c/d'. It would try to walk from 'a/c' and yield nothing if the directory 'c' doesn't exist in 'a'. Note that if the directory 'c' does exist then '/a/c/f/d' would be yielded. That seems kind of quirky to me.

* Behave like a filtered walk. Meaning that in order to yield files nested in subdirectories a wildcard must be introduced. Eg rglob('c/d') started from the directory 'a' won't yield 'a/b/c/d'. For that to occur you would need to use rglob('*c/d') or rglob('*/c/d'). What's more unfortunate is that rglob('d') doesn't yield 'a/b/c/d' which seems wrong. So I think for this we should special case paths that don't have path separators and prepend the "*/". Though some may argue it's wrong that rglob('d') yields 'a/b/c/d' even though rglob('c/d') won't yield it, I think that's the correct choice for this route.

Note that absolute paths with/without wildcards don't have this ambiguity. In both rel/abs wildcards should match directories and files alike.

Which option do you guys think would be best? I already have a fixed patch for option 1 and 3 but I'd rather hear your thoughts before I introduce either.



P.s. another slight issue I ran into is the fact that fnmatch doesn't ignore os.curdir:

    >>> fnmatch.fnmatch('./a', 'a')
    False
msg152946 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-02-09 14:49
I have to say that the non-obvious subtleties you point out in your rglob make me think I personally would probably opt to use Nick's module directly instead, so that I was sure what I was getting.
msg152949 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-02-09 15:18
> I have to say that the non-obvious subtleties you point out in your rglob make me think I personally would probably opt to use Nick's module directly instead, so that I was sure what I was getting.
>
> ----------

+1
msg152950 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-09 15:31
> I have to say that the non-obvious subtleties you point out in your rglob make me think I personally would probably opt to use Nick's module directly instead, so that I was sure what I was getting.

I didn't notice these corner cases initially because of their distance from the main use case. The main use-case is to glob the current directory or an absolute path tree using a wildcard, esp for finding all files of a given type. This leaves no ambiguity.

I'd like the edge cases of relative paths to behave in a documented and sensible way, but they pale in comparison to the usefulness of the proposal imho.

I believe we should decide what's the most useful and sensible behavior and have a shortcut for that.
msg152951 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-02-09 15:40
Hmm.  Just to make it clear where I'm coming from, though, I should also point out that I use rdiff-backup, which uses the **/yadayada syntax, and I hate it any time I have to try to figure out what such a glob is going to actually match.
msg152952 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-09 15:46
> * Behave like a glob for every subdirectory. Meaning that every
> relative path gets a '*/' prepended to it. Eg rglob('c/d') started
> from the directory 'a' will yield 'a/b/c/d'.

That's what I would expect. That way, rglob('__init__.py') would find
all files named __init__.py beneath the current directory.

> P.s. another slight issue I ran into is the fact that fnmatch doesn't
> ignore os.curdir:
> 
>     >>> fnmatch.fnmatch('./a', 'a')
>     False

Sounds ok. fnmatch is a low-level lexical thing.
msg152953 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-09 16:03
>> * Behave like a glob for every subdirectory. Meaning that every
>> relative path gets a '*/' prepended to it. Eg rglob('c/d') started
>> from the directory 'a' will yield 'a/b/c/d'.

> That's what I would expect. That way, rglob('__init__.py') would
> find all files named __init__.py beneath the current directory.

Perhaps we should make a single exemption for double dots eg rglob('../../__init__.py') starts the walk 2 folders out of the curdir and looks for '*/__init__.py'.
msg152955 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-09 16:16
> >> * Behave like a glob for every subdirectory. Meaning that every
> >> relative path gets a '*/' prepended to it. Eg rglob('c/d') started
> >> from the directory 'a' will yield 'a/b/c/d'.
> 
> > That's what I would expect. That way, rglob('__init__.py') would
> > find all files named __init__.py beneath the current directory.
> 
> Perhaps we should make a single exemption for double dots eg
> rglob('../../__init__.py') starts the walk 2 folders out of the curdir
> and looks for '*/__init__.py'.

This would be quirky. I don't think '..' should be treated specially.
(there's also the symlinks problem)
msg152957 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-09 16:27
> This would be quirky. I don't think '..' should be treated specially.
> (there's also the symlinks problem)

Again with 'a/b/c/d' and let's add a file 'a/b/png'.

If the curdir is 'c' and you use rglob('../pn*') you won won't find '../png' as you would be walking only in the curdir. I think that route would mean we should document double dots aren't supported.
msg152958 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-09 16:30
> Again with 'a/b/c/d' and let's add a file 'a/b/png'.
> 
> If the curdir is 'c' and you use rglob('../pn*') you won won't find
> '../png' as you would be walking only in the curdir.

That depends how you implement it. If you detect that ".." exists and
glob for "pn*" inside it, you will probably find "../png".
msg152959 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-09 16:35
> That depends how you implement it. If you detect that ".." exists and
glob for "pn*" inside it, you will probably find "../png".

Yes, that's what I meant by a "single exemption for double dots". The solution should start the walk from wherever the double dots lead it to.

I believe we agree.
msg152960 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-09 16:36
> Yes, that's what I meant by a "single exemption for double dots". The
> solution should start the walk from wherever the double dots lead it
> to.

I don't think that's a single exemption; you should use the same
algorithm for all globs (e.g. "a/*.py"), shouldn't you?
msg152962 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-09 16:46
> you should use the same algorithm for all globs (e.g. "a/*.py"), shouldn't you?

That specific string would start the walk from the current directory IIUC.
msg152965 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-09 16:53
> > you should use the same algorithm for all globs (e.g. "a/*.py"), shouldn't you?
> 
> That specific string would start the walk from the current directory IIUC.

Yes but would it match b/a/setup.py?
msg152966 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-02-09 17:39
How about having separate starting path and glob arguments, where the glob cannot contain any directory?

I'm -1 on this function as it stands.  My vote could change if the final semantics are intuitive and unambiguous.  (It's OK if getting the correct intuition requires understanding a brief explanation.)
msg152968 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-09 17:44
> My vote could change if the final semantics are intuitive and unambiguous.

How about "match the given glob from this directory and any
sub-directory"?
That sounds quite intuitive and unambiguous to me.
msg152971 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-02-09 18:03
So given

  /home/a
  /home/a/k.py
  /home/a/c/j.py
  /home/b/z.py
  /home/b/c/f.py

and a current directory of /home/a, we'd have:

   pattern   matches
   -------   -------
   *.py      k.py, c/j.py
   c/*.py    c/j.py
   c*        c               [?]
   ../*.py   ?
   ../c/*.py ?

Thinking about those .. cases makes my brain hurt :)  What does it mean to match '../*.py' when I'm recursing into the c subdirectory?  What does it mean in the current directory, for that matter?

As you can see, your short explanation has left me with one case where I have a question, and one set of cases where my intuition completely fails.

--David
msg152982 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-09 20:36
Given

  /home/a
  /home/a/k.py
  /home/a/c/j.py
  /home/b/z.py
  /home/b/c/f.py

and a current directory of /home/a, we'd have:

   pattern     matches
   -------     -------
   *.py        k.py, c/j.py
   c/*.py      c/j.py
   c*          c
   ../*.py     ../a/k.py, ../a/c/j.py, ../b/z.py, ../b/c/f.py
   ../c/*.py   ../b/c/f.py, ../a/c/j.py

For relative paths the double dots decide where to start walking and from then on you can imagine a glob on every subdir. In the last 2 examples the glob would have been '*.py' and 'c/*.py' respectively.
msg152986 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-02-09 21:01
Well, in that case I would expect that the argument 'c/*.py' would start walking in the c directory, but I definitely did not get that impression from Antoine's explanation of how the function works.

I again advocate separating the starting path specification from the glob specification, with the latter not allowing directories.  A two argument function.
msg152990 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-09 22:12
Given the complexities proposed for the dir matching, I'm shifting back to a solid -1 on this. Trying to match multi-part directories with globs is a nightmare and I currently don't allow it at all in walkdir. Instead, dir filtering and file filtering are expressed separately.

In 0.4, walkdir will allow the following to search the current directory for all Python source files in subdirectories named 'c':

    walk_files(included_files=["*.py"], included_dirs="c")

This discussion does mean I plan to add path filtering though, that works on the "dirpath" value directly (and *will* allow multi-part matches). So the "Python source files in subdirectories named 'd' of directories named 'c'" will be written something like:

    walk_files(included_files=["*.py"], included_paths="*/c/d")

I also plan to explore options to make the pattern matching function configurable (i.e. so you can pass in something like "match=re.search")
msg152994 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-09 22:37
> This discussion does mean I plan to add path filtering though, that
> works on the "dirpath" value directly (and *will* allow multi-part
> matches). So the "Python source files in subdirectories named 'd' of
> directories named 'c'" will be written something like:
> 
>     walk_files(included_files=["*.py"], included_paths="*/c/d")

That's not different from rglobbing "*/c/d/*.py", is it? :)
msg152998 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-02-09 22:48
I don't know, is it?  From what has been said so far I'd expect */c/d/*.py to look for *.py files in all c/d subdirectories of direct subdirectories of the current directory, and subdirectories of those c/d directories.  But I wouldn't expect the c/d matching to go any deeper than that one level.  Whereas with Nick's formulation, I would.  And I know at a glance, without reading any documentation, what Nick's formulation does, whereas I very much don't with rglob.
msg152999 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-09 22:50
> I don't know, is it?  From what has been said so far I'd expect
> */c/d/*.py to look for *.py files in all c/d subdirectories of direct
> subdirectories of the current directory, and subdirectories of those
> c/d directories.  But I wouldn't expect the c/d matching to go any
> deeper than that one level.

That would be a normal glob, not a recursive glob. We already have
glob.glob().
msg153001 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-09 23:01
As R. David Murray has suggested I think there may be a middle ground.

    def rglob(fn_filter, root='.'):

That would mean the default use case is still easy to remember as rglob('*.py') and also there aren't any explanations needed for how this function works.

Though I'm not sure which of these API's I like better.
msg153002 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-02-09 23:08
@antoine: no, my description involves recursion.  It assumes that the path portion of the glob specifies the directories to *start* from, but that the filename glob portion then applies recursively to any of those start directories.

The alternative interpretation of the pattern, that it is fully matched against the CWD, takes you back to having to explain what '..' means when matched against a given location in the recursion, starting with the CWD.  I clearly don't have a good intuition about that.
msg153003 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-02-09 23:09
Oh, yeah, and there's still the question of whether or not directories are matched by the terminal glob pattern, which I would naively expect they would be, in either interpretation, but I wouldn't be sure.
msg153004 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-09 23:10
> The alternative interpretation of the pattern, that it is fully
> matched against the CWD, takes you back to having to explain what '..'
> means when matched against a given location in the recursion, starting
> with the CWD.  I clearly don't have a good intuition about that.

My alternative interpretation is that the pattern is being matched
against whatever subdir of the current dir (and the current dir itself).
The matching is not different from glob()'s.
msg153005 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-02-09 23:47
Ah, OK, so what you are saying is that rglob returns the concatenation of the results of running ls with the argument glob in each subdirectory of a walk starting with the current directory, except that the returned names have paths anchored in the current directory?
msg153006 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-10 00:42
> Ah, OK, so what you are saying is that rglob returns the concatenation
> of the results of running ls with the argument glob in each
> subdirectory of a walk starting with the current directory, except
> that the returned names have paths anchored in the current directory?

Yup.
msg153013 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-10 04:03
FTR, packaging has an extended glob function (not my code) which supports sets (with { , }) and recursivity (**).
msg153085 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-02-11 01:00
I would like to see this one API remain simple and leave more complex tasks to os.walk et al.
msg153192 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-12 10:07
Raymond Hettinger, by simple do you mean a single argument rglob function? Or do you mean you prefer glob doesn't get a new function?
msg153208 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-02-12 17:14
The last forumulation of what rglob does "apply the glob to the current directory and all subdirectories recursively, returning the joined list with filenames anchored in the current directory" is simple and intuitive enough for me.  (I'm not sure it can be implemented using Nick's module.)

The discussions about starting directory and the meaning of '..' and implicitly prepending '*' totally confused me.  If the formulation in my first paragraph is the intended semantics and the function actually implements those semantics (however it does it under the hood), then I'm fine with it.
msg153281 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-02-13 17:20
I noticed this implementation on PyPI http://packages.python.org/rglob/ which sort of has rglob defined as

    def rglob(pattern, base='.'):

Which seems like the most comprehensible way of doing this, though not the most compact.

The code itself isn't in the best shape http://cpiekarski.com/2011/09/23/python-recursive-glob/ eg it uses list comprehensions instead of generators. But I think I like this API better as it really is easier to explain.
msg153300 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-13 20:48
> But I think I like this API better as it really is easier to explain.

I think you got it wrong. Looking at the implementation, it is no
different from the API proposed above, except that it also lets you
choose the start dir.
msg154600 - (view) Author: Michael.Elsdörfer (elsdoerfer) Date: 2012-02-29 07:56
This should absolutely be implemented as **:

- It is pretty much standard. Recursive globbing is not supported everywhere, but when it is, ** is used.

- A separate function will not allow me to let the *user* to decide when recursion should be used. I find this most important. When I need to find files internally, I always do so using os.walk etc. When I use glob, it is because I want to provide an interface to my users. 

- The change to support ** is actually pretty trivial. I have implemented this as a module here: https://github.com/miracle2k/python-glob2/

- It's backwards-compatible - or close enough anyway. ** is currently perfectly nonsensical, making it a meaningful syntax element is acceptable, I think.
msg154603 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-29 08:30
FWIW, I've also come around to the point of view that it's worthwhile to provide stdlib support for the "**" convention (specifically due to the UI aspect that Michael mentions).
msg157278 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-04-01 12:14
I found this comprehensive description of the '**' convention at http://www.codeproject.com/Articles/2809/Recursive-patterned-File-Globbing that can translate directly to unittests.

I'd like to fix the patch for these specs but should it be in a new rglob function or in the existing glob.glob()? I think it should be a new one to avoid any edge-case compatibility concerns even though on face value there shouldn't be any.
msg157280 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-01 12:20
> I found this comprehensive description of the '**' convention at
> http://www.codeproject.com/Articles/2809/Recursive-patterned-File-Globbing that can translate directly to unittests.
> 
> I'd like to fix the patch for these specs but should it be in a new
> rglob function or in the existing glob.glob()? I think it should be a
> new one to avoid any edge-case compatibility concerns even though on
> face value there shouldn't be any.

I think it should be the existing glob.glob(). We won't introduce a new
function any time we add a new syntactic feature in the glob
mini-language.
msg157281 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-04-01 12:25
I don't have a strong opinion on "rglob vs glob" so whichever way the majority here thinks is fine by me.
msg157287 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-01 13:42
For "**" globbing see http://ant.apache.org/manual/dirtasks.html#patterns .

If we extend pattern syntax of templates, why not implement Perl, Tcl or Bash extensions?
msg157290 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-04-01 14:04
On Sun, Apr 1, 2012 at 4:42 PM, Serhiy Storchaka <report@bugs.python.org> wrote:
> For "**" globbing see http://ant.apache.org/manual/dirtasks.html#patterns

They mention that "mypackage/test/ is interpreted as if it were mypackage/test/**" so that's not really an option. I'm pretty sure we should only recurse if "**" appears explicitly.

> If we extend pattern syntax of templates, why not implement Perl, Tcl or
> Bash extensions?
>

I'm not sure what you mean here but if it's that ##{} stuff then it should probably be discussed in a separate issue as it's not related to recursive globs.
msg159265 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-04-25 10:12
I added the doublestar functionality to iglob and updated the docs and tests.

Also, a few readability renames in that module were a long time coming.

I'd love to hear your feedback.
msg160057 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-05-06 06:21
So, anybody for or against this patch? I'd really like to see this feature make its way in...
msg160079 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-06 13:08
> So, anybody for or against this patch? I'd really like to see this
> feature make its way in...

I think the feature is useful, but someone needs to review the patch.
Sorry if it takes some time.
msg160743 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-15 16:42
I'm looking at the tests and I don't understand why '**/bcd/*' should match 'a/bcd/efg/ha'. Am I missing something?
msg176996 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-05 17:51
Here is a patch which implements recursive globbing which conforms to Bash globbing with "globstar" option.

For backward compatibility recursive globbing off by default and works only if new argument "recursive" is true (default is False). I am not sure this is a better variant. Possible the default should be True. '**' pattern is very unlikely in old code. However recursive globbing on arbitrary pattern and arbitrary tree is not safe, it can hang on recursive symlinks.

The patch contains changes from issue16618.
msg178810 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-02 12:32
Patch updated for current tip.
msg179947 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-14 16:02
I should add a symlink loop detecting to _rlistdir() as Antoine advised me on IRC.
msg179979 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-14 21:50
In fact glob() is already protected against an endless recursion (in the same way as Bash). The level of recursion is simply limited by the maximum length of the path. So I did not change the implementation, I have just added a test for symlink loop. I also corrected the other new tests so that they should not fail on the platform without symlinks.
msg203177 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-11-17 14:35
In updated patch fixed warning/errors when ran with -b or -bb options.
msg212878 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-03-07 14:06
Oops, Python 3.4 has ** support in pathlib, but we missed Serhiy's patch for the glob module itself. We should resolve that discrepancy for 3.5 :)
History
Date User Action Args
2014-04-04 13:19:20serhiy.storchakasetkeywords: + needs review
2014-03-07 14:06:44ncoghlansetmessages: + msg212878
versions: + Python 3.5, - Python 3.4
2013-11-17 14:35:49serhiy.storchakasetfiles: + glob_recursive_4.patch

messages: + msg203177
2013-01-14 21:51:00serhiy.storchakasetfiles: + glob_recursive_3.patch

messages: + msg179979
stage: needs patch -> patch review
2013-01-14 16:02:28serhiy.storchakasetmessages: + msg179947
stage: patch review -> needs patch
2013-01-02 12:33:27serhiy.storchakasetfiles: - glob_recursive.patch
2013-01-02 12:32:59serhiy.storchakasetstage: needs patch -> patch review
2013-01-02 12:32:39serhiy.storchakasetfiles: + glob_recursive_2.patch

messages: + msg178810
2012-12-29 22:24:20serhiy.storchakasetassignee: serhiy.storchaka
2012-12-05 17:51:49serhiy.storchakasetfiles: + glob_recursive.patch

nosy: + loewis
messages: + msg176996

dependencies: + Different glob() results for strings and bytes
2012-11-04 17:44:18serhiy.storchakasetstage: patch review -> needs patch
2012-06-29 00:05:04pitrousetversions: + Python 3.4, - Python 3.3
2012-05-15 16:42:24pitrousetmessages: + msg160743
2012-05-06 19:48:14cvrebertsetnosy: + cvrebert
2012-05-06 13:08:40pitrousetmessages: + msg160079
2012-05-06 08:21:52pitrousetstage: patch review
2012-05-06 06:21:28ubershmekelsetmessages: + msg160057
2012-04-25 10:12:53ubershmekelsetfiles: + glob.doublestars.patch

messages: + msg159265
2012-04-01 14:04:52ubershmekelsetmessages: + msg157290
2012-04-01 13:42:42serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg157287
2012-04-01 12:25:50ubershmekelsetmessages: + msg157281
2012-04-01 12:20:43pitrousetmessages: + msg157280
2012-04-01 12:14:45ubershmekelsetmessages: + msg157278
2012-03-23 12:45:03eli.benderskysetnosy: - eli.bendersky
2012-02-29 08:30:07ncoghlansetmessages: + msg154603
2012-02-29 07:56:34elsdoerfersetnosy: + elsdoerfer
messages: + msg154600
2012-02-13 20:48:46pitrousetmessages: + msg153300
2012-02-13 17:20:31ubershmekelsetmessages: + msg153281
2012-02-12 17:14:25r.david.murraysetmessages: + msg153208
2012-02-12 10:07:30ubershmekelsetmessages: + msg153192
2012-02-11 01:00:18rhettingersetnosy: + rhettinger
messages: + msg153085
2012-02-10 04:03:36eric.araujosetmessages: + msg153013
2012-02-10 00:42:46pitrousetmessages: + msg153006
2012-02-09 23:47:21r.david.murraysetmessages: + msg153005
2012-02-09 23:10:28pitrousetmessages: + msg153004
2012-02-09 23:09:44r.david.murraysetmessages: + msg153003
2012-02-09 23:08:35r.david.murraysetmessages: + msg153002
2012-02-09 23:01:06ubershmekelsetmessages: + msg153001
2012-02-09 22:50:51pitrousetmessages: + msg152999
2012-02-09 22:48:49r.david.murraysetmessages: + msg152998
2012-02-09 22:37:30pitrousetmessages: + msg152994
2012-02-09 22:12:41ncoghlansetmessages: + msg152990
2012-02-09 21:01:04r.david.murraysetmessages: + msg152986
2012-02-09 20:36:28ubershmekelsetmessages: + msg152982
2012-02-09 18:03:55r.david.murraysetmessages: + msg152971
2012-02-09 17:44:10pitrousetmessages: + msg152968
2012-02-09 17:39:14r.david.murraysetmessages: + msg152966
2012-02-09 16:53:51pitrousetmessages: + msg152965
2012-02-09 16:46:38ubershmekelsetmessages: + msg152962
2012-02-09 16:36:44pitrousetmessages: + msg152960
2012-02-09 16:35:01ubershmekelsetmessages: + msg152959
2012-02-09 16:30:34pitrousetmessages: + msg152958
2012-02-09 16:27:02ubershmekelsetmessages: + msg152957
2012-02-09 16:16:53pitrousetmessages: + msg152955
2012-02-09 16:03:45ubershmekelsetmessages: + msg152953
2012-02-09 15:46:55pitrousetmessages: + msg152952
2012-02-09 15:40:31r.david.murraysetmessages: + msg152951
2012-02-09 15:31:51ubershmekelsetmessages: + msg152950
2012-02-09 15:18:24eli.benderskysetmessages: + msg152949
2012-02-09 14:49:54r.david.murraysetnosy: + r.david.murray
messages: + msg152946
2012-02-09 14:08:18ubershmekelsetmessages: + msg152945
2012-02-09 12:07:39ncoghlansetmessages: + msg152943
2012-02-09 04:15:38eli.benderskysetmessages: + msg152925
2012-02-09 01:58:35ncoghlansetmessages: + msg152922
2012-02-09 00:10:51pitrousetmessages: + msg152918
2012-02-08 23:57:49ncoghlansetmessages: + msg152917
2012-02-08 18:59:48eli.benderskysetmessages: + msg152898
2012-02-08 17:18:47eric.araujosetmessages: + msg152894
title: Add a recursive function to the glob package -> Support recursive globs
2012-02-08 16:27:51pitrousetmessages: + msg152882
2012-02-08 16:15:07eli.benderskysetmessages: + msg152880
2012-02-08 16:09:59pitrousetmessages: + msg152879
2012-02-08 16:01:20eric.araujosetmessages: + msg152877
2012-02-08 15:59:23eli.benderskysetmessages: + msg152876
title: Support recursive globs -> Add a recursive function to the glob package
2012-02-08 15:54:34eric.araujosettitle: Add a recursive function to the glob package -> Support recursive globs
2012-02-08 15:52:05pitrousetmessages: + msg152873
2012-02-08 15:43:20eli.benderskysetmessages: + msg152871
2012-02-08 15:21:43eric.araujosetnosy: + eric.araujo

messages: + msg152868
versions: + Python 3.3, - Python 3.4
2012-02-08 14:14:01pitrousetmessages: + msg152858
2012-02-08 14:05:23eli.benderskysetnosy: + eli.bendersky
messages: + msg152857
2012-02-08 13:40:14ncoghlansetmessages: + msg152856
2012-02-08 12:57:24pitrousetmessages: + msg152853
2012-02-08 12:55:58ncoghlansetmessages: + msg152852
2012-02-08 12:49:16pitrousetnosy: + pitrou
messages: + msg152849
2012-02-08 12:47:34ubershmekelsetmessages: + msg152847
2012-02-08 12:10:37ncoghlansetnosy: + ncoghlan
messages: + msg152846
2012-02-08 11:53:46giampaolo.rodolasetnosy: + giampaolo.rodola
2012-02-08 11:22:26ubershmekelcreate