New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a generic directory walker method to avoid symlink attacks #57943
Comments
This is an offspring of bpo-4489 (which is a security bug). The method is AFAIU intended to be private. As shown in the discussion of the mentioned bpo-4489, there is a whole family of attacks that exploit the time window between gathering path names and executing a function on them. A general description of this problem can be found in: https://www.securecoding.cert.org/confluence/display/seccode/POS35-C.+Avoid+race+conditions+while+checking+for+the+existence+of+a+symbolic+link While the consequences in rmtree() are probably most dramatic, other recursive functions could benefit too (chmodtree() and chowntree() were mentioned) so Charles-François suggested to write a "generic walker method that would take as argument the methods to call on a directory and on a file (or link)". Some (probably) necessary helper functions has been already implemented in bpo-4761 (*at()) and bpo-10755 (fdlistdir()). Has there already been done any work? Ross mentioned he wanted to take a stab? |
I'm working on a library of general directory walking tools that will hopefully make their way back into the stdlib at some point (http://walkdir.readthedocs.org). They're designed to filter and transform the output of os.walk (and similar iterators) in various ways. It may provide a good environment for prototyping a general purpose "tree_map" for applying an operation to a filesystem tree without being vulnerable to symlink attacks. |
Unfortunately, I'm rather busy at the moment but when I get some free time and if no one else wants to work on it then I'll take a look. |
Since walkdir is currently entirely based on returning filesystem paths as strings (just like os.walk()) and hence shares the pervasive symlink attack vulnerability, I'm particularly interested in the question of whether or not the various *at APIs can be used to avoid symlink attacks if we just have a os.walkfd() API that emits a (dirfd, subdirs, files) triple instead of the os.walk style (dirpath, subdirs, files). The reason I'd find that interesting is that many of walkdir's filtering steps (notably those for including and excluding directories) don't care about the value of dirpath, so they could still be used with such an API. Thoughts? |
Another, possibly better, alternative would be to produce a tuple-subclass that adds a separate "dirfd" attribute to the (dirpath, subdirs, files) triple. I'll stop talking about the walkdir implications here. Instead, I've created a corresponding issue on walkdir's own tracker: |
Be aware that you have to manage dirfd's lifetime, which can make things |
Here's a possible walkfd() implementation. Example: topfd = os.open(sys.argv[1], os.O_RDONLY)
for rootfd, dirs, files in os.walkfd(topfd):
print(rootfd, dirs, files)
$ ./python ~/testwalkfd.py /etc/apt/
3 ['sources.list.d', 'preferences.d', 'trusted.gpg.d', 'apt.conf.d']
['trustdb.gpg', 'trusted.gpg~', 'sources.list', 'trusted.gpg']
4 [] []
4 [] []
4 [] []
4 [] ['70debconf', '01autoremove', '00trustcdrom']
[44194 refs]
""" AFAICT, a safe rmtree could be implemented simply with
Basically, this means that doing: is valid whereas print([(fstat(rootfd), dirs, files) for (rootfd, dirs, files) in
I'm not sure I understand this. Why "you can't open that directory |
Hmm, sorry, I must have misremembered. I thought openat didn't follow As for the patch, I think there's a problem with the API: + This behaves exactly like walk(), except that it accepts a file descriptor It doesn't tell you to which dirname corresponds dirfd, so you don't
Also, walkfd would be easier to use if callable with a str or bytes path |
Thanks for that Charles-François - do you mind if I adapt that for walkdir? The changes I would make are basically those that Antoine pointed out:
*I'm still interested in opinions on this aspect. I see 5 main possibilities:
I'm currently leaning towards the simple 4-tuple approach - it's simple, explicit and the walkdir pipeline operations can easily accept either underlying iterable by using indexing operations rather than tuple unpacking (a change I already planned to make so that the pipeline passed along the objects from the underlying iterable, anyway) (I'll also need to create ctypes-based variants of all the relevant *at functions, since the stdlib wrappers won't be available in existing versions of Python) |
I would also take that approach. It seems simplest to me. |
OK, I was afraid I had missed something. > As for the patch, I think there's a problem with the API Yes, it was really a proof-of-concept, the directory names are missing.
Agreed.
Well, that's not easy:
for dirfd, dirs, files in os.walkfd(topfd, topdown=False):
for file in files:
os.unlinkat(dirfd, file)
for dir in dirs:
os.unlinkat(dirfd, dir, os.AT_REMOVEDIR)
Of course not, go ahead. I'll update walkfd() accordingly, and write
Sounds good to me. |
OK, os.walkfd is sounding good:
As far as walkdir integration goes, I currently plan to add it as a "Directory Walking" subsection in shutil before the first alpha. However, it needs a few more updates in PyPI first (e.g. preserving the tuples produced by the underlying iterators, making sure it behaves itself when handed binary paths). I'll post to python-dev about it before I actually commit anything. |
Here's a patch with tests and documentation.
|
Here's an updated version. Note that I'm not pushing towards changing the current behavior for root, dirs, files in os.walk(top):
for file in files:
f = open(file)
for n, l in enumerate(f, 1):
pass
print(file, n) If, suddently, a symlink to a directory appeared in files, this will |
Hmm, given the various *at() APIs that sort alphabetically next to their path based counterparts, perhaps we can make the naming consistency change go the other way? (i.e. listdirfd() and walkfd()). Even if POSIX puts the fd at the front, do we really have to follow them in a silly naming scheme? And as per the python-dev discussion, +1 for being consistent with os.walk() when it comes to symlinks to directories. Aside from the naming question, is there anything else holding this up? |
Taking a closer look at the current naming scheme in the os module, fdopen() appears to be the only current function that uses the 'fd' prefix. All the other operations that accept a file descriptor just use 'f' as the prefix (fchmod, fchown, faccess, etc). Given that, flistdir() and fwalk() seem like the most consistent choices of name for APIs that aren't directly matching an underlying POSIX function name. |
There's something I don't understand in the patch: why does _are_same_file examine st_mode? |
Well, that seems OK for me.
It doesn't have to, that's actually useless. The only thing that bothers me is that it needs O(height of directory |
One other thing: is it deliberate to silence errors in the following snippet? + try: |
It's to mimic os.walk()'s behavior: http://hg.python.org/cpython/file/bf31815548c9/Lib/os.py#l268 |
Here are two new versions, both addressing Antoine's and Nick's comments:
|
I think the O(depth) version is fine. The O(1) version is quite more On modern systems you have at least 1024 fds, so the restriction |
I agree.
Actually, I think you're much more likely to run above the max |
I'm also a fan of using the simpler approach unless/until we have solid evidence that the file descriptor limit could be a problem in practice. A comment in the code mentioning the concern, along with the fact that there's an alternate algorithm attached to this tracker issue that avoids it would probably be appropriate though. |
New changeset 773a97b3927d by Charles-François Natali in branch 'default': |
Committed, thanks for the comments. Note to myself (and others that might be interested in the O(1)) version): |
It would be nice if the documentation of fwalk() explained why you would want to use it over walk(). |
How does the attached patch look? |
2012/5/12 Charles-François Natali <report@bugs.python.org>:
Okay, but explain what a "symlink attack" is. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: