October 05, 2006
The joy of os.walk()

One thing that came up yesterday was that people still aren't feeling comfortable with os.walk(). Which is a shame - I love it. I mentioned that I'd used it only that day, in a nice little script that locates malformed XML in a directory tree, and Simon suggested that I post it. So, here it is:

for path, dirs, files in os.walk(os.getcwd()):
    for xml in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.xml')]:
        try:
            ElementTree.parse(xml)
        except (SyntaxError, ExpatError):
            print xml, "\tBADLY FORMED!"

Syntax highlighted version, with imports: find_dodgy_xml.py.

Notice that it takes more code to import ElementTree than it does to do the actual work! It'll be nice when we can rely on Python 2.5 being available, but that's a while away.
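For the curious, the import dance looks roughly like the sketch below. This is an illustration rather than a copy of find_dodgy_xml.py: it assumes the standalone package is importable as elementtree on older interpreters, and that ExpatError comes from xml.parsers.expat.

```python
# Sketch only: the exact imports in find_dodgy_xml.py may differ.
try:
    # Python 2.5 and later bundle ElementTree in the standard library.
    from xml.etree import ElementTree
except ImportError:
    # Assumption: older interpreters have the standalone elementtree package.
    from elementtree import ElementTree

# The parser error that the except clause in the script catches.
from xml.parsers.expat import ExpatError
```

Either branch leaves the name ElementTree bound to a module with a parse() function, so the rest of the script doesn't need to care which one it got.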

Hmm, actually, this is such a common pattern that it's probably worth a helper function:

def locate(pattern, root=os.getcwd()):
    for path, dirs, files in os.walk(root):
        for filename in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, pattern)]:
            yield filename

Syntax highlighted version: locate.py.

This simplifies the main loop to:

for xml in locate("*.xml"):
    try:
        ElementTree.parse(xml)
    except (SyntaxError, ExpatError):
        print xml, "\tBADLY FORMED!"
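If you'd rather have something you can paste in and test, here is a self-contained sketch of the same idea. A few things in it are my assumptions rather than part of the script above: find_bad_xml is a hypothetical name, root is a required argument, and the bad files are collected into a list instead of being printed, so it behaves the same whichever print syntax your interpreter wants.

```python
import fnmatch
import os
from xml.etree import ElementTree
from xml.parsers.expat import ExpatError


def locate(pattern, root):
    """Yield the absolute path of every file under root matching pattern."""
    for path, dirs, files in os.walk(root):
        for filename in files:
            if fnmatch.fnmatch(filename, pattern):
                yield os.path.abspath(os.path.join(path, filename))


def find_bad_xml(root):
    """Hypothetical helper: list the *.xml files under root that fail to parse."""
    bad = []
    for xml in locate("*.xml", root):
        try:
            ElementTree.parse(xml)
        except (SyntaxError, ExpatError):
            # Parse failures from ElementTree end up in one of these two.
            bad.append(xml)
    return bad
```

Calling find_bad_xml(os.getcwd()) and printing each entry gives much the same report as the loop above.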

Worth a cookbook recipe, or is it too simple?

Posted to Python by Simon Brunning at October 05, 2006 04:20 PM
Comments

Definitely worth a cookbook entry.

Posted by: Paul Moore on October 5, 2006 04:49 PM

Not too simple, IMO! But personally, I think Jason Orendorff's path module is the nicest way to do this sort of thing. Using path, your locate() call would look like:

from path import path
for xml in path('.').walkfiles('*.xml'): ...

Plus much more. It should be in the standard library, if you ask me. :-)

http://www.jorendorff.com/articles/python/path/

Graham

Posted by: Graham Fawcett on October 5, 2006 04:55 PM

Perhaps, but path *won't* be in the standard library, at least not in its current form. Guido has spoken. The path module does several different things - it deals with paths as strings, as well as actual files and directories. It should be broken up. Also, using "/" as a join operator isn't very popular.

Posted by: Simon on October 5, 2006 04:59 PM

I think you'd be better off using os.path.curdir instead of os.getcwd(), especially for the default argument value.

Also, I think that if you do

root = os.path.abspath(root)

once before the loop, you won't have to call abspath on every file you find.

And thirdly, I'm pretty sure you want to yield os.path.join(path, filename) rather than just raw filename.

Posted by: Marius Gedminas on October 5, 2006 08:27 PM

> os.path.curdir instead of os.getcwd()

You mean os.curdir? Good point. (This will mean that if the current directory changes between when the function is defined and when it is run, the newer current directory will be used.)
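To make the difference concrete, here is a small sketch. The function names eager and lazy are made up for the demonstration, and it changes into a scratch temporary directory only to show the effect.

```python
import os
import tempfile


def eager(root=os.getcwd()):
    # The default was computed once, when this def statement ran.
    return os.path.abspath(root)


def lazy(root=os.curdir):
    # os.curdir is just ".", so it is resolved against the cwd at call time.
    return os.path.abspath(root)


start = os.getcwd()
elsewhere = tempfile.mkdtemp()
os.chdir(elsewhere)
try:
    eager_result = eager()  # still the directory where eager() was defined
    lazy_result = lazy()    # the directory we just moved into
finally:
    os.chdir(start)
```

After the chdir, eager() keeps reporting the directory that was current when it was defined, while lazy() follows the move.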

> root = os.path.abspath(root)

Good idea.

> os.path.join(path, filename) rather than just raw filename

This was being done in the list comp. It's probably a better idea to move it out to the yield for readability. Also, I can use a generator expression rather than a list comp.

An updated version looks like:

def locate(pattern, root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in (filename for filename in files if fnmatch.fnmatch(filename, pattern)):
            yield os.path.join(path, filename)

Thanks, Marius.

Posted by: Simon on October 6, 2006 01:38 PM

def locate(pattern, root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)

is yet more succinct.
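A quick sanity check that fnmatch.filter does the same job as filtering with fnmatch.fnmatch by hand, sketched with a made-up list of file names:

```python
import fnmatch

# Made-up file names, purely for illustration.
files = ["a.xml", "b.txt", "c.xml", "notes.xml.bak"]

# Filtering by hand, one name at a time.
via_fnmatch = [name for name in files if fnmatch.fnmatch(name, "*.xml")]

# Letting fnmatch.filter do the whole list in one call.
via_filter = fnmatch.filter(files, "*.xml")
```

Both lists contain the same names in the same order, so swapping one for the other shouldn't change what locate() yields.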

Posted by: Tom Lynn on October 12, 2006 11:12 AM

Oooh, nice, thanks Tom.

Posted by: Simon on October 12, 2006 11:32 AM