April 16, 2004
Converting Word documents to text

The thing I love about Python is that if it sounds simple, it usually is. Here's a script to save all Word documents in and below a given directory to text:

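import fnmatch, os, sys, win32com.client

# Start Word through COM (needs a local copy of Word).
wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
try:
    for path, dirs, files in os.walk(sys.argv[1]):
        for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.doc')]:
            print("processing %s" % doc)
            wordapp.Documents.Open(doc)  # makes it the ActiveDocument
            docastxt = os.path.splitext(doc)[0] + '.txt'  # swap the extension
            wordapp.ActiveDocument.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatTextLineBreaks)
            wordapp.ActiveWindow.Close()
finally:
    wordapp.Quit()  # don't leave an invisible Word running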

Requires Python and the Python for Windows extensions.
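
The file-finding half of the job doesn't need Word at all, so it can be tried on its own. Here's a minimal sketch of just that part; the helper names find_word_docs and text_path_for are made up for illustration. It builds the .txt name with os.path.splitext, which swaps the extension whatever its case, where stripping characters with rstrip would leave an upper-case .DOC untouched (and on Windows fnmatch does match those).

```python
import fnmatch
import os


def find_word_docs(root):
    """Yield the absolute path of every *.doc file in or below root."""
    for path, dirs, files in os.walk(root):
        for filename in fnmatch.filter(files, '*.doc'):
            yield os.path.abspath(os.path.join(path, filename))


def text_path_for(doc):
    """Return doc's path with its extension swapped for .txt."""
    return os.path.splitext(doc)[0] + '.txt'
```

Looping over find_word_docs(sys.argv[1]) and printing text_path_for(doc) for each result is a cheap dry run before letting Word loose on the real files.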

My stepmother has been horribly virus hit, so I'm off to rebuild her PC this weekend. Sigh. I'll burn a CD with all her data before the rebuild, of course, but I'm worried that some or all of her documents might be infected - so I'll back them all up as text files, too, so that in the last resort, she'll still have all her work.

Update: As Ryan points out, it also requires a copy of Word. Thanks, Ryan: I should have mentioned that.

Another update: I cleaned this up a little and submitted it as a Python Cookbook recipe: Converting Word documents to text.

BTW, the rebuild was a nightmare. Win2K refused to recognise the modem, which is, of course, the worst thing that could possibly have happened. I ended up buying a new one. The whole job, including running Windows Update etc, ended up taking over nine hours!

Posted to Python by Simon Brunning at April 16, 2004 02:32 PM

You forgot to mention that it also requires Microsoft Office.

I can do the same thing in VB Script. And Guess what! You don't need to install anything except MS Office.

Put the following text in a file with an extension of .vbs and you can run it using cscript.exe or wscript.exe (both come with Windows). First arg is the doc file, second arg is the text file.

Set wdApp = CreateObject("Word.Application")

Set docNew = wdApp.Documents.Open(WScript.Arguments(0))
' 2 is wdFormatText, i.e. plain text
docNew.SaveAs WScript.Arguments(1), 2, False, "", True, "", False, False, False, False, False
wdApp.Quit ' otherwise Word keeps running in the background

Posted by: Ryan Ackley on April 16, 2004 07:18 PM

I usually just do:

strings filename | less

It doesn't require microsoft office.

Posted by: Chris on April 16, 2004 10:39 PM

I don't know VB Script at all. It looks like there's a line for line correspondence between the VB Script code you posted and the Python version - but you didn't replicate the whole script. Let's see the whole thing!

I have *no* idea what your command does. Does it really convert Word documents into usable text form? If so, then that's really impressive - but I suspect that it would just pull out too much.

Posted by: Simon Brunning on April 19, 2004 10:09 AM

Yeah... strings pulls out too much, but if you cut the head and tail (and provided the document is mostly text) you get fairly readable output. The method you posted is a good conversion if you are running Windows and have Word on the system. I always end up getting Word attachments at work, where I don't have access to a Windows PC, and that's where my method usually comes in handy. If the document looks important enough I'll sometimes open it up in OpenOffice, but it's surprising how often someone will attach a Word document containing only text to an email rather than sending plain text to begin with.
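
For anyone who hasn't met strings: roughly speaking, it prints every run of four or more printable characters it finds in a file. Here's a minimal Python sketch of that idea, assuming plain ASCII text only. The name printable_runs is made up for illustration, and the real tool handles more cases than this.

```python
import re

# Four or more consecutive printable ASCII characters (space through tilde).
_PRINTABLE_RUN = re.compile(br'[\x20-\x7e]{4,}')


def printable_runs(data):
    """Return the printable-ASCII runs found in a bytes object, in order."""
    return [run.decode('ascii') for run in _PRINTABLE_RUN.findall(data)]
```

Read the document in binary mode and pass the bytes in. Any text that Word happens to store as 16-bit characters would be missed by this ASCII-only version.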

Posted by: Chris on April 19, 2004 05:14 PM

God, yes, tell me about it! Bloody Word attachments; there ought to be a law.

Posted by: Simon Brunning on April 19, 2004 05:17 PM

Rather snotty tone from the VB Script user, considering it has no ready recursive equivalent to os.walk.
Anyway, one of these days I will try to get Python to 'walk' through all my Outlook folders, and output to text. Do you know if anyone has already done it?

Posted by: Manuel M. Garcia on April 19, 2004 07:34 PM

There's always Antiword for converting Word files to text: http://www.winfield.demon.nl/

Works great for those times you don't have Word, or Office, or Windows... or even python! ;)

Posted by: Pete Prodoehl on April 19, 2004 07:58 PM

I haven't heard of anything doing *exactly* what you are talking about, but it's probably worth looking at the SpamBayes Outlook Addin[1] for examples of working with Outlook from Python.

Thanks; another one for the toolbox...

[1] http://starship.python.net/crew/mhammond/spambayes/

Posted by: Simon Brunning on April 20, 2004 09:16 AM

That is the whole script. That's all you need to do the same thing you did.

Not trying to be snotty, just pointing out that it is the same basic technology and it isn't that amazing. Congratulations, Python is yet another scripting language that can take advantage of Windows COM.

Posted by: Ryan Ackley on May 3, 2004 10:03 PM

*I'm* not trying to be snotty either, Ryan, but no it *doesn't* do the same thing. Your script converts a single document, mine does all documents in and below a given directory.

Posted by: Simon Brunning on May 4, 2004 09:19 AM

I need to read both text files and Word documents from a VB application. Please help me with some code in VB.


Posted by: prasad on May 11, 2004 12:21 PM

Sorry, Prasad, I don't know (or want to know) VB. Why not use Python? ;-)

Posted by: Simon Brunning on May 11, 2004 04:46 PM

I will show you an easier way:

import win32com.client
app = win32com.client.Dispatch('Word.Application')
doc = app.Documents.Open('c:\\files\\mydocument.doc')
print(doc.Content.Text)
app.Quit()  # otherwise Word keeps running in the background

Please have a look at the Pywin32 Script Collection at http://www.win32com.de

Posted by: Mustafa Görmezer on July 27, 2005 09:46 AM

The Pywin32 Script Collection is now at http://win32com.goermezer.de

Posted by: Mustafa Görmezer on May 1, 2008 08:39 PM
