April 16, 2004
Converting Word documents to text
The thing I love about Python is that if it sounds simple, it usually is. Here's a script to save all Word documents in and below a given directory to text:
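import fnmatch, os, sys, win32com.client
wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")  # also generates the wdFormat* constants
try:
    for path, dirs, files in os.walk(sys.argv[1]):  # sys.argv[1] is the directory to start from
        for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.doc')]:
            print("processing %s" % doc)
            wordapp.Documents.Open(doc)
            docastxt = os.path.splitext(doc)[0] + '.txt'  # rstrip('doc') would leave an upper-case .DOC in place
            wordapp.ActiveDocument.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatText)
            wordapp.ActiveDocument.Close()
finally:
    wordapp.Quit()  # don't leave an invisible Word running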
Requires Python and the Python for Windows extensions.
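If you want to try the directory-walking half without Word installed, here is a minimal sketch of just that part. The helper names find_word_docs and text_path_for are mine, invented for illustration; they aren't part of the script above or of any library. Nothing here talks to Word: it only finds the documents and works out where each text copy would go.

```python
import fnmatch
import os


def text_path_for(doc_path):
    """Return doc_path with its extension replaced by .txt."""
    return os.path.splitext(doc_path)[0] + ".txt"


def find_word_docs(root):
    """Yield the absolute path of every *.doc file in and below root."""
    for path, dirs, files in os.walk(root):
        for filename in sorted(files):
            # Lower-casing first makes the match case-insensitive on every platform.
            if fnmatch.fnmatch(filename.lower(), "*.doc"):
                yield os.path.abspath(os.path.join(path, filename))


if __name__ == "__main__":
    import sys
    for doc in find_word_docs(sys.argv[1]):
        print("%s -> %s" % (doc, text_path_for(doc)))
```

Run it with a directory as its only argument and it lists each document next to the name its text copy would get.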
My stepmother has been hit horribly by viruses, so I'm off to rebuild her PC this weekend. Sigh. I'll burn a CD with all her data before the rebuild, of course, but I'm worried that some or all of her documents might be infected - so I'll back them all up as text files, too, so that as a last resort, she'll still have all her work.
Update: As Ryan points out, it also requires a copy of Word. Thanks, Ryan: I should have mentioned that.
Another update: I cleaned this up a little and submitted it as a Python Cookbook recipe: Converting Word documents to text.
BTW, the rebuild was a nightmare. Win2K refused to recognise the modem, which is, of course, the worst thing that could possibly have happened. I ended up buying a new one. The whole job, including running Windows Update etc, ended up taking over nine hours!
Posted to Python by Simon Brunning at April 16, 2004 02:32 PM
You forgot to mention that it also requires Microsoft Office.
I can do the same thing in VB Script. And guess what! You don't need to install anything except MS Office.
Put the following text in a file with an extension of .vbs and you can run it using cscript.exe or wscript.exe (both come with Windows). The first arg is the doc file, the second arg is the text file.
Set wdApp = CreateObject("Word.application")
Set docNew = wdApp.Documents.Open(WScript.Arguments(0))
' 2 is wdFormatText, i.e. save as plain text
docNew.SaveAs WScript.Arguments(1), 2 , False, "", True, "", False, False, False, False, False
docNew.Close
wdApp.Quit
I usually just do:
strings filename | less
It doesn't require microsoft office.
I don't know VB Script at all. It looks like there's a line for line correspondence between the VB Script code you posted and the Python version - but you didn't replicate the whole script. Let's see the whole thing!
I have *no* idea what your command does. Does it really convert Word documents into usable text form? If so, then that's really impressive - but I suspect that it would just pull out too much.
Yeah... strings pulls out too much, but if you cut the head and tail (and provided the document is mostly text) you get fairly readable output. The method you posted is a good conversion if you are running Windows and have Word on the system. I always end up getting Word attachments at work, where I don't have access to a Windows PC, and that's where my method usually comes in handy. If the document looks important enough I'll sometimes open it up in OpenOffice, but it's surprising how often someone will attach a Word document containing only text to an email rather than sending plain text to begin with.
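For anyone without strings to hand, the same idea can be roughed out in Python. This is only a sketch of the technique, not a reimplementation of the real tool: the function names are made up for illustration, and it looks for plain ASCII runs only, so text that Word has stored in some other encoding will be missed.

```python
import re


def printable_runs(data, min_length=4):
    """Return the runs of printable ASCII (plus tabs) found in a bytes object."""
    pattern = re.compile(
        rb"[\x20-\x7e\t]{" + str(min_length).encode("ascii") + rb",}"
    )
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def strings_from_file(path, min_length=4):
    """Read a file as raw bytes and return its printable runs."""
    with open(path, "rb") as handle:
        return printable_runs(handle.read(), min_length)
```

Raising min_length is a crude way of cutting down the junk at the head and tail of a document.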
God, yes, tell me about it! Bloody Word attachments; there ought to be a law.
Rather snotty tone from the VB Script user, considering it has no ready recursive equivalent to os.walk.
Anyway, one of these days I will try to get Python to 'walk' through all my Outlook folders, and output to text. Do you know if anyone has already done it?
There's always Antiword for converting Word files to text: http://www.winfield.demon.nl/
Works great for those times you don't have Word, or Office, or Windows... or even python! ;)
I haven't heard of anything doing *exactly* what you are talking about, but it's probably worth looking at the SpamBayes Outlook Addin for examples of working with Outlook from Python.
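The walking part, at least, might look something like the sketch below. It assumes only that each folder object has a Name and a Folders collection, which is how Outlook's COM folders are laid out; walk_folders is a made-up name, and I haven't run this against a real mailbox.

```python
def walk_folders(folder, parents=()):
    """Yield (path, folder) for folder and every folder nested below it."""
    path = parents + (folder.Name,)
    yield "/".join(path), folder
    for child in folder.Folders:
        # Recurse into each sub-folder, carrying the path so far.
        for result in walk_folders(child, path):
            yield result
```

With Outlook installed, the starting folders would presumably come from win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI").Folders, and each folder's Items could then be written out as text.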
Thanks; another one for the toolbox...
That is the whole script. That's all you need to do the same thing you did.
Not trying to be snotty, just pointing out that it is the same basic technology and it isn't that amazing. Congratulations, Python is yet another scripting language that can take advantage of Windows COM.
*I'm* not trying to be snotty either, Ryan, but no it *doesn't* do the same thing. Your script converts a single document, mine does all documents in and below a given directory.
I need to read from both text files and Word documents from a VB application. Please help me with some code in VB.
Sorry, Prased, I don't know (or want to know) VB. Why not use Python? ;-)
I will show you an easier way:
app = win32com.client.Dispatch('Word.Application')
doc = app.Documents.Open('c:\\files\\mydocument.doc')
Please have a look at the Pywin32 Script Collection at http://www.win32com.de