Lumigent Log Explorer is a potential life saver, as I've said before, but there are some oddities with it. All the data is there, but we've had trouble getting exactly what we want from it in exactly the way that we want it.
No matter - it has a facility to export all your raw transactions to XML. Sorted! (Well, we had to fix a couple of very minor issues before the XML was well formed - & characters were not escaped to &amp;, and we needed to add an encoding declaration. Still, not far off.)
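For the curious, that sort of clean-up can be sketched in a few lines of Python. This is an illustration rather than what we actually ran: the helper name make_well_formed, the regular expression, and the ISO-8859-1 default encoding are all assumptions.

```python
import re

# An ampersand that does not already start an entity or character
# reference such as &amp;, &lt;, &#38; or &#x26;.
BARE_AMPERSAND = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)')


def make_well_formed(raw_xml, encoding="iso-8859-1"):
    """Escape bare ampersands and prepend an XML declaration if missing.

    Hypothetical helper; the default encoding is an assumption, so use
    whatever the export was really written in.
    """
    fixed = BARE_AMPERSAND.sub("&amp;", raw_xml)
    if not fixed.lstrip().startswith("<?xml"):
        fixed = '<?xml version="1.0" encoding="%s"?>\n%s' % (encoding, fixed)
    return fixed
```

Note that this takes the whole document as one string, which is fine for a sample but worth thinking about for a seventy MB file.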
I've not processed much XML, and not done any at all for a while. I was never really comfortable with the Python XML libraries that I'd played with, so I thought I'd give the effbot's ElementTree module a try.
The API is lovely. After no more than five or ten minutes, I felt like I knew what I was doing.
An example. A radically truncated version of the XML output from Log Explorer might look like this - oh-bugger-its-all-gone-a-bit-pete-tong-lets-hope-i-can-recover-the-data-from-this.xml. (The real thing was over seventy MB in size.) Code to loop through all the records, pull out the relevant details (including all the row's data from a sub-element) is as simple as this:
import cElementTree as ElementTree

# Parse XML...
tree = ElementTree.parse("oh-bugger-its-all-gone-a-bit-pete-tong-lets-hope-i-can-recover-the-data-from-this.xml")
root = tree.getroot()
for record in root:
    # Pull out tags
    timestamp = record.findtext('DATETIME')
    opcode = record.findtext('OPCODETXT')
    table = record.findtext('TABLENAME')
    rowdata = dict((column.tag, column.text or '') for column in (record.find('ROWDATA') or []))
    # Complex stuff here...
    print timestamp, opcode, table, rowdata
Nice, eh?
Clearly my code did something a bit more complex than just printing out the data, but you get the idea. In fact, I'm rather pleased with the script on the whole. It does an awful lot with not much code - Python's dictionaries, lists and string interpolation do most of the work.
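If you want to try the loop above without a Log Explorer export to hand, here is a self-contained sketch against the standard library's xml.etree.ElementTree, written for Python 3. The sample document is made up: the element names DATETIME, OPCODETXT, TABLENAME and ROWDATA come from the loop above, but the RECORDS and RECORD tags, the column names and all the values are assumptions, as is the read_records helper.

```python
import io
import xml.etree.ElementTree as ElementTree

# Made-up sample data; only the four upper-case field names are taken
# from the original loop, everything else is an assumption.
SAMPLE = b"""<RECORDS>
  <RECORD>
    <DATETIME>2005-02-03 17:45:01</DATETIME>
    <OPCODETXT>DELETE</OPCODETXT>
    <TABLENAME>customer</TABLENAME>
    <ROWDATA><id>42</id><name>Pete</name><notes/></ROWDATA>
  </RECORD>
  <RECORD>
    <DATETIME>2005-02-03 17:45:02</DATETIME>
    <OPCODETXT>COMMIT</OPCODETXT>
    <TABLENAME></TABLENAME>
  </RECORD>
</RECORDS>"""


def read_records(source):
    """Yield (timestamp, opcode, table, rowdata) for each child of the root."""
    root = ElementTree.parse(source).getroot()
    for record in root:
        # findall('ROWDATA/*') returns an empty list when there is no
        # ROWDATA element, so no "or []" fallback is needed.
        rowdata = {column.tag: column.text or ''
                   for column in record.findall('ROWDATA/*')}
        yield (record.findtext('DATETIME'), record.findtext('OPCODETXT'),
               record.findtext('TABLENAME'), rowdata)


records = list(read_records(io.BytesIO(SAMPLE)))
for record in records:
    print(*record)
```

The findall path also avoids testing the truth value of an element, which is what the "or []" idiom relies on.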
Performance? Now, I'm rather wary of venturing into benchmarking territory, so I'll just say that cElementTree goes like stink, and leave it at that.
Frankly, I'm almost always totally uninterested in benchmarks in any case. Software only has two speeds - fast enough, and not fast enough. cElementTree is comfortably in the fast enough range. Beyond that, I honestly couldn't care less.
Posted to Python by Simon Brunning at February 04, 2005 12:14 PM
I'm a dab hand at parsing XML using RPG if you ever feel the need.
Posted by: Steve on February 4, 2005 01:38 PM
Bet that's pretty. How many lines of code would it take to do the above? Care to post it? ;-)
Posted by: Simon Brunning on February 4, 2005 01:50 PM
footnote: for large files, "iterparse" is your friend.
for event, record in iterparse(...):
    if record.tag == "record":
        ... process record ...
        record.clear()
more here: http://effbot.org/zone/element-iterparse.htm
Posted by: Fredrik on February 4, 2005 06:54 PM
Mmmm..... Python code can be so elegant. The iterator comprehension works really well - I hadn't seen it "in the wild" before.
Posted by: Alan Green on February 4, 2005 10:40 PM
For some reason the following line gives an "invalid syntax" error. Are you sure that the dict(...) syntax you are using is correct?
"rowdata = dict((column.tag, column.text or '') for column in (record.find('ROWDATA') or []))"
Otherwise the example illustrates once again the power of Python.
Posted by: Adrian on February 8, 2005 11:45 AM
Works for me. Are you running Python 2.4? If not, you can't do generator expressions. You should be able to use a list comp instead. Try this (untested):
rowdata = dict([(column.tag, column.text or '') for column in (record.find('ROWDATA') or [])])
Posted by: Simon Brunning on February 8, 2005 11:59 AM
footnote 2: the "(record.find('ROWDATA') or [])" part can be written as "record.findall('ROWDATA/*')"
Posted by: Fredrik on February 8, 2005 12:38 PM
Thanks, Fredrik. And thanks for ElementTree!
Would using iterparse merely be quicker, or would it cut down on memory overhead?
The findall thing is nice.
Posted by: Simon Brunning on February 8, 2005 12:59 PM
"Would using iterparse merely be quicker, or would it cut down on memory overhead?"
Plain parse is faster than iterparse, but iterparse+test can be faster than parse+navigate, at least for simple structures. But the main advantage is memory use: by explicitly removing stuff when you don't need it any more (record.clear()), you can parse very large files using very little memory.
(but note that record.clear() still leaves empty record elements in the parent element, so if we're talking millions of records, you may want to use clear() on the parent instead of the records).
Posted by: Fredrik on February 8, 2005 07:02 PM
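Fredrik's iterparse advice from the comments can be sketched as a complete, runnable example. It is written for Python 3 and the standard library's xml.etree.ElementTree, and it assumes every record is a direct child of the root element. The RECORD tag name, the sample data and the count_opcodes helper are assumptions made for illustration. Clearing the root rather than each record follows his note about empty record elements piling up in the parent.

```python
import io
import xml.etree.ElementTree as ElementTree

# Made-up sample; the RECORD tag and the opcode values are assumptions.
SAMPLE = (b"<RECORDS>"
          b"<RECORD><OPCODETXT>INSERT</OPCODETXT></RECORD>"
          b"<RECORD><OPCODETXT>DELETE</OPCODETXT></RECORD>"
          b"<RECORD><OPCODETXT>INSERT</OPCODETXT></RECORD>"
          b"</RECORDS>")


def count_opcodes(source, record_tag="RECORD"):
    """Count records per OPCODETXT without keeping the whole tree in memory."""
    counts = {}
    context = ElementTree.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, element in context:
        if event == "end" and element.tag == record_tag:
            opcode = element.findtext("OPCODETXT")
            counts[opcode] = counts.get(opcode, 0) + 1
            # Clear the parent, not just the record, so that no empty
            # record elements are left behind in the root.
            root.clear()
    return counts


opcode_counts = count_opcodes(io.BytesIO(SAMPLE))
print(opcode_counts)
```

The source argument can be a filename instead of a file object, which is what you would want for the real export.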