.odt files are actually ZIP containers (you can see what's inside one using unzip -l document.odt). Within the container, the document content is in the content.xml file.
Script info source
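The same peek inside the container can be done from Python. This is only a minimal sketch: `list_odt_members` and `make_demo_odt` are names made up for illustration, and the demo archive is a tiny stand-in rather than a valid document.

```python
import os
import tempfile
import zipfile


def list_odt_members(odt_path):
    """Return the names of the files stored inside an .odt container."""
    with zipfile.ZipFile(odt_path, "r") as odt_file:
        return odt_file.namelist()


def make_demo_odt(odt_path):
    """Build a tiny stand-in archive (hypothetical, not a valid document)."""
    with zipfile.ZipFile(odt_path, "w") as odt_file:
        odt_file.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        odt_file.writestr("content.xml", "<placeholder/>")


demo_path = os.path.join(tempfile.mkdtemp(), "demo.odt")
make_demo_odt(demo_path)
members = list_odt_members(demo_path)
print(members)
```

On a real document, expect several more entries alongside content.xml, such as styles and metadata files.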
Here's what I've figured out about opening, modifying, and saving the content of an .odt file:
- To open the container for editing:
    # bash
    $ unzip path/container.odt content.xml
    $ unzip path/container.odt content.xml -d working/dir/path/
    # -d places content.xml in a different directory.
    # Creates a content.xml file where you want it.

    # python
    >>> import zipfile
    >>> odt_file = zipfile.ZipFile('path/to/file.odt', 'a')
    >>> # Options are 'r'ead only, 'w'rite a new archive, and 'a'ppend to existing
    >>> raw_xml = odt_file.read('content.xml')
    >>> # Reads content.xml in as a byte string, doesn't place a file.
- Modify the content.xml file by hand, with a script, or with Python:
    >>> # Tip for using Python: ElementTree is good at XML. ET.parse() expects a file,
    >>> # but ET.fromstring() parses the data read from the archive directly,
    >>> # so there is no need to fake a file with StringIO.
    >>> import xml.etree.ElementTree as ET
    >>> tree = ET.fromstring(raw_xml)   # Returns the root element of content.xml
    >>> # Make changes to your tree here.
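As one example of what "make changes" could look like, here is a hedged sketch that swaps a placeholder word inside paragraph elements. The function name `replace_in_paragraphs`, the NAME placeholder, and the sample XML are all invented for illustration; only the two namespace URIs come from the OpenDocument format itself.

```python
import xml.etree.ElementTree as ET

# Namespace of text:p and friends, in ElementTree's {uri}tag notation
TEXT_NS = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"


def replace_in_paragraphs(root, old, new):
    """Replace `old` with `new` in the leading text of every text:p element.

    Returns the number of paragraphs changed. Only element.text is touched,
    so text that follows a nested element (its .tail) is left alone.
    """
    changed = 0
    for paragraph in root.iter(TEXT_NS + "p"):
        if paragraph.text and old in paragraph.text:
            paragraph.text = paragraph.text.replace(old, new)
            changed += 1
    return changed


# A cut-down stand-in for content.xml (hypothetical sample data)
sample_xml = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<office:body><office:text>'
    '<text:p>Dear NAME,</text:p>'
    '<text:p>Nothing to change here.</text:p>'
    '</office:text></office:body>'
    '</office:document-content>'
)
root = ET.fromstring(sample_xml)
count = replace_in_paragraphs(root, "NAME", "Alice")
first_paragraph = root.find(".//" + TEXT_NS + "p").text
```

Real documents often split a sentence across nested text:span elements, so a placeholder may not sit in a single .text value; this sketch does not handle that case.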
- To restore the container with modified content:
    # bash
    $ zip -j path/container.odt working/dir/path/content.xml
    # The -j flag stores the file as 'content.xml' instead of the useless 'working/dir/path/content.xml'. You need this!
    $ rm working/dir/path/content.xml   # Clean up

    # python
    >>> new_xml = ET.tostring(tree)   # If you're exporting from an ElementTree
    >>> odt_file.writestr('content.xml', new_xml)
    >>> # In 'a' mode this appends a second content.xml entry; the original stays in the archive.
    >>> odt_file.close()
- Putting it all together in bash:
    cd path/to/working/directory
    cp path/to/template_filename.odt working_file.odt
    unzip working_file.odt content.xml
    # Change the xml...somehow
    zip -j working_file.odt content.xml
    rm content.xml
- Putting it all together in python:
    def edit_the_odt_content(template_filename):
        """Exposes the content of an .odt file so you can modify it."""
        import shutil, zipfile, xml.etree.ElementTree as ET
        shutil.copyfile(template_filename, 'working_file.odt')  # Copy the template into a working file
        odt_file = zipfile.ZipFile('working_file.odt', 'a')
        xml_string = odt_file.read('content.xml')     # Read the zipped content.xml within the .odt
        tree = ET.fromstring(xml_string)              # Parse it; returns the root element
        office_namespace = '{urn:oasis:names:tc:opendocument:xmlns:office:1.0}'
        body = tree.find(office_namespace + 'body')   # Search the tree to find the elements you want to change
        text = body.find(office_namespace + 'text')
        new_text = your_function_to_modify_the_xml(text)  # You can now change the XML any way you wish
        body.remove(text)                             # Replace the old XML with the new
        body.append(new_text)
        new_xml = ET.tostring(tree)                   # Convert the modified ElementTree back into XML
        odt_file.writestr('content.xml', new_xml)     # In 'a' mode this appends a second content.xml entry
        odt_file.close()                              # Close the zip archive (important!)
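One caveat with append mode: writestr() on a ZipFile opened with 'a' adds a second content.xml entry instead of replacing the first, and Python warns about the duplicate name. A possible alternative, sketched here under the assumption that rewriting the whole archive is acceptable, is to copy every entry into a fresh archive and swap in the new content.xml along the way. The name `replace_content_xml` and the demo archive are hypothetical.

```python
import os
import tempfile
import zipfile


def replace_content_xml(odt_path, new_xml):
    """Rewrite odt_path so that it holds exactly one content.xml entry."""
    folder = os.path.dirname(os.path.abspath(odt_path))
    handle, tmp_path = tempfile.mkstemp(suffix=".odt", dir=folder)
    os.close(handle)
    seen = set()
    with zipfile.ZipFile(odt_path, "r") as source, \
            zipfile.ZipFile(tmp_path, "w") as target:
        for item in source.infolist():
            if item.filename in seen:
                continue  # Skip duplicates left behind by earlier appends
            seen.add(item.filename)
            if item.filename == "content.xml":
                data = new_xml
            else:
                data = source.read(item.filename)
            # Reusing the original ZipInfo keeps each entry's name, order
            # and compression method (mimetype stays first and uncompressed)
            target.writestr(item, data)
    os.replace(tmp_path, odt_path)


# Demo on a tiny stand-in archive (not a valid document)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.odt")
with zipfile.ZipFile(demo_path, "w") as demo:
    demo.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    demo.writestr("content.xml", "<old/>", compress_type=zipfile.ZIP_DEFLATED)

replace_content_xml(demo_path, b"<new/>")

with zipfile.ZipFile(demo_path, "r") as demo:
    names = demo.namelist()
    content = demo.read("content.xml")
    mimetype_compression = demo.getinfo("mimetype").compress_type
```

Writing to a temporary file in the same directory and then calling os.replace() means the original document is only overwritten once the new archive is complete.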
- Bug: Don't use the zip -m flag! It looks handy, claiming to delete the content.xml file from your file system after adding it to the archive... but instead it can unpredictably delete the file without adding it to the archive.
- You can avoid the whole "containers" muddle by saving an OpenOffice document as a flat file (.fodt). There's no zipping or unzipping, just open the file in an editor; it's XML already. Open the modified .fodt with OpenOffice, and your document is right there. Be sure your version of OpenOffice supports .fodt before using it, though. The one on my Mac doesn't, for example.
- OpenOffice also has its own scripting interface, called UNO, with bindings for Python and C++ among other languages. However, I haven't taken time to dig around through it.
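For the flat-file (.fodt) route mentioned above, the edit can be scripted without any zip handling at all. The following is a rough sketch, assuming the flat file keeps its paragraphs under office:body/office:text as content.xml does; `append_paragraph` and the demo file are made up for illustration. ElementTree renames any namespace prefix that is not registered (ns0, ns1, ...), so only the two prefixes used here are preserved, and whether OpenOffice is happy with the renamed ones in a full document is untested here.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

OFFICE_URI = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_URI = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
# Keep the usual prefixes on output for these two namespaces
ET.register_namespace("office", OFFICE_URI)
ET.register_namespace("text", TEXT_URI)


def append_paragraph(fodt_path, sentence):
    """Append one text:p paragraph to the body of a flat .fodt file, in place."""
    tree = ET.parse(fodt_path)
    body_text = tree.getroot().find("{%s}body/{%s}text" % (OFFICE_URI, OFFICE_URI))
    paragraph = ET.SubElement(body_text, "{%s}p" % TEXT_URI)
    paragraph.text = sentence
    tree.write(fodt_path, encoding="UTF-8", xml_declaration=True)


# Demo on a minimal stand-in file (hypothetical, far smaller than a real .fodt)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fodt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<office:document xmlns:office="%s" xmlns:text="%s"'
        ' office:mimetype="application/vnd.oasis.opendocument.text">'
        '<office:body><office:text><text:p>First.</text:p></office:text></office:body>'
        '</office:document>' % (OFFICE_URI, TEXT_URI)
    )

append_paragraph(demo_path, "Second.")
paragraphs = [p.text for p in ET.parse(demo_path).getroot().iter("{%s}p" % TEXT_URI)]
```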