Showing posts with label openoffice. Show all posts

Wednesday, March 4, 2009

Bash and Python scripts to unzip and modify an OpenOffice .odt document

.odt files are actually zip containers (you can list the contents of one using unzip -l document.odt). Within the container, the document's content is in the content.xml file.
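
The same listing can be done from Python. This is a minimal sketch, not part of the original recipe: list_odt_entries is a made-up helper name, and the archive is a tiny stand-in built in memory so the example runs without a real document.

```python
import io
import zipfile

def list_odt_entries(odt_file):
    """Return the names of the files stored inside an .odt container."""
    with zipfile.ZipFile(odt_file) as container:
        return container.namelist()

# Build a tiny stand-in archive in memory (hypothetical contents).
# A real .odt holds more entries, such as styles.xml and META-INF/manifest.xml.
demo = io.BytesIO()
with zipfile.ZipFile(demo, 'w') as archive:
    archive.writestr('mimetype', 'application/vnd.oasis.opendocument.text')
    archive.writestr('content.xml', '<doc/>')
demo.seek(0)

entries = list_odt_entries(demo)
print(entries)
```

With a real document you would pass the path, for example list_odt_entries('document.odt').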


Here's what I've figured out about opening, modifying, and saving the content of an .odt file:

  • To open the container for editing:
    # bash
    $ unzip path/container.odt content.xml
    $ unzip path/container.odt content.xml -d working/dir/path/
        # -d places content.xml in a different directory.
        # Creates a content.xml file where you want it.
    
    # python
    >>> import zipfile
    >>> odt_file = zipfile.ZipFile('path/to/file.odt', 'a')
    >>>     # Modes are 'r' (read only), 'w' (write a new archive, replacing any existing file), and 'a' (append to existing)
    >>> raw_xml = odt_file.read('content.xml')
    >>>     # Reads content.xml in as a string (bytes in Python 3), doesn't place a file.
    

  • Modify the content.xml file by hand, or using a script, or using Python.
    >>> # Tip for using python: ElementTree is good at XML. ET.parse() wants a file or a filename,
    >>> # but ET.fromstring() parses a string directly, so no temporary file is needed.
    
    >>> import xml.etree.ElementTree as ET
    >>> tree = ET.fromstring(raw_xml)           # Parse the string; returns the root element
    >>> # Make changes to your tree here.
    

  • To restore the container with modified content:
    # bash
    zip -j path/container.odt working/dir/path/content.xml
        # The -j flag adds the file 'content.xml' instead of the useless 'path/content.xml'. You need this!
    rm working/dir/path/content.xml    # Clean up
    
    # python
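    >>> new_xml = ET.tostring(tree) # If you're exporting from an ElementTree
    >>> odt_file.writestr('content.xml', new_xml)
    >>>     # Caution: in 'a' mode this adds a second content.xml entry next to the old one (Python warns about a duplicate name)
    >>> odt_file.close()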
    
  • Putting it all together in bash:
    cd path/to/working/directory
    cp path/to/template_filename.odt working_file.odt
    unzip working_file.odt content.xml
    
    # Change the xml...somehow
    
    zip -j working_file.odt content.xml
    rm content.xml
    
  • Putting it all together in python:
    def edit_the_odt_content(template_filename):
        """Exposes the content of an .odt file so you can modify it."""
        import shutil, zipfile, xml.etree.ElementTree as ET
    
        shutil.copyfile(template_filename, 'working_file.odt') # Copy the template into a working file
        odt_file = zipfile.ZipFile('working_file.odt','a')
        xml_string = odt_file.read('content.xml')              # Read the zipped content.xml within the .odt as a string
        tree = ET.fromstring(xml_string)                       # Parse the string; returns the root element
    
        office_namespace = '{urn:oasis:names:tc:opendocument:xmlns:office:1.0}'
        body = tree.find(office_namespace + 'body')            # Search the tree to find the elements you want to change
        text = body.find(office_namespace + 'text')
        new_text = your_function_to_modify_the_xml(text)       # You can now change the XML any way you wish
        body.remove(text)                                      # Replace the old XML with the new
        body.append(new_text)
    
        new_xml = ET.tostring(tree)                            # Convert the modified ElementTree back into an XML string
        odt_file.writestr('content.xml', new_xml)              # Append the string as a new content.xml entry (the old entry stays in the archive)
        odt_file.close()                                       # Close the zip archive (important!)
        return
    
  • Bug: Don't use the zip -m flag! It looks handy, claiming to delete the content.xml file from your file system after adding it to the archive...but instead it will unpredictably delete without adding to the archive.

  • You can avoid the whole "containers" muddle by saving an OpenOffice document as a flat file (.fodt). There's no zipping or unzipping, just open the file in an editor - it's xml already. Open the modified .fodt with OpenOffice, and your document is right there. Er, be sure your version of OO supports .fodt before using it. My Mac doesn't, for example.

  • OpenOffice also has its own scripting interface, called UNO, with bindings for Python and C++. However, I haven't taken time to dig around through it.
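
The Python steps above open the archive in 'a' mode, so the old content.xml stays in the zip alongside the new one. One way around that is to write a fresh archive and copy everything except content.xml across. This is a minimal sketch, not a tested replacement for the recipe: replace_content_xml and the in-memory stand-in archive are made-up names for illustration. Copying the entries in their original order keeps the mimetype entry first and uncompressed, as the OpenDocument format expects.

```python
import io
import zipfile

def replace_content_xml(source_odt, target_odt, new_xml):
    """Copy every entry of source_odt into target_odt, swapping in new_xml for content.xml."""
    with zipfile.ZipFile(source_odt, 'r') as source:
        with zipfile.ZipFile(target_odt, 'w', zipfile.ZIP_DEFLATED) as target:
            for item in source.infolist():
                if item.filename == 'content.xml':
                    target.writestr('content.xml', new_xml)
                else:
                    # Passing the ZipInfo keeps each entry's original compression setting.
                    target.writestr(item, source.read(item.filename))

# Stand-in template built in memory (hypothetical contents, not a real document).
template = io.BytesIO()
with zipfile.ZipFile(template, 'w') as archive:
    archive.writestr('mimetype', 'application/vnd.oasis.opendocument.text')
    archive.writestr('content.xml', '<old/>')
template.seek(0)

result = io.BytesIO()
replace_content_xml(template, result, '<new/>')

# Check that the new archive holds exactly one content.xml, with the new text.
with zipfile.ZipFile(result) as check:
    names = check.namelist()
    content = check.read('content.xml')
print(names, content)
```

With real files you would pass paths instead of the in-memory buffers, for example replace_content_xml('template.odt', 'working_file.odt', new_xml).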

Friday, February 20, 2009

Using Python to reformat the xml within .odt files

A quick Python 2.x script to copy, unzip, and uniformly reformat the XML of an .odt file. It adds indentation and line breaks. Useful for debugging my invoicing script, which muddles with the xml files, because it makes the files diff-able and easier to read and search.

#!/usr/bin/env python
import os
import shutil
import zipfile
import xml.etree.ElementTree as ET
odt_path_and_file = 'path/to/file.odt'

# This function was copied from http://effbot.org/zone/element-lib.htm
def indent(elem, level=0):
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

odt_filename = os.path.basename(odt_path_and_file)
# splitext drops only the '.odt' extension from the name
folder_name = os.path.join('Desktop', os.path.splitext(odt_filename)[0])
if os.path.isdir(folder_name):
    shutil.rmtree(folder_name)  # Delete any old working files
os.makedirs(folder_name)
shutil.copy(odt_path_and_file, folder_name)
archive = zipfile.ZipFile(os.path.join(folder_name, odt_filename))
archive.extractall(folder_name)  # Unzip into the working folder
archive.close()
# Only the .xml files at the top level of the working folder get reformatted
file_list = [filename for filename in os.listdir(folder_name) if filename.endswith('.xml')]
for xml_file in file_list:
    xml_path = os.path.join(folder_name, xml_file)
    print ('Parsing ' + xml_path)
    tree = ET.parse(xml_path)
    indent(tree.getroot())
    tree.write(xml_path)
    print ('Completed ' + xml_file)
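
Once the XML is indented, two versions can be compared line by line. The sketch below shows the idea and is not part of the original script. It assumes Python 3.9 or later, where ElementTree has its own ET.indent(). The helper pretty_lines and the two sample documents are made up for illustration.

```python
import difflib
import xml.etree.ElementTree as ET

def pretty_lines(xml_text):
    """Parse xml_text, indent it, and return the result as a list of lines."""
    root = ET.fromstring(xml_text)
    ET.indent(root)  # Available in Python 3.9 and later
    return ET.tostring(root, encoding='unicode').splitlines()

# Made-up sample documents; real input would be two content.xml strings.
old_xml = '<doc><p>Invoice 41</p><p>Total: 10</p></doc>'
new_xml = '<doc><p>Invoice 42</p><p>Total: 10</p></doc>'

# unified_diff yields header lines plus one '-' and one '+' line per changed line.
diff = list(difflib.unified_diff(pretty_lines(old_xml), pretty_lines(new_xml),
                                 'old', 'new', lineterm=''))
print('\n'.join(diff))
```

Without the indentation step, each document would be a single long line and the diff would flag the entire file as changed.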

Friday, January 2, 2009

Web Scraper for string prices

I successfully tested a web scraper in python. It scrapes about 20 web pages for the prices of violin strings, then puts the prices in an OpenOffice document for handy printing. It is structured so I can add other output formats, and I could add an XML file to track prices over time or just print changes.

I'm installing it on the store iMac, and setting it as a daily recurring job. The finished file just pops onto the desktop, marked with the date.

A future version may compare prices from multiple sites.

The script and template live in the standard location for user scripts, /Users/username/Library/Scripts/scriptname/

Saturday, April 5, 2008

Converting bitmap images to vector graphics

Our sign contractor for the store created some beautiful graphics - but the disk they gave us was all .jpg and bitmap .pdf and even a bitmap .ai file. Limited usefulness for reuse unless we can convert them to vector graphics - bitmaps are big and pixellated, vectors are small and infinitely scalable smoothly.

Used Inkscape. Turned out to be incredibly easy. Import the bitmap, then Path -> Trace Bitmap. Next, File -> Document Properties to crop the page area (so it doesn't save as one logo in the corner of an empty sheet of paper). Save the converted file.

Next problem: Inkscape saves as .svg, but OpenOffice can't open it. Easy to fix - instead of .svg, have Inkscape save as .odg, an OpenOffice drawing file. I love when things work out.

Monochrome and color laser prints of the converted graphics are great - no pixels, smooth and clean edges. Interesting: The bitmaps look better on screen, but the vector graphics are superior on paper.