The mq HOWTO: Batch editing of metadata

Some editing tasks can be carried out with typical metadata editors, some are easier with plain text editors, and some require so much repetitive decision-making that it makes sense to write a program to make the changes.
The last kind are tasks where you need to make a decision for each record, but you know that the decision is determined by information already in the metadata.

Some specific examples:

  1. ArcCatalog puts a separate Metadata_Extensions element into the metadata every time you import it. So if you export, then re-import a couple of times, you will have the same elements repeated. Solution: open each record, find the first Metadata_Extensions if it exists. Since there may be more than one of these, look for one that specifies the ESRI extensions; it will have a Profile_Name of ESRI Metadata Profile. Now look at all of the other Metadata_Extensions elements; if any of them have the same value for Profile_Name, delete them.

    This is an operation you could do with a text editor or many of the typical metadata editors like Tkme. But it's mindless in the sense that you can specify the steps exactly, and it doesn't require you to think specifically about each record beyond following the procedure.

  2. Somebody leaves your organization. You need to find all of the metadata records that refer to the ex-employee's contact information and replace those Contact_Information sections with the contact info for somebody else. Contact_Information occurs in four different places: Point_of_Contact, Process_Contact, Distributor, and Metadata_Contact. To make things more complex, Process_Contact occurs within the repeatable element Process_Step so there may be many of those.
  3. People write Series_Name in all kinds of ways. For many publications, there's an official series name that really should be the one that people use. But you inherited this job from the person before you (see example 2 above), so you can't go back in time and have all of the metadata written right to begin with; you have to fix what's there. Citation_Information occurs in three different places: Citation, Cross_Reference, and Source_Citation; of these only Citation is not repeatable. Furthermore, you might need to look at what people have already put into Series_Name so that you can decide which non-preferred names go with each preferred series name. So first you write a script that collects all of the existing Series_Name values, then you rearrange that list so that you know which are preferred and which aren't. Then you go back through the records and replace the wrong names with the right ones.
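
The last two steps of example 3 amount to a lookup table from non-preferred names to preferred ones. The following is only a sketch in plain Tcl (the scripting language used in the rest of this HOWTO); the table entries and the procedure name preferred_series are made up for illustration, and reading or writing the actual metadata is left out.

```tcl
# Hypothetical lookup table: each non-preferred spelling is followed by
# the preferred series name. The entries are illustrative only.
array set preferred {
    "USGS Open-File Report" "U.S. Geological Survey Open-File Report"
    "Open File Report"      "U.S. Geological Survey Open-File Report"
    "USGS Prof. Paper"      "U.S. Geological Survey Professional Paper"
    }

# Return the preferred form of a series name. A name that is not in
# the table is returned as given, with surrounding whitespace trimmed.
proc preferred_series {name} {
    global preferred
    set key [string trim $name]
    if {[info exists preferred($key)]} {
        return $preferred($key)
        }
    return $key
    }

puts [preferred_series "Open File Report"]
puts [preferred_series "Bulletin"]
```

The first call prints the preferred name from the table; the second prints Bulletin unchanged because it has no entry. In practice the table would be built from the list of Series_Name values collected in the first pass.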

Scripted editing of metadata

Because I encountered many of these problems, I turned to a scripting language to help me solve them. Using the code that runs Tkme and mp, I wrote mq, an extension to Tcl/Tk that allows me to read and write metadata using Tcl commands. I chose Tcl because I could figure out how to do this for that language. It's not an extremely complicated language, but it is a little weird in some ways. It works pretty well for this.
In order to use this method, you need to (1) install Tcl/Tk on your system; (2) install mq so that Tcl/Tk can find it; and (3) write your Tcl scripts so that they use mq. Here are the details; I'll assume you're using Microsoft Windows, but it would work as well (or better) on Linux:
  1. Go to http://www.activestate.com/Products/activeTcl/ and click Download; follow their instructions to install Tcl. I recommend installing it in C:\Tcl
  2. Go to https://geology.usgs.gov/tools/metadata/ and download the complete package for Windows, all_win.exe. Unpack this in C:\USGS.
  3. Open a command prompt window, and execute the following commands:
    cd C:\USGS\tools\bin
    copy mq26.dll C:\Tcl\lib
    mkdir C:\Tcl\lib\mq
    copy pkgIndex.tcl C:\Tcl\lib\mq
    
  4. In a command prompt window, type "tclsh". You should get a percent sign as the prompt. Then type "package require mq". It should give you the version number of mq, which is 2.6.12 at this writing. You can now write Tcl scripts that edit metadata.
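
The check in step 4 can also be put in a small script, which is handy when you want to confirm the installation on several machines. This is a minimal sketch using only standard Tcl commands; the procedure name have_package is my own invention, not part of mq.

```tcl
# Try to load a package. catch returns nonzero when package require
# fails, and in that case result holds the error message.
proc have_package {name} {
    if {[catch {package require $name} result]} {
        puts "$name is not available: $result"
        return 0
        }
    # On success, package require returns the version that was loaded.
    puts "$name version $result is loaded"
    return 1
    }

have_package mq
```

If mq is reported as not available, typing "puts $auto_path" in tclsh shows the directories Tcl searches for packages.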

Simple scripts

Tcl is an interpreted language. It is possible to run tclsh and simply type commands, which are executed immediately. For example:
C:\> tclsh
% puts "Hello, World!"
Hello, World!
% expr 2 + 2
4
% expr 355.0 / 113
3.14159292035
% exit
(To quit the Tcl interpreter, enter "exit"; "quit" doesn't do it.)
Programming with Tcl is not difficult, but there are things that must be kept in mind. The most challenging aspect is how quotes, braces, and brackets are interpreted. In particular, anything inside brackets [like this] gets executed and is replaced by its result. So in the example puts "Have some [expr 355.0 / 113]", the text between the brackets (along with the brackets themselves) is replaced by "3.14159292035", and that statement prints Have some 3.14159292035. See the Tcl Tutorial at http://www.tcl.tk/man/tcl8.5/tutorial/tcltutorial.html; there are a number of good books on Tcl as well.
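
The difference between quotes, braces, and brackets is easiest to see by trying it. A short sketch (the variable names are arbitrary):

```tcl
set n 3

# Inside double quotes, $variables and [commands] are substituted.
set quoted "n is $n, n squared is [expr {$n * $n}]"

# Inside braces, nothing is substituted; the text is kept literally.
set braced {n is $n, n squared is [expr {$n * $n}]}

# A backslash makes the next character literal inside double quotes.
set escaped "a literal \[bracket\] and a literal \$ sign"

puts $quoted    ;# n is 3, n squared is 9
puts $braced    ;# n is $n, n squared is [expr {$n * $n}]
puts $escaped   ;# a literal [bracket] and a literal $ sign
```

This is why the if statements later in this HOWTO put their tests in braces: the test is handed to if unevaluated, and if evaluates it itself.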

Scripts using mq to edit metadata

  1. Basic metadata operations
    metadata m -parse c:/USGS/tools/doc/metadata/mp.met
    
    Returns 1 if mq managed to find and read the metadata record, 0 otherwise. So you can write
    if {[metadata m -parse c:/USGS/tools/doc/metadata/mp.xml]} {
        # do some things with this record
        }
    
    Notice that in this case I'm reading the XML version of the file; like mp, mq can read indented text or XML or SGML. Notice also that we use forward slashes for directory separators, not backslashes (use / instead of \), because Tcl treats the backslash as an escape character. If this works, we use the variable m to work with the metadata. For example, to find the Identification_Information, we write
    $m find_first Identification_Information
    
    What we get from this is the address of the Identification_Information. Normally we need to store that address in a variable:
    set p [$m find_first Identification_Information]
    
    We don't manipulate that address directly, but we pass it to other mq functions, for example, to find an element within another:
    set q [$m find_in $p Cross_Reference]
    
    This will return the address of the first Cross_Reference element within the element p; the address of the Cross_Reference element is stored in the variable q.
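    It's important to check the results of your searches, because a search that finds nothing gives you a value that tests false, not an address you can pass to other commands: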
    set p [$m find_first Identification_Information]
    if {$p} {
        set q [$m find_in $p Cross_Reference]
        }
    
    Since the Tcl set command returns the value you're setting, you can use it as a test in an if statement. This produces a more compact form:
    if {[set p [$m find_first Identification_Information]]} {
        if {[set q [$m find_in $p Cross_Reference]]} {
            # do something with the cross reference that is now in q
            }
        }
    
    To get one of the element's values:
    if {[set p [$m find_first Access_Constraints]]} {
        # Get the text of Access_Constraints, store in variable t
        set t [$m value_of $p]
        # print the value
        puts "$t"
        }
    
    To set an element's value:
    if {[set p [$m find_first Access_Constraints]]} {
        # Set the text of Access_Constraints to "None"
        $m value_set $p "None"
        }
    
    If you've made changes and you want to keep them, write the file. If you don't specify a file name, mq overwrites the original file. You can save the data in a format different from the one you read (so mq can be used to convert from indented text to XML and back again).
    $m write
    $m write my_copy.xml
    
    When you're working with many files, don't keep track of them all at once unless you really have to; instead, forget about each one as soon as you're done getting information from it:
    $m forget
    
    There are many more operations you can carry out on metadata. See mq's reference manual for a complete list.
  2. A simple script to open a metadata record and print the Title
    # Open the metadata record
    metadata m -parse C:/USGS/tools/doc/metadata/mp.met
    
    # Find the Citation
    if {[set p [$m find_first Citation]]} {
    
        # Find the title within the Citation.
        if {[set q [$m find_in $p Title]]} {
    
            # Get the value of the title and print it out.
            set t [$m value_of $q]
            puts "$t"
            } \
        else {
            # If no Title exists, complain.
            puts "This record has no title."
            }
        }
    # When done, forget about this record.
    $m forget
    
  3. A script that shows all of the Place_Keywords
    # Open the metadata record
    metadata m -parse C:/data/catfish.xml
    
    # Find the Keywords section.  There's only one of these.
    if {[set p [$m find_first Keywords]]} {
    
        # Look for Place within Keywords.  This can be repeated.
        if {[set q [$m find_in $p Place]]} {
    
            # If there is one Place, there may be more of them.
            while {$q} {
                # Find and print the Place_Keyword_Thesaurus, if any
                if {[set r [$m find_in $q Place_Keyword_Thesaurus]]} {
                    # Get the value and print it out.
                    set t [$m value_of $r]
                    puts "Place terms from thesaurus $t"
                    } \
                else {
                    puts "No Place_Keyword_Thesaurus within this Place!"
                    }
    
                # Look for Place_Keywords within this Place.  Repeats.
                set u [$m find_in $q Place_Keyword]
                while {$u} {
                    # Get the value and print it out.
                    set t [$m value_of $u]
                    puts "  $t"
    
                    # Find the next Place_Keyword.
                    set u [$m find_next $u]
                    }
    
                # Find the next Place section
                set q [$m find_next $q]
                }
            }
        }
    # When done, forget about this record.
    $m forget
    
  4. Recursing directories
    proc process_directory {long_name short_name} {
        recurse $long_name
        }
    
    proc process_file {long_name short_name} {
        if {[string compare [file extension $short_name] .met] == 0} {
            # operations on each metadata record go here
            }
        }
    
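    proc recurse {dir} {
        global root
        # -nocomplain keeps glob from raising an error in an empty directory
        foreach long_name [lsort [glob -nocomplain -directory $dir *]] {
            # short_name is the path relative to root; unlike scan with %s,
            # string range keeps names that contain spaces intact
            set short_name [string range $long_name [expr {[string length $root] + 1}] end]
            if {[file isdirectory $long_name]} {
                process_directory $long_name $short_name
                } \
            else {
                process_file $long_name $short_name
                }
            }
        }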
    
    set root [pwd]
    set short_name .
    recurse $root
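
The pieces above can be combined into one script that prints the Title of every .met file under a directory. What follows is only a sketch, not the one right way to do it: it uses only the mq commands shown earlier in this HOWTO, the procedure names find_met_files and print_title are my own, and it assumes that "metadata m -parse" sets the variable m inside a procedure the same way it does at the top level.

```tcl
# Return a list of every file under dir, at any depth, whose
# extension is .met. This part is plain Tcl and does not need mq.
proc find_met_files {dir} {
    set found {}
    foreach name [lsort [glob -nocomplain -directory $dir *]] {
        if {[file isdirectory $name]} {
            foreach f [find_met_files $name] {
                lappend found $f
                }
            } \
        elseif {[string equal [file extension $name] .met]} {
            lappend found $name
            }
        }
    return $found
    }

# Print the file name and Title of one record.
# Assumption: the variable m is set in this procedure's scope.
proc print_title {file} {
    if {[metadata m -parse $file]} {
        if {[set p [$m find_first Citation]] && [set q [$m find_in $p Title]]} {
            puts "$file: [$m value_of $q]"
            } \
        else {
            puts "$file: no Title"
            }
        # Done with this record, so forget it before reading the next.
        $m forget
        } \
    else {
        puts "$file: could not be read"
        }
    }

# Do the work only when a directory is given on the command line.
if {[info exists argv] && [llength $argv] == 1} {
    package require mq
    foreach f [find_met_files [lindex $argv 0]] {
        print_title $f
        }
    }
```

If the script is saved as titles.tcl (a hypothetical name), run it with "tclsh titles.tcl C:/data". Collecting the file names first keeps the directory walk separate from the metadata work, so the same find_met_files can be reused for other batch edits.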
    

Technical contact:

    Peter N. Schweitzer
    Mail Stop 954, National Center
    U.S. Geological Survey
    Reston, VA 20192

    Tel: (703) 648-6533
    FAX: (703) 648-6252
    Email: pschweitzer@usgs.gov