U.S. Geological Survey

USGS NSDI Training for creation of formal metadata
Workshop session two

Exercise 2: Pre-processing metadata with cns

This exercise introduces the practical use of cns, using an example metadata record that conforms well to the FGDC standard but is not formatted properly for input to mp. In each step below, type the text that follows the prompt.

  1. Log in to the system
    login: meta01
    password:
    
    If you're using the standard Bourne shell, the system prompt will be a dollar sign ($); if you're running the C shell or one of its variants, it will be a percent sign (%).

  2. Look at the files in your home directory
    $ ls -laF
    
  3. Find the directory examples/kygs

    $ cd examples/kygs
    $ ls -laF
    
  4. Find the file hazard; this is Digital Geology of the Hazard 30' x 45' Quadrangle, Kentucky

    $ ls -laF hazard
    
  5. Determine whether cns is properly installed
    $ cns
    Chew and Spit, v 1.4  19970610
    Usage: cns [-c config_file] [-i info_file] [-a aliases] [-e leftovers] [-o output_file] input_file
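
If typing cns produces a "command not found" message instead of the usage text, the program may simply be missing from your PATH. The following is a minimal sketch, not part of the original exercise, that looks for both cns and mp with the POSIX command -v builtin. The function name check_tools is hypothetical.

```shell
# Report whether each named program can be found on the current PATH.
# check_tools is a hypothetical helper, not part of the cns/mp tools.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: not found on PATH"
            missing=1
        fi
    done
    return "$missing"
}

# Check the two programs used in this exercise.
check_tools cns mp || echo "Ask your instructor to check the installation."
```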
    
  6. Parse the metadata file with mp, displaying errors on the screen
    $ mp hazard
    
    Ouch! 1179 errors!

  7. The screen wraps the text and there are many messages. Redirect them to a file so that you can examine them more closely with a text editor.
    $ mp hazard -e hazard.err
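
Once the messages are in a file, a quick tally can show how many are errors and how many are warnings before you open an editor. This is a rough sketch that assumes each message in hazard.err begins with a tag word such as Error or Warning. The exact layout of the messages is an assumption here, and the function name tally_messages is made up for illustration.

```shell
# Count the messages in an error file, grouped by their first word.
# Assumption: every message line starts with a tag such as "Error" or
# "Warning"; adjust the awk field if your error file is laid out differently.
tally_messages() {
    awk 'NF { count[$1]++ }
         END { for (tag in count) print count[tag], tag }' "$1" | sort -rn
}

# Summarize the file written in step 7, if it exists.
if [ -f hazard.err ]; then
    tally_messages hazard.err
fi
```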
    
  8. Look at the error file and the metadata using a text editor.
    $ xed hazard.err hazard
    
    In xed, the top line of the window is a "status line" that shows what line you're on, what column your cursor is in, whether you're in Insert and Autoindent modes, the width to which tabs will be expanded, and the name of the file you're editing (a '+' before the name means you've made changes to the file).

  9. Note that mp gives a lot of warnings about ambiguous indentation along with messages like "element Originator found in textual value, some information may be lost". Despite their being tagged as warnings, these messages indicate serious problems with the file's format and must not be ignored.

    1. Click the mouse on the status line to bring up the menus.

    2. Click File to open the File menu, and click Next file twice to switch to the next file. The file hazard appears.

    3. The indentation in the file is inconsistent. For example, the text of the abstract is flush with the left margin; it should be indented more than the element name Abstract. Similar problems abound throughout the file. However, the elements are generally in the right order and the values appear to be reasonable.

    4. You could attempt to fix this file with a text editor. Knowing the standard well, you could add indentation and, where needed, extra container elements such as the Citation_Information that should be inserted between lines 2 and 3. With an iterative process of editing and running mp, you can amend the format of this file to allow mp to analyze it more sensibly.

      There is a better way.

  10. cns was designed to help people rearrange files like this. Its job is to guess the structure of the metadata from the occurrence of recognizable elements in the file, setting aside those parts it cannot understand, and providing a detailed record of its analysis.

  11. Run cns, specifying the input file and the names of the output files for leftovers, information, and the cleaned-up metadata.
    $ cns hazard -e leftovers -i info -o output
    
  12. Examine the input and output files
    $ xed leftovers info output hazard
    
    1. Each line in the leftovers file has a number showing what line of the input file hazard the information came from.

    2. Look for groups of lines. In this case we see lines 12 through 17 in the leftovers file. Switch to the info file, and for lines 12 through 17 we see the message "text could not be placed".

    3. Switch to the input file hazard and look at lines 12-17. These look like standard metadata, but the elements shown here, Principal_Investigator and Digital_Compilers, aren't standard metadata elements.

    4. We'll assume that these elements are intended as extensions of the FGDC standard, so we'll create a file that lets cns know their names and where they should appear in the metadata. This information is stored in a separate file that cns, xtme, and mp all know how to read.

    5. We won't get into the structure of the extensions file in detail here. You'll find the right file in the same directory. It is called kygs.ext. To open it in xed, select Open from the File menu, press Enter, and select kygs.ext from the popup list.

    6. Notice that the extensions file also describes the extensions Coverage_name and Coverage_description. Look at the leftovers file again and find these element names. Which lines of the input file will be properly recognized when these elements are made known to cns?

    7. cns needs to be told where the extensions file is, and this is done through a configuration file. Since these extensions are the only unusual thing we need to tell cns about, our configuration file will contain only a reference to the extensions file. Open the file kygs.cfg to see how this is specified.

    8. Close all of the files by repeatedly pressing F2 until the xed window disappears.
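
Finding groups of consecutive line numbers in the leftovers file, as in substep 2 above, can also be done from the command line. The sketch below assumes that each line of leftovers begins with the input line number (the exact layout is an assumption) and that the numbers appear in ascending order. The function names leftover_ranges and show_lines are hypothetical helpers, not part of cns.

```shell
# Collapse the leading line numbers of a leftovers file into ranges.
# Assumption: each line begins with the number of the input line it came from.
leftover_ranges() {
    awk '
        function emit_range() {
            if (first == last) print first
            else print first "-" last
        }
        $1 ~ /^[0-9]/ {
            n = $1 + 0                      # numeric prefix of the first field
            if (!started) { first = last = n; started = 1; next }
            if (n == last || n == last + 1) { last = n; next }
            emit_range()                    # gap found: close the current range
            first = last = n
        }
        END { if (started) emit_range() }
    ' "$@"
}

# Print lines START through END of a file, e.g. show_lines hazard 12 17
show_lines() {
    sed -n "${2},${3}p" "$1"
}

# Summarize the leftovers written by cns, if the file exists.
if [ -f leftovers ]; then
    leftover_ranges leftovers
fi
```

A range reported as 12-17 can then be inspected with show_lines hazard 12 17 without opening the editor.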

  13. Run cns again, specifying the configuration file along with the other files that were specified before.
    $ cns -c kygs.cfg hazard -e leftovers -i info -o output
    
  14. Examine the input and output files
    $ xed leftovers info output hazard
    
    1. Look at lines 21 and 70 in the input file. These lines contain the element names Description and Bounding_Coordinates. The other information contained on these lines is found elsewhere. In the case of Description the extra text is the title of the data set. With Bounding_Coordinates we see a hint probably used to help the metadata author enter the proper values. The extraneous information on these lines can be ignored, so don't worry that cns has discarded it.

    2. The next group of lines in the leftovers file is 895-903. These begin with what looks like an element name, Horizontal Coordinate System. Check the Alphabetical List of Compound Elements and Data Elements, part of the FGDC metadata standard, to determine that this is not one of the elements. It has been misspelled. But the lines that follow it contain elements that should be part of the element Grid_Coordinate_System. Let's assume that the metadata author has misspelled the element Grid_Coordinate_System as Horizontal Coordinate System: grid. Does cns have a way to handle misspelled elements? Of course!

    3. Open the file alias. On each line, the first word is the correctly-spelled name of a standard element. Following that word are one or more spaces and some text that will be recognized as an alias of that element. You'll see that I found nine misspelled elements in the file hazard.

    4. Go to line 4 of the file alias. What element name spelling will this line correct? Now go to line 1016 of the file output. Because the name of a major section of the metadata was misspelled, cns has treated all of the elements of that section as part of the value of the element Ellipsoid_Name. This is not what the metadata producer intended. With the alias list, cns will properly recognize this section. Let's try it.

    5. Close xed by repeatedly pressing F2.
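
The layout of the alias file described in substep 3 (a correctly spelled element name, one or more spaces, then the alias text) is simple enough to read with the shell itself. The following sketch lists each alias next to the element it stands for. The function name list_aliases is hypothetical, and the sample line in the comment is only an illustration built from the misspelling discussed above, not a quotation from the actual file.

```shell
# List each alias and the standard element name it maps to.
# Format, per the text: first word = correct element name, then one or
# more spaces, then the alias text (which may itself contain spaces).
list_aliases() {
    while read -r element alias_text || [ -n "$element" ]; do
        [ -n "$element" ] || continue       # skip blank lines
        printf '"%s" -> %s\n' "$alias_text" "$element"
    done < "$1"
}

# Illustrative input line (an assumption, not copied from the file alias):
#   Grid_Coordinate_System  Horizontal Coordinate System: grid
if [ -f alias ]; then
    list_aliases alias
fi
```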

  15. Since cns is the only program that understands aliases, the alias file is not specified in the configuration file but is named on the command line with the -a switch.

    Run cns yet again, specifying the alias file and the configuration file along with the other files that were specified before.

    $ cns -c kygs.cfg hazard -a alias -e leftovers -i info -o output
    
  16. Examine the input and output files
    $ xed leftovers info output hazard
    
    1. Note that the leftovers file now contains only two lines, 21 and 70, which we have already determined we don't need.

    2. Scan through the file output and notice that it seems to be in good order. Time to try mp again!

  17. Run mp again, specifying the name of the configuration file, which lets mp know about the extensions.
    $ mp output -c kygs.cfg -e hazard.err
    
  18. Examine the file hazard.err
    $ xed hazard.err
    
    We're down to 36 errors, mostly missing, empty, or improper values. Not bad, coming from 1179. And we didn't have to modify the actual input file at all. The number can be further reduced using the prune function of xtme, but that's part of the next session.
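
Because the file hazard itself was never altered, the whole sequence can be repeated at any time. As a convenience, the final cns and mp commands from steps 15 and 17 could be wrapped in a small shell function. This is only a sketch: the function name preprocess and the DRY_RUN variable are hypothetical, and the output names leftovers, info, and output are simply those used in this exercise.

```shell
# Run cns and then mp with the options used in steps 15 and 17.
# Usage: preprocess INPUT CONFIG ALIASES   e.g. preprocess hazard kygs.cfg alias
# Set DRY_RUN=1 to print the commands instead of running them.
# Assumption: cns returns a zero exit status when it succeeds.
preprocess() {
    pp_input=$1
    pp_config=$2
    pp_aliases=$3
    pp_run=${DRY_RUN:+echo}     # "echo" when DRY_RUN is set, otherwise empty

    $pp_run cns -c "$pp_config" "$pp_input" -a "$pp_aliases" \
        -e leftovers -i info -o output &&
    $pp_run mp output -c "$pp_config" -e "$pp_input.err"
}
```

Running DRY_RUN=1 preprocess hazard kygs.cfg alias prints the two command lines without running them, which is a safe way to confirm the options before overwriting earlier output files.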

  19. This completes the exercise.


This page is <URL:http://geology.usgs.gov/usgs/gdinfo/nsdi/training/2b.html>
Maintained by Peter Schweitzer
Last updated Thursday, 10-May-2012 16:01:29 EDT