A pre-parser for formal metadata

cns: A pre-parser for formal metadata

cns (chew and spit) is a pre-parser for formal metadata designed to help metadata managers convert records that mp cannot parse into records that it can. It takes as input a poorly formatted metadata file and, optionally, a list of element aliases, and outputs (1) a metadata file that can be read by both mp and xtme and (2) a file listing all of the lines that it could not place. The -v switch causes it to write comments describing its decision-making process.

The source code with executables for Microsoft Windows and Linux is available through <https://geology.usgs.gov/tools/metadata/>

Assumptions

The input file has metadata elements identified by name; elements always start with an element name, which is the first alphabetic string of characters on the line. cns does not assume that you have used underscores in the standard element names, and if you have used nonstandard names for standard elements, you can tell cns how to interpret them.

cns does not require indentation of any sort; its job is to guess the indentation using its knowledge of the standard and your arrangement of the elements. It may well guess wrong, so you'll have to check the results carefully for every record.

Method

cns reads the input file line-by-line and builds a parse tree using the same software that xtme and mp use to handle metadata internally. It keeps track of its position in the tree, and as it encounters new elements, it attempts to place them properly in the tree.
For each new element, cns asks whether the element can be
1. a child (component) of the current element,
2. a sibling of the current element, or
3. an orphan grandchild of the current element.
If the element does not fit as one of these, cns moves one node up the tree towards the root node (Metadata) and repeats the procedure.
If the element cannot be placed in this manner, the whole line is spit into a separate file I call leftovers, tagged with the number of the line in the input file.
In addition to parseable output and leftovers, cns can optionally generate an information file that describes its thought processes.
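
The placement procedure above can be sketched in a few lines of Python. This is only an illustration, not the code in cns.c: the SCHEMA table is a hypothetical, heavily trimmed subset of the standard, and the names Node, can_hold, place, and path are invented for the example. It also ignores aliases and the handling of continuation text. The check that skips the orphan-grandchild step at the root follows the 19-Jun-1996 note under Modifications.

```python
# Hypothetical, heavily trimmed subset of the standard's hierarchy:
# each container element maps to the children it may hold.
SCHEMA = {
    "Metadata": ["Identification_Information"],
    "Identification_Information": ["Citation", "Description"],
    "Citation": ["Citation_Information"],
    "Citation_Information": ["Originator", "Publication_Date", "Title"],
    "Description": ["Abstract", "Purpose"],
}


class Node:
    """One element in the parse tree (illustrative only)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

    def add(self, name):
        child = Node(name, parent=self)
        self.children.append(child)
        return child


def can_hold(parent_name, child_name):
    return child_name in SCHEMA.get(parent_name, [])


def place(current, name):
    """Place `name` relative to `current`; return the new node, or None
    if it cannot be placed (the line would then go to leftovers)."""
    node = current
    while node is not None:
        # 1. A child of the current element.
        if can_hold(node.name, name):
            return node.add(name)
        # 2. A sibling of the current element.
        if node.parent is not None and can_hold(node.parent.name, name):
            return node.parent.add(name)
        # 3. An orphan grandchild: fill in exactly one missing container.
        #    Skipped at the root, as described under Modifications.
        if node.parent is not None:
            for middle in SCHEMA.get(node.name, []):
                if can_hold(middle, name):
                    return node.add(middle).add(name)
        # Otherwise move one node up toward the root and try again.
        node = node.parent
    return None


def path(node):
    """Return the slash-separated path from the root to `node`."""
    parts = []
    while node is not None:
        parts.append(node.name)
        node = node.parent
    return "/".join(reversed(parts))
```

For instance, starting at Identification_Information and encountering Citation_Information, the sketch inserts the missing Citation container and places the new element beneath it. An unrecognized name such as Tiitle cannot be placed anywhere and returns None.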

Don't use cns blindly! You have to look carefully at the output files and compare them to the input files to determine whether the conversion faithfully represented your metadata. You may need to make small changes to the original input file, and you may need to create aliases. If the input metadata aren't arranged like the standard, the process may not work well at all. More details appear under "Why NOT to use cns" below.

Usage

cns [options] input_file

where options are any or all of

    -v
    -c config_file
    -i info_file
    -a aliases
    -e leftovers
    -o output_file

config_file: This is the same file you would give to mp or xtme. The principal uses of a config file in cns are to force output of the main Metadata tag and to make cns aware of your local extensions.
info_file: This is where it will put the messages generated as a result of using -v. If -v is used and -i is not, info will go to stdout. If you don't specify -v, none of these messages are generated.
aliases: This is a file relating text strings likely to be found in the metadata to element names from the Standard. This file should be plain ASCII. Each line begins with an element name as mp expects it (underscores included), followed by one or more whitespace characters, followed by an arbitrary string which, when found in the input, will be recognized as representing the named element.
leftovers: This is where cns will put strings that it can't place. If no leftovers file is specified, they will go to stderr.
output_file: This should be readable by mp. It will not generally pass mp without error; we are dealing with nonconforming records that are likely to need at least some editing.
input_file: This is an arbitrary text file. This has to be plain ASCII, of course, but need not have any indentation.
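
To make the aliases file format concrete, here is a minimal Python sketch of how such a file could be read. It is not taken from cns; the function name parse_aliases is invented, and skipping blank lines and folding the alias string to lower case are assumptions made for the example.

```python
def parse_aliases(text):
    """Map each alias string (lower-cased) to its standard element name.

    Each non-blank line is expected to hold an element name, then
    whitespace, then an arbitrary alias string that may contain spaces.
    """
    aliases = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) < 2:
            continue  # element name with no alias string: nothing to record
        element, alias = parts
        aliases[alias.strip().lower()] = element
    return aliases
```

Given the two lines "Title  Tiitle" and "Spatial_Domain Geographic Extent", the sketch maps "tiitle" to Title and "geographic extent" to Spatial_Domain.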

Features:

Skips over non-alphabetic characters at the beginnings of lines. This means that cns can skip over those old section numbers before the element name; it will ignore the number and find the name. If you're ingenious with the aliases file, you can have it skip additional text included in the record to help the metadata producer write the record, such as "(repeat as necessary)".
Fills in missing container elements where only one level is missing. I tried to make it go three levels down but was not happy with the results. You can see what happens by adding -DThreeLevelLookDown in your compiler command line.
Allows you to use spaces in the element names, any letter case, and aliases (for example, you can refer to Spatial_Domain as Geographic Extent).
Identifies all actions by line number in the input file. With -v, cns tells what elements it recognized and what it did with them.
Treats scalar values that cannot be placed, but that immediately follow other scalar values, as text and includes them with the preceding value (as mp does).
Handles extensions to the standard using the same methods as mp and xtme; the same extensions description file works for all three tools.
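
The name-recognition features above (skipping leading section numbers, tolerating spaces and any letter case, and honoring aliases) can be illustrated with a short Python sketch. It is an approximation only: cns uses its own best_match() routine, whereas this example uses regular expressions, and the names LEADING, find_element, and _pattern are invented. The set of skipped leading characters (spaces, numerals, periods, hyphens) follows the 1997 notes under Modifications.

```python
import re

# Leading characters skipped before the element name: whitespace,
# numerals, periods, and hyphens.
LEADING = re.compile(r"^[\s0-9.\-]*")


def _pattern(label):
    """Match `label` with spaces and underscores interchangeable, any case."""
    words = [w for w in re.split(r"[\s_]+", label.strip()) if w]
    body = r"[\s_]+".join(re.escape(w) for w in words)
    # (?!\w) keeps "Title" from matching the start of "Titles".
    return re.compile(body + r"(?!\w)", re.IGNORECASE)


def find_element(line, known_elements, aliases=None):
    """Return (standard_name, value), or None if no known name starts the line.

    `aliases` maps an alias string to a standard element name.
    """
    body = LEADING.sub("", line, count=1)
    candidates = {name: name for name in known_elements}
    candidates.update(aliases or {})
    # Try longer labels first so the most specific name wins.
    for label in sorted(candidates, key=len, reverse=True):
        match = _pattern(label).match(body)
        if match:
            value = body[match.end():].lstrip(" :\t").rstrip()
            return candidates[label], value
    return None
```

With the known names Spatial_Domain and Title, the line "1.2.3 spatial domain: somewhere" is recognized as Spatial_Domain with the value "somewhere", while "Tiitle:" is recognized only if an alias for it is supplied.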

Bugs:

Well, any program designed to take bad data and make it "less bad" in several rather subjective respects isn't likely to satisfy every user. However, I don't think it will dump core on you.

Suggested uses:

If you have a collection of metadata from various sources in various formats with varying degrees of conformance and structure, this tool may well save you some time by finding the structure of the metadata and by substituting the proper element names where nonstandard names have been used.

If you have been producing lots of metadata using a template that, while well structured and comprehensive, can't be read by mp, you're in luck. With a little effort up front, this program should help you to get these records fully parseable and conformant with the best search engines of the National Spatial Data Clearinghouse.

Why NOT to use cns

I've found there are some people who are being taught to use cns as a matter of course. This is not a good idea. The problem is that cns can misinterpret some errors and then mix things up a bit, so you always have to look carefully at the output file as well as the leftovers file. For example, suppose I give it something like this:

Identification_Information:
  Citation:
    Citation_Information:
      Originator: Squirrel, Rocket J.
      Originator: Moose, Bulwinkle J.
      Publication_Date: 1966
      Tiitle:
        Frozen in the Yukon without a dog
      Series_Information:
        Series_Name: RCMP Journal
        Issue_Identification: 12:47

Notice that the element Title is misspelled. I put two i's in it. cns will not notice this. Instead, it will consider the misspelled Title element and the title itself as part of the value of the Publication_Date element. So the result looks like

Identification_Information:
  Citation:
    Citation_Information:
      Originator: Squirrel, Rocket J.
      Originator: Moose, Bulwinkle J.
      Publication_Date:
        1966
        Tiitle:
        Frozen in the Yukon without a dog
      Series_Information:
        Series_Name: RCMP Journal
        Issue_Identification: 12:47

and this metadata record now has no title. cns can correct spelling of elements only if you tell it what to look for and how to make the correction. So in this case if I made an alias file containing

Title  Tiitle

and ran cns like this:


cns test.txt -a alias -e leftovers -o output

then when cns sees Tiitle, it recognizes that as a misspelled form of Title and does the right thing.

The problem is that we don't usually know what has been misspelled ahead of time. That's why cns is often used in an iterative fashion: run it, check both input and output, run it again, check again, and so on. mp, on the other hand, will immediately recognize this misspelling and call attention to it, because mp thinks the indentation is supposed to be correct, so that an element name should appear where the word Tiitle does.

So in general, cns should be used where you know the indentation is a mess, where the element names have been written without the underscores, where section numbers precede the element names, or, more commonly, where all of these conditions are likely. I always run mp first, and go to cns only if the file is really badly formatted.
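
Because cns is typically run several times on the same record, it can be convenient to script the invocation. The Python sketch below only assembles the command line from the options listed under Usage, placing them before the input file as in the synopsis there. The function names build_cns_command and run_cns are invented for the example, and it assumes an executable named cns is on the search path.

```python
import subprocess


def build_cns_command(input_file, aliases=None, leftovers=None, output=None,
                      info=None, config=None, verbose=False, program="cns"):
    """Assemble an argument list of the form: cns [options] input_file."""
    cmd = [program]
    if verbose:
        cmd.append("-v")
    options = (("-c", config), ("-i", info), ("-a", aliases),
               ("-e", leftovers), ("-o", output))
    for flag, value in options:
        if value is not None:
            cmd += [flag, value]
    cmd.append(input_file)
    return cmd


def run_cns(*args, **kwargs):
    """Run cns once; the caller still has to inspect every output file."""
    return subprocess.run(build_cns_command(*args, **kwargs), check=False)
```

Calling build_cns_command("test.txt", aliases="alias", leftovers="leftovers", output="output") yields the same options as the example run shown earlier. After each run, compare the output and leftovers files with the input, adjust the aliases or the input, and run again.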

Tech support:

You WILL have to edit the output. My hope is that cns makes conversion to mp-readable form easier than it would be without it. Please contact me for explanations of apparent misbehavior. I will need to see the original file. Your comments and experiences are welcome.

Modifications

Modified cns.c to believe (incorrectly) that Attribute is not a component of Range_Domain_Values and Enumerated_Domain_Values. This prevents unwanted recursion in lists of attributes.
Modified cns.c to prevent orphan-grandchild lookdown at the root (Metadata) node. This prevents second-level element names appearing first on lines in textual values from inadvertently creating bogus first-level element names. Example: the word "Description" at the beginning of a line in the text of Entity_and_Attribute_Overview causes cns to create a bogus Identification_Information to contain it and spit the rest of the text. (19-Jun-1996)
Modified main() in xtme.c, mp.c, and cns.c to read more than one local extensions file. This should enable people to choose more carefully which extensions will apply to a given input file. (5-Jul-1996)
Modified insert_item_after() in tree.c to maintain proper topology (backward link wasn't made properly). Added functions update_element_lists() and show_element_lists() to actions.c and modified code in other functions there to better handle extensions. Modified cns.c to better handle blank lines. (16-Jul-1996)
Modified best_match() in match.c to not return Wblank when given a blank string. Since cns skips numerals before calling best_match(), we don't want it to think that a line containing only numerals is actually blank; otherwise it forgets the number. (8-Jan-1997)
Modified main() in cns.c to always call read_aliases() since this routine also incorporates extensions into the database of element names. Previously it would ignore extensions if you did not specify an alias file on the command line. Giving it an empty alias file also solves this problem. (28-Apr-1997)
Modified cns.c to read the input file in one gulp, then feed lines to chew_and_spit() from the buffer as needed. This removes a limitation (4096 bytes) on the input line length. (10-Jun-1997)
Modified chew_and_spit() in cns.c to skip only leading spaces, numerals, and periods when searching for the element names. (10-Jun-1997)
Modified chew_and_spit() in cns.c to skip also hyphens. Don't know whether this is a good idea or not. (20-Aug-1997)
Modified main() in cns.c to look for a string associated with "prefix" under "input" in the config file. Modified chew_and_spit() to require this prefix (if one was specified) when identifying element names. If no prefix is specified, behavior is as it has always been. (10-Sep-1998)
Modified main() in cns.c to do find_child_option() rather than find_option() when looking for prefix. (18-Sep-1998)
Modified main() in cns.c to use the name cns.out rather than /dev/tty when no output file is specified by the user. (5-Oct-1998)
Modified match.c to recognize new elements from CSDGM version 2. (20-Oct-1998)
Modified main() in cns.c to handle extensions just like mp does. Modified config.c to recognize "info" element in "output" so you can specify the file name of the info file using the config file. In addition, cns now recognizes one or more command-line options of the form -ext <extension>; the value is a short file name extension that could be stripped off the end of the input file name to find the root name. Built-in extensions are .txt, .text, .met, and .bin. (18-Nov-1999)
Modified main() in cns.c to fix bug introduced in last change causing the program to crash on Windows B-( Also modified cns.c to take its output filespec from output:text:cns rather than output:text:file, since the output is typically used as input to mp, and the latter would cause mp to want to overwrite the file. (23-Nov-1999)
Modified chew_and_spit() to catch another, rather unusual situation. If plain text occurs beneath a non-scalar element, this text is usually spit out. However, if the non-scalar element has only one child, and that child is scalar, we can infer that the text belongs to that child, so insert the missing child and assign the text to it. (30-Oct-2000)
Added a function get_keyword_list() in keyword.c to allow cns to get the keys and texts of standard and bio profile elements as a straightforward array (code in keyword.c now maintains this info as an array of pointers to static structures). This allows cns to not maintain a separate copy of the texts of the element names. The practical effect is to make cns work with the biological data profile using input:profile bio in the config file rather than using the bdp as an extension to the standard. Thanks to Terry Giles of USGS for pointing this problem out. (31-Jan-2001)

Technical contact:

Peter N. Schweitzer
  Mail Stop 954, National Center
  U.S. Geological Survey
  Reston, VA 20192

  Tel: (703) 648-6533
  FAX: (703) 648-6252
  email: pschweitzer@usgs.gov