cns: A pre-parser for formal metadata
cns (chew and spit) is a pre-parser for formal metadata
designed to help metadata managers convert records that cannot be parsed
by mp into records that can be parsed by mp. It
takes as input a poorly formatted metadata file and, optionally, a list of
element aliases, and outputs (1) a metadata file that can be read by both
mp and xtme and (2) a file listing all of the lines that it couldn't
figure out where to put. Throwing a switch (-v) causes it to put
out comments describing its decision-making process.
The source code with executables for Microsoft Windows and Linux
is available through
<https://geology.usgs.gov/tools/metadata/>
Assumptions
The input file has metadata elements identified by name; elements always
start with an element name, which is the first alphabetic string
of characters on the line. cns does not assume that you
have used underscores in the standard element names, and if you have used
nonstandard names for standard elements, you can tell cns how to
interpret them.
cns does not require indentation of any sort; its job is
to guess the indentation using its knowledge of the standard and
your arrangement of the elements. It may well guess wrong, so you'll have
to check the results carefully for every record.
Method
- cns reads the input file line-by-line and builds a parse tree using the same software that xtme and mp use to handle metadata internally. It keeps track of its position in the tree, and as it encounters new elements, it attempts to place them properly in the tree.
- For each new element, cns asks whether the element can be
- a child (component) of the current element,
- a sibling of the current element, or
- an orphan grandchild of the current element.
If the element does not fit as one of these, cns moves one node up the tree towards the root node (Metadata) and repeats the procedure.
- If the element cannot be placed in this manner, the whole line is spit into a separate file I call leftovers, tagged with the number of the line in the input file.
- In addition to parseable output and leftovers, cns can optionally generate an information file that describes its thought processes.
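The placement procedure above can be sketched roughly as follows. This is a minimal Python illustration, not the cns source (which is written in C and shares its parse-tree code with mp and xtme). The abridged CHILDREN table, the Node class, and the place() function are all hypothetical names invented for this sketch. The sketch also skips the orphan-grandchild test at the root node, as described under Modifications.

```python
# Hypothetical, heavily abridged schema: parent element -> allowed children.
CHILDREN = {
    "Metadata": ["Identification_Information"],
    "Identification_Information": ["Citation", "Description"],
    "Citation": ["Citation_Information"],
    "Citation_Information": ["Originator", "Publication_Date", "Title"],
    "Description": ["Abstract", "Purpose"],
}


class Node:
    """One element in the parse tree (illustrative only)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return "/".join(reversed(names))


def place(current, name):
    """Try to attach `name` near `current`; return the new node or None."""
    node = current
    while node is not None:
        # 1. Child (component) of this node?
        if name in CHILDREN.get(node.name, []):
            return Node(name, node)
        if node.parent is not None:
            # 2. Sibling of this node?
            if name in CHILDREN.get(node.parent.name, []):
                return Node(name, node.parent)
            # 3. Orphan grandchild: exactly one container level is missing.
            #    Not attempted at the root, to avoid bogus top-level elements.
            for middle in CHILDREN.get(node.name, []):
                if name in CHILDREN.get(middle, []):
                    return Node(name, Node(middle, node))
        # 4. Otherwise move one node up toward the root and try again.
        node = node.parent
    return None  # a real tool would write the line to the leftovers file


root = Node("Metadata")
ident = place(root, "Identification_Information")
cit_info = place(ident, "Citation_Information")  # fills in missing Citation
title = place(cit_info, "Title")
abstract = place(title, "Abstract")  # climbs up, then fills in Description
```

In this sketch a return value of None corresponds to a line that would go to the leftovers file.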
Don't use cns blindly! You have to look carefully at the
output files and compare them to the input files to determine whether the
conversion faithfully represented your metadata. You may need to make
small changes to the original input file, and you may need to create
aliases. If the input metadata aren't arranged like the standard, the
process may not work well at all.
Usage
cns [options] input_file
where options are any or all of
-v
-c config_file
-i info_file
-a aliases
-e leftovers
-o output_file
- config_file
- This is the same file you would give to mp or xtme. The principal uses of a config file in cns are to force output of the main Metadata tag and to make cns aware of your local extensions.
- info_file
- This is where it will put the messages generated as a result of using -v. If -v is used and -i is not, info will go to stdout. If you don't specify -v, none of these messages are generated.
- aliases
- This is a file relating text strings likely to be found in the metadata to element names from the Standard. This file should be plain ASCII. Each line begins with an element name as mp expects it to be (underscores included), followed by one or more whitespace characters, followed by an arbitrary string that, when found, will be recognized as representing the named element.
- leftovers
- This is where cns will put strings that it can't place. If no leftovers file is specified, they will go to stderr.
- output_file
- This should be readable by mp. It will not generally pass mp without error; we are dealing with nonconforming records that are likely to need at least some editing.
- input_file
- This is an arbitrary text file. This has to be plain ASCII, of course, but need not have any indentation.
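To make the aliases file format concrete, here is a minimal Python sketch of how such a file could be read. It is not the cns implementation: the function name parse_alias_lines is hypothetical, and the handling of blank lines, of lines with no alias text, and of letter case are assumptions made for illustration.

```python
def parse_alias_lines(lines):
    """Map each alias string (lowercased) to its standard element name."""
    aliases = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        # Element name first, then whitespace, then the rest of the line.
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # assumption: a name with no alias text is skipped
        element, alias = parts
        # Assumption: aliases are matched without regard to letter case.
        aliases[alias.strip().lower()] = element
    return aliases


sample = [
    "Spatial_Domain   Geographic Extent",
    "Title Tiitle",
]
table = parse_alias_lines(sample)
```

Note that the alias text may itself contain spaces; only the first run of whitespace separates the element name from the alias.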
Features:
- Skips over non-alphabetic characters at the beginnings of lines. This means that cns can skip over those old section numbers before the element name; it will ignore the number and find the name. If you're ingenious with the aliases file, you can have it skip additional text included in the record to help the metadata producer write the record, such as "(repeat as necessary)".
- Fills in missing container elements where only one level is missing. I tried to make it go three levels down but was not happy with the results. You can see what happens by adding -DThreeLevelLookDown in your compiler command line.
- Allows you to use spaces in the element names, any letter case, and aliases (for example, you can refer to Spatial_Domain as Geographic Extent).
- Identifies all actions by line number in the input file. With -v, cns tells what elements it recognized and what it did with them.
- Treats scalar values that cannot be placed but immediately follow other scalar values as text and includes them, as mp does.
- Handles extensions to the standard using the same methods as mp and xtme; the same extensions description file works for all three tools.
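The name-recognition features above (skipping leading section numbers, tolerating spaces and any letter case) might look something like the following Python sketch. It is an illustration under stated assumptions, not the cns code: STANDARD_NAMES is a tiny hypothetical subset of the standard, find_element and normalize are invented names, the skipped characters are the spaces, numerals, periods, and hyphens mentioned under Modifications, and the sketch assumes a colon ends the element name.

```python
import re

# Hypothetical subset of standard element names, for illustration only.
STANDARD_NAMES = ["Spatial_Domain", "Publication_Date", "Title", "Originator"]


def normalize(text):
    """Lowercase and treat spaces and underscores as equivalent."""
    return re.sub(r"[\s_]+", " ", text).strip().lower()


LOOKUP = {normalize(name): name for name in STANDARD_NAMES}


def find_element(line):
    """Return (element_name, value), or (None, line) if no name is recognized."""
    # Skip leading spaces, numerals, periods, and hyphens (old section numbers).
    stripped = re.sub(r"^[\s\d.\-]+", "", line)
    # Assumption: the element name runs up to the first colon.
    head, _, value = stripped.partition(":")
    name = LOOKUP.get(normalize(head))
    if name is None:
        return None, line
    return name, value.strip()
```

For example, find_element("1.5 spatial domain: somewhere") recognizes Spatial_Domain, while a misspelled name such as "Tiitle:" is not recognized unless an alias supplies it.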
Bugs:
Well, any program designed to take bad data and make it "less bad" in
several rather subjective respects isn't likely to satisfy every user.
However, I don't think it will dump core on you.
Suggested uses:
If you have a collection of metadata from various sources in various
formats with varying degrees of conformance and structure, this tool
may well save you some time by finding the structure of the metadata
and by substituting the proper element names where nonstandard names
have been used.
If you have been producing lots of metadata using a template that, while
well structured and comprehensive, can't be read by mp, you're in luck.
With a little effort up front, this program should help you to get these
records fully parseable and conformant with the best search engines of
the National Spatial Data Clearinghouse.
I've found there are some people who are being taught to use cns
as a matter of course. This is not a good idea.
The problem is that cns can misinterpret some
errors and then mix things up a bit, so you always have to look
carefully at the output file as well as the leftovers file. For
example, suppose I give it something like this:
Identification_Information:
  Citation:
    Citation_Information:
      Originator: Squirrel, Rocket J.
      Originator: Moose, Bulwinkle J.
      Publication_Date: 1966
      Tiitle:
        Frozen in the Yukon without a dog
      Series_Information:
        Series_Name: RCMP Journal
        Issue_Identification: 12:47
Notice that the element Title is misspelled. I put
two i's in it. cns will not notice this. Instead,
it will consider the misspelled Title element and the
title itself as part of the value of the Publication_Date
element. So the result looks like
Identification_Information:
  Citation:
    Citation_Information:
      Originator: Squirrel, Rocket J.
      Originator: Moose, Bulwinkle J.
      Publication_Date:
        1966
        Tiitle:
        Frozen in the Yukon without a dog
      Series_Information:
        Series_Name: RCMP Journal
        Issue_Identification: 12:47
and this metadata record now has no title. cns can correct spelling
of elements only if you tell it what to look for and how to make the
correction. So in this case if I made an alias file containing
Title Tiitle
and ran cns like this:
cns test.txt -a alias -e leftovers -o output
then when cns sees Tiitle, it recognizes that as a
misspelled form of Title and does the right thing.
The problem is that we don't usually know what has been misspelled
ahead of time. That's why cns is often used in an iterative
fashion: run it, check both input and output, run it again, check
again, and so on. mp, on the other hand, will immediately recognize
this misspelling and call attention to it, because mp thinks the
indentation is supposed to be correct, so that an element name
should appear where the word Tiitle does.
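Because misspellings are not known ahead of time, one way to shorten the iteration is to scan a record for labels that nearly match a standard name and propose alias lines for review. This is not something cns does; the Python sketch below uses the standard library's difflib for fuzzy matching, and STANDARD_NAMES, suggest_aliases, and the 0.8 cutoff are all hypothetical choices for illustration.

```python
import difflib

# Hypothetical subset of standard element names.
STANDARD_NAMES = ["Originator", "Publication_Date", "Title", "Series_Name"]


def suggest_aliases(lines, names=STANDARD_NAMES, cutoff=0.8):
    """Yield (line_number, alias_line) for labels that nearly match a name."""
    for number, line in enumerate(lines, start=1):
        label, sep, _ = line.strip().partition(":")
        if not sep or label in names:
            continue  # no label on this line, or already a standard name
        close = difflib.get_close_matches(label, names, n=1, cutoff=cutoff)
        if close:
            # Alias-file layout: standard name, whitespace, string to match.
            yield number, f"{close[0]} {label}"


record = [
    "Publication_Date: 1966",
    "Tiitle:",
    "Frozen in the Yukon without a dog",
]
suggestions = list(suggest_aliases(record))
```

Each suggestion still needs a human decision; a near match is only a hint that a label may be a misspelled element name.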
So in general, cns should be used where you know the indentation is
a mess or where the element names have been written without the
underscores or where the section numbers precede the element names
or, more commonly, where all of these conditions are likely.
I always run mp first, and go to cns only if the file is
really badly formatted.
Tech support:
You WILL have to edit the output. My hope is that cns will make
it easier to convert records to mp-readable form than it would be
without cns. Please
contact me for explanations of apparent misbehavior. I will need to see
the original file. Your comments and experiences are welcome.
Modifications
- Modified cns.c to believe (incorrectly) that Attribute is not a component of Range_Domain_Values and Enumerated_Domain_Values. This prevents unwanted recursion in lists of attributes.
- Modified cns.c to prevent orphan-grandchild lookdown at the root (Metadata) node. This prevents second-level element names appearing first on lines in textual values from inadvertently creating bogus first-level element names. Example: the word "Description" at the beginning of a line in the text of Entity_and_Attribute_Overview causes cns to create a bogus Identification_Information to contain it and spit the rest of the text. (19-Jun-1996)
- Modified main() in xtme.c, mp.c, and cns.c to read more than one local extensions file. This should enable people to choose more carefully which extensions will apply to a given input file. (5-Jul-1996)
- Modified insert_item_after() in tree.c to maintain proper topology (backward link wasn't made properly). Added functions update_element_lists() and show_element_lists() to actions.c and modified code in other functions there to better handle extensions. Modified cns.c to better handle blank lines. (16-Jul-1996)
- Modified best_match() in match.c to not return Wblank when given a blank string. Since cns skips numerals before calling best_match(), we don't want it to think that a line containing only numerals is actually blank; otherwise it forgets the number. (8-Jan-1997)
- Modified main() in cns.c to always call read_aliases() since this routine also incorporates extensions into the database of element names. Previously it would ignore extensions if you did not specify an alias file on the command line. Giving it an empty alias file also solves this problem. (28-Apr-1997)
- Modified cns.c to read the input file in one gulp, then feed lines to chew_and_spit() from the buffer as needed. This removes a limitation (4096 bytes) on the input line length. (10-Jun-1997)
- Modified chew_and_spit() in cns.c to skip only leading spaces, numerals, and periods when searching for the element names. (10-Jun-1997)
- Modified chew_and_spit() in cns.c to also skip hyphens. Don't know whether this is a good idea or not. (20-Aug-1997)
- Modified main() in cns.c to look for a string associated with "prefix" under "input" in the config file. Modified chew_and_spit() to require this prefix (if one was specified) when identifying element names. If no prefix is specified, behavior is as it has always been. (10-Sep-1998)
- Modified main() in cns.c to do find_child_option() rather than find_option() when looking for prefix. (18-Sep-1998)
- Modified main() in cns.c to use the name cns.out rather than /dev/tty when no output file is specified by the user. (5-Oct-1998)
- Modified match.c to recognize new elements from CSDGM version 2. (20-Oct-1998)
- Modified main() in cns.c to handle extensions just like mp does. Modified config.c to recognize "info" element in "output" so you can specify the file name of the info file using the config file. In addition, cns now recognizes one or more command-line options of the form -ext <extension>; the value is a short file name extension that could be stripped off the end of the input file name to find the root name. Built-in extensions are .txt, .text, .met, and .bin. (18-Nov-1999)
- Modified main() in cns.c to fix bug introduced in last change causing the program to crash on Windows B-( Also modified cns.c to take its output filespec from output:text:cns rather than output:text:file, since the output is typically used as input to mp, and the latter would cause mp to want to overwrite the file. (23-Nov-1999)
- Modified chew_and_spit() to catch another, rather unusual situation. If plain text occurs beneath a non-scalar element, this text is usually spit out. However, if the non-scalar element has only one child, and that child is scalar, we can infer that the text belongs to that child, so insert the missing child and assign the text to it. (30-Oct-2000)
- Added a function get_keyword_list() in keyword.c to allow cns to get the keys and texts of standard and bio profile elements as a straightforward array (code in keyword.c now maintains this info as an array of pointers to static structures). This allows cns to not maintain a separate copy of the texts of the element names. The practical effect is to make cns work with the biological data profile using input:profile bio in the config file rather than using the bdp as an extension to the standard. Thanks to Terry Giles of USGS for pointing this problem out. (31-Jan-2001)
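The root-name behavior described in the 18-Nov-1999 entry can be pictured with a short Python sketch. This only illustrates the idea of stripping a known extension; it is not the cns code. The function name root_name is hypothetical, and the sketch assumes that extensions are written with a leading period and compared without regard to case, which the notes above do not state.

```python
# Built-in extensions listed in the 18-Nov-1999 modification note.
BUILT_IN_EXTENSIONS = [".txt", ".text", ".met", ".bin"]


def root_name(input_name, extra_extensions=()):
    """Strip one known extension from the end of `input_name`, if present."""
    candidates = list(extra_extensions) + BUILT_IN_EXTENSIONS
    # Try longer extensions first so a short one cannot shadow a longer one.
    for ext in sorted(candidates, key=len, reverse=True):
        # Assumption: case-insensitive comparison; never strip the whole name.
        if input_name.lower().endswith(ext.lower()) and len(input_name) > len(ext):
            return input_name[: -len(ext)]
    return input_name
```

Under these assumptions, root_name("test.txt") gives "test", and passing [".xml"] as extra_extensions plays the role of an extra -ext option.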
Technical contact:
Peter N. Schweitzer
Mail Stop 954, National Center
U.S. Geological Survey
Reston, VA 20192
Tel: (703) 648-6533
FAX: (703) 648-6252
email: pschweitzer@usgs.gov