Editorial: Can cns fix an HTML file?

A user outside USGS asked:
I am new to metadata and have just started working with cns. Is there a way to convert a .htm through cns?

Reply by Peter Schweitzer on 14 March 2001:

The answer is probably "not in the way you might hope". However, the real answer depends on what you really need to accomplish. What cns does is read a text file looking for metadata element names; when it finds one at the beginning of a line, it tries to incorporate that element into its understanding of the metadata. It builds a tree (= outline) in memory, and keeps track of what branch it's on. So when it finds an element name, it asks itself whether it is allowed to make that element a new branch from the current branch, a "sibling" of the current branch, or not. It's a little more complicated than that, but the point is that it does a limited amount of thinking, and cannot really anticipate what people have written. Its purpose is to clean up a variety of specific "mistakes" that people often make when they create metadata using word processors. It does a helpful job in those specific circumstances, but there are a lot of mistakes that might be obvious to people that it can't figure out--it can exercise only limited discretion.
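To make the tree-building idea concrete, here is a minimal sketch in Python. It is not the cns source code and does not reproduce its actual rules. The tiny ALLOWED_CHILDREN outline, the Node class, and the build_tree and outline functions are all hypothetical names invented for illustration, and the real standard defines far more elements than the handful shown here.

```python
# Hypothetical, heavily simplified outline: parent element -> allowed children.
ALLOWED_CHILDREN = {
    "Metadata": ["Identification_Information"],
    "Identification_Information": ["Citation", "Description"],
    "Citation": ["Citation_Information"],
    "Citation_Information": ["Originator", "Publication_Date", "Title"],
    "Description": ["Abstract", "Purpose"],
}
KNOWN = {child for kids in ALLOWED_CHILDREN.values() for child in kids}


class Node:
    """One branch of the in-memory outline."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.text = []


def build_tree(lines):
    """Place each recognized element as a child or sibling of the current branch."""
    root = Node("Metadata")
    current = root
    unplaced = []  # recognized element names that fit nowhere above the current branch
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        name, sep, rest = stripped.partition(":")
        if sep and name in KNOWN:
            # Walk up from the current branch until some ancestor may hold it.
            anchor = current
            while anchor is not None and name not in ALLOWED_CHILDREN.get(anchor.name, []):
                anchor = anchor.parent
            if anchor is None:
                unplaced.append(stripped)
                continue
            node = Node(name, anchor)
            anchor.children.append(node)
            if rest.strip():
                node.text.append(rest.strip())
            current = node
        else:
            # Not an element name: treat it as text belonging to the current branch.
            current.text.append(stripped)
    return root, unplaced


def outline(node, depth=0):
    """Return the tree as a list of indented element names."""
    rows = ["  " * depth + node.name]
    for child in node.children:
        rows.extend(outline(child, depth + 1))
    return rows
```

In this sketch, a line such as "Title: Example" that follows "Originator: Jane Doe" becomes a sibling of Originator, because the walk upward finds Citation_Information as the nearest branch allowed to hold a Title. A Title line that appears before any Citation_Information branch exists has nowhere to go and lands in the unplaced list, which is roughly the kind of problem a person would then have to fix by hand.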

The HTML files that are often amenable to processing with cns are those that were produced by mp or by something like mp. Occasionally someone will land in a job where they are asked to clean up some older metadata on the web. With a little investigation, they discover that the text files from which the HTML pages were generated (mp generates HTML) have been discarded. Consequently they are left with only the HTML. The procedure in this case is to save the HTML from a browser as text, then run cns and clean up manually what cns missed.
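If there are many pages, saving each one from a browser by hand gets tedious. The following is a rough sketch of doing the "save as text" step with Python's standard html.parser module. The TextExtractor class, the html_to_text function, and the choice of which tags start a new line are my own assumptions. The sketch discards indentation, and its output will not exactly match what a browser saves, so expect to clean up the result by hand after running cns.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, starting a new line at block-level tags."""

    # Assumed set of tags that should begin a new line in the text output.
    BLOCK_TAGS = {"p", "br", "div", "dt", "dd", "li", "h1", "h2", "h3", "h4", "pre", "tr"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string, one non-empty stripped line per row."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

The resulting text can be written to a file and given to cns in the usual way.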

This situation is not common, however. I fear that your task might be more difficult. If the information in your HTML pages doesn't use the FGDC element names and is arranged quite differently from FGDC metadata, you'll need to use another approach to convert them to FGDC structure and format. In that case, here's the procedure I recommend:

  1. Study some resources on my sites:
  2. Look carefully at the metadata you have. Look for concepts that are similar to those in the FGDC metadata; try to make a "crosswalk" table that indicates what elements in your metadata match those in the FGDC. Some won't; you may find that you'll have to pick apart the text to get the specific information of an FGDC field from a more general field in your metadata.
  3. With the Metadata in Plain Language page in a web browser, and your metadata in another web browser window, open Tkme or some other metadata editor. Using the questions as a guide to help you decide what information to transfer first, copy text from your metadata record into the appropriate elements in your FGDC-metadata editor. The first time you do this it will take a long time. If your existing info is fairly consistent, it gets easier as you go. But there's a lot to learn.
  4. When you get to a stopping point, run mp on the resulting metadata. Look at the error report, and also generate the FAQ-style HTML. See if anything doesn't make sense.
  5. Email me with questions when you get stuck. This is hard but it's good work to do; you'll often find things in the original information that you want to clarify anyway.
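
The crosswalk table from step 2 can be kept as a simple lookup, which also shows at a glance which of your fields have no FGDC match yet. This is only a sketch: the field names on the left ("Dataset name", "Author", and so on) are hypothetical stand-ins for whatever your existing records use, and apply_crosswalk is a name made up for this example. The element names on the right are FGDC element names.

```python
# Hypothetical local field names on the left; FGDC element names on the right.
CROSSWALK = {
    "Dataset name": "Title",
    "Author": "Originator",
    "Summary": "Abstract",
    "Date released": "Publication_Date",
}


def apply_crosswalk(record, crosswalk):
    """Split a record into fields that map to FGDC elements and fields that do not."""
    mapped = {}
    unmatched = {}
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is None:
            # No FGDC match yet; this text may need to be picked apart by hand.
            unmatched[field] = value
        else:
            mapped[target] = value
    return mapped, unmatched
```

Fields that end up in the unmatched group are the ones to look at closely: they may hold information that belongs in several FGDC elements, or in none.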