Formal metadata: information and software
Online systems for handling metadata need to rely on the metadata (plural, like data) being predictable in both form and content. Predictability is assured only by conformance to standards. The standard referred to in this document is the Content Standard for Digital Geospatial Metadata. I refer to this as the FGDC standard even though FGDC deals with other standards as well, such as the Spatial Data Transfer Standard (SDTS).
Metadata that conform to the FGDC standard are the basic product of the National Geospatial Data Clearinghouse, a distributed online catalog of digital spatial data. This clearinghouse will allow people to understand diverse data products by describing them in a way that emphasizes aspects that are common among them.
How do we deal with people who complain that it's too hard? The solution in most cases is to redesign the work flow rather than to develop new tools or training. People often assume that data producers must generate their own metadata. Certainly they should provide informal, unstructured documentation, but they should not necessarily have to go through the rigors of fully-structured formal metadata. For scientists or GIS specialists who produce one or two data sets per year, it simply isn't worth their time to learn the FGDC standard. Instead, they should be asked to fill out a less-complicated form or template that will be rendered in the proper format by a data manager or cataloger who is familiar (not necessarily expert) with the subject and well-versed in the metadata standard. If twenty or thirty scientists are passing data to the data manager in a year, it is worth the data manager's time to learn the FGDC standard. With good communication this strategy will beat any combination of software tools and training.
There are 334 different elements in the FGDC standard, 119 of which exist only to contain other elements. These compound elements are important because they describe the relationships among other elements. For example, a bibliographic reference is described by an element called Citation_Information which contains both a Title and a Publication_Date. You need to know which publication date belongs to a particular title; the hierarchical relationship described by Citation_Information makes this clear.
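In the indented notation used for examples later in this document, the relationship looks like this (placeholder values for illustration):

```
Citation_Information:
  Title: (title of the work being cited)
  Publication_Date: (date the work was published)
```

The hierarchy, not the order of lines, is what ties each Publication_Date to its Title.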
Begin excerpt from Hugh Phillips
Over the past several months there have been several messages posted in regard to Metadata 'Core.' Several messages reflected frustration with the complexity of the CSDGM and suggested the option of a simplified form or 'Core' subset of the full standard. At the other end of the spectrum was the concern that the full standard already is the 'Core,' in that it represents the information necessary to evaluate, obtain, and use a data set.
One suggestion has been for the definition of a 'Minimum Searchable Set' i.e. the fields which Clearinghouse servers should index on, and which should be individually searchable. There have been proposals for this set, e.g. the Dublin Core or the recently floated 'Denver Core.' The suggested fields for the 'Denver Core' include:
Theme_Keywords, Place_Keywords, Bounding_Coordinates, Abstract, Purpose, Time_Period_of_Content, Currentness_Reference, Geospatial_Data_Presentation_Form, Originator, Title, Language, Resource_Description

Language (for the metadata) is an element not currently appearing in the CSDGM. I have no problem with the Denver Core as a Minimum Searchable Set; it is mainly just a subset of the mandatory elements of the CSDGM, and hence should always be present.
In contrast, I am very much against the idea of defining a Metadata Content 'Core' which represents a subset of the CSDGM. If this is done, the 'Core' elements will become the Standard. No one will create metadata to the full extent of the Standard and as a result it may be impossible to ascertain certain aspects of a data set such as its quality, its attributes, or how to obtain it. I have sympathy for those who feel that the CSDGM is onerous and that they don't have time to fully document their data sets. Non-federal agencies can do whatever parts of the CSDGM they want to and have time for. As has been said, 'There are no metadata police.' However, whatever the reason for creating abbreviated metadata, it shouldn't be validated by calling it 'Core.' 'Hollow Core' maybe.
Okay. Let us cast aside the term 'Core' because it seems like sort of a loaded word. The fact is, there are many people and agencies who want a shortcut for the Standard because "It's too hard" or because they have "Insufficient time."
"It's too hard" is a situation resulting from lack of familiarity with the CSDGM and from frustration with its structural overhead. This could be remedied by more example metadata and FAQs to increase understanding, by actually trying to follow the standard to the best of one's ability, and by metadata tools that insulate the user from the structure. The first data set documented is always the worst. The other aspect of "It's too hard" is that documenting a data set fully requires a (sometimes) uncomfortably close look at the data and brings home the realization of how little is really known about its processing history.
"Insufficient time" to document data sets is also a common complaint. This is a situation in which managers who appreciate the value of GIS data sets can set priorities to protect their data investment by allocating time to document it. Spending one or two days documenting a data set that may have taken months or years to develop at thousands of dollars in cost hardly seems like an excessive amount of time.
These 'pain' and 'time' concerns have some legitimacy, especially for agencies that may have hundreds of legacy data sets which could be documented, but for which the time spent documenting them takes away from current projects. At this point, it seems much more useful to have a lot of 'shortcut' metadata rather than a small amount of full-blown metadata. So what recommendations can be made to these agencies with regard to a sort of 'minimum metadata' or means to reduce the documentation load?
Example: in the NBII, Taxonomy is a component of Metadata, and is the root of a subtree describing biological classification.
Example: Do not add elements to Supplemental_Information; that field is defined as containing free text.
Example: Description contains the elements Abstract, Purpose, and Supplemental_Information. These components must not be replaced with free text.
Example: To indicate contact information for originators who are not designated as the Point_of_Contact, create an additional element Originator_Contact, consisting of Contact_Information. But the element Point_of_Contact is still required even if the person who would be named there is one of the originators.
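Using the extension-definition notation shown later in this document, such an extension might be declared like this (the Parent and the SGML short tag here are assumptions for illustration, not part of any registered profile):

```
Local:
  Name: Originator_Contact
  Parent: Identification_Information
  Child: Contact_Information
  SGML: origcon
```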
First you have to understand both the data you are trying to describe and the standard itself. Then you need to decide about how you will encode the information. Normally, you will create a single disk file for each metadata record, that is, one disk file describes one data set. You then use some tool to enter information into this disk file so that the metadata conform to the standard. Specifically,
The FGDC standard is truly a content standard. It does not dictate the layout of metadata in computer files. Since the standard is so complex, this has the practical effect that almost any metadata can be said to conform to the standard; the file containing metadata need only contain the appropriate information, and that information need not be easily interpretable or accessible by a person or even a computer.
This rather broad notion of conformance is not very useful. Unfortunately it is rather common. Federal agencies wishing to assert their conformance with the FGDC standard need only claim that they conform; challenging such a claim would seem to be petty nitpicking. But to be truly useful, the metadata must be clearly comparable with other metadata, not only in a visual sense, but also to software that indexes, searches, and retrieves the documents over the internet. For real value, metadata must be both parseable, meaning machine-readable, and interoperable, meaning they work with software used in the Clearinghouse.
To parse information is to analyze it by disassembling it and recognizing its components. Metadata that are parseable clearly separate the information associated with each element from that of other elements. Moreover, the element values are not only separated from one another but are clearly related to the corresponding element names, and the element names are clearly related to each other as they are in the standard.
In practice this means that your metadata must be arranged in a hierarchy, just as the elements are in the standard, and they must use standard names for the elements as a way to identify the information contained in the element values.
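To make "parseable" concrete, here is a minimal Python sketch (an illustration only, not any official Clearinghouse tool; the parse function and its tree layout are inventions for this example) that turns the indented Element: value notation into a nested structure in which each value is clearly tied to its element name, and each element to its parent:

```python
# Minimal sketch: parse indented "Element: value" metadata text into a
# nested (name, value, children) tree. Illustrative only -- real tools
# such as mp handle many cases this toy parser does not.

def parse(text, indent=2):
    root = {"name": "root", "value": "", "children": []}
    stack = [(-1, root)]  # (depth, node) pairs, deepest last
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip())) // indent
        name, _, value = line.strip().partition(":")
        node = {"name": name, "value": value.strip(), "children": []}
        # pop back up to this node's parent, then attach
        while stack and stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root

example = """Citation_Information:
  Title: Geometeorological data collected by the USGS Desert Winds Project
  Publication_Date: 1995
"""

tree = parse(example)
citation = tree["children"][0]
print(citation["name"])                           # Citation_Information
print([c["name"] for c in citation["children"]])  # ['Title', 'Publication_Date']
```

The point is not this particular code but the property it relies on: because names, values, and hierarchy are unambiguous in the file, software can recover the same structure the standard defines.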
To operate with software in the Clearinghouse, your metadata must be readable by that software. Generally this means that they must be parseable and must identify the elements in the manner expected by the software.
The FGDC Clearinghouse Working Group has decided that metadata should be exchanged in Standard Generalized Markup Language (SGML) conforming to a Document Type Definition (DTD) developed by USGS in concert with FGDC.
You can create metadata in SGML using a text editor. However, this is not advisable because it is easy to make errors, such as omitting, misspelling, or misplacing the tags that close compound elements. These errors are difficult to find and fix. Another approach is to create the metadata using a tool that understands the Standard.
One such tool is Xtme (which stands for Xt Metadata Editor). This editor runs under UNIX with the X Window System, version 11, release 5 or later. Its output format is the input format for mp (described below).

Hugh Phillips has prepared an excellent summary of metadata tools, including reviews and links to the tools and their documentation. It is at <http://sco.wisc.edu/wisclinc/metatool/>
So the variable Quality1 exists only to indicate that some values of Measurement1 are questionable. Note that values of Measurement2 are not qualified in this way; variations in the quality of Measurement2 are presumably described in the metadata.
In summary, the Attribute component of Range_Domain and Enumerated_Domain allows the metadata to describe data in which some attribute qualifies the value of another attribute.
I agree with Doug that this describes data with more structural detail than many people expect, and in the case I described there were so many variables (430) in the data set that I quickly gave up on the entire Detailed_Description and provided an Overview_Description instead. If we had some fancy tools (Visual Data++?) that understood relationships among attributes like this, people would be more interested in providing the metadata in this detailed manner. Nevertheless I think the basic idea makes sense.
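Sketched in the standard's element names, the Quality1 example above might look like the following (a simplified, illustrative fragment: the value 'q' and the definitions are made up, and mandatory elements such as definition sources are omitted):

```
Attribute:
  Attribute_Label: Quality1
  Attribute_Definition: flag qualifying values of Measurement1
  Attribute_Domain_Values:
    Enumerated_Domain:
      Enumerated_Domain_Value: q
      Enumerated_Domain_Value_Definition: the corresponding Measurement1 value is questionable
      Attribute:
        Attribute_Label: Measurement1
```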
There are two solutions. The first is to "fix the standard" by using an extension. For example, I could define an extension as
Local:
  Name: Process_Time_Period
  Parent: Process_Step
  Child: Time_Period_Information
  SGML: procper

Then to describe something that happened between 1960 and 1998, I could write
...
Process_Step:
  Process_Description: what happened over these many years...
  Process_Date: 1998
  Process_Time_Period:
    Time_Period_Information:
      Range_of_Dates/Times:
        Beginning_Date: 1960
        Ending_Date: 1998

This is elegant in its way, but is likely to be truly effective only if many people adopt this convention. A more practical solution for the present would be to skirt the rules about the content of the Process_Date element. In this example, I would just write
...
Process_Step:
  Process_Description: what happened over these many years...
  Process_Date: 1960 through 1998

Now see that the value of Process_Date begins with a proper date, and contains some additional text. So any software that looks at this element will see a date, and may complain that there's more stuff there, but will at least have that first date. That's what mp does; if it finds a date, it won't complain about any additional text it finds after the date.
We should aim to handle metadata using SGML in the future, but I should continue to develop mp and its relatives, ensuring that my tools support the migration to SGML. We need much more expertise devoted to SGML development, and that isn't happening yet. For practical purposes the more complete solution at the moment is xtme->mp or cns->xtme->mp. These tools handle arbitrary extensions already, and mp can create SGML output if needed for subsequent processing. Where possible, we should encourage agencies to invest in the development of tools for handling metadata in SGML, but this isn't a "buy it" problem, it's a "learn it" problem--much more expensive. With the upcoming revision of the metadata standard, we need to build a DTD that can be easily extended.
In principle, you could create elaborate rules to check mandatory-if-applicable (MIA) dependencies, but I think that would complicate mp too much, making it impossible to support and maintain.
Title: Geometeorological data collected by the USGS Desert Winds Project at Gold Spring, Great Basin Desert, northeastern Arizona, 1979 - 1992
Note that mp now provides "preformatting" in which groups of lines that begin with greater-than symbols will be rendered preformatted, prefaced with <pre> and followed by </pre> in the HTML output. The leading >'s will be omitted from the HTML output. For example, the following metadata element
Completeness_Report:
Data are missing for the following days
>19890604
>19910905
>19980325

will be rendered as follows in HTML:
<dt><em>Completeness_Report:</em>
<dd>
<pre>
19890604
19910905
19980325
</pre>
Also, I would point out that during the two years of its existence mp has a better support history than many of the other tools for producing metadata (see mp-doc). Corpsmet and MetaMaker are probably the next-best-supported tools. The PowerSoft-based NOAA tool was created by contractors who have since disappeared. USGS-WRD tried to pass maintenance of DOCUMENT off to ESRI, and ESRI hasn't made needed improvements; Sol Katz (creator of blmdoc) still works for BLM but has been assigned to other work. None of the other tools seems to have gotten wide acceptance. Paying contractors to write software seems to carry no guarantee that the software will be adequately supported. Home-grown software carries no guarantee either. Whether you "pays your money" or not, you still "takes your chances".
On the other hand...
The source code of mp is freely available. It has been built for and runs on many systems--I support 6 different varieties of Unix, MS-DOS, and Win95+NT, and I know it is working on several other Unix systems. The task of updating it might be daunting for an individual not conversant in C, but if I were hit by a truck tomorrow, the task wouldn't likely fall to an individual--it would be a community effort because lots of people have come to depend on it.
This shouldn't be necessary, since metadata are best printed from one of the HTML formats, and the web browser will wrap the text to fit the screen and page. However, for those who really want to have the plain text version fit within an 80-column page, there is a way to do it. Use a config file, with an output section, and within that a text section. Within output:text, specify wrap 80 like this:
output
  text
    wrap 80

You don't have to use 80. I think it looks better with a narrower page, like 76. mp factors in the indentation of each line, assuming 2 spaces per level of indentation. Blank lines are preserved. Any line beginning with the greater-than sign > is preserved as is.
Note that this affects only the text output. Neither mp nor cns ever modifies the input file. But if you like the resulting text file, you can replace your input file with it.
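The wrapping behavior described above can be sketched in Python (an illustration of the idea only, not mp's actual algorithm; wrap_metadata is a hypothetical name invented for this example):

```python
# Sketch: wrap each metadata line to a target width while keeping its
# indentation, and pass blank lines and lines beginning with ">" through
# untouched -- roughly the behavior described for mp's "wrap" setting.
import textwrap

def wrap_metadata(text, width=80):
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(">"):
            out.append(line)  # blank and ">" lines are preserved as is
            continue
        indent = " " * (len(line) - len(line.lstrip()))
        # continuation lines get two extra spaces of indentation
        out.extend(textwrap.wrap(line, width=width,
                                 subsequent_indent=indent + "  ") or [line])
    return "\n".join(out)

sample = "  Abstract: " + "word " * 30
for line in wrap_metadata(sample, width=40).splitlines():
    print(line)
```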
CREATE TABLE attribute (
    pk_attribute         key_t PRIMARY KEY,
    fk_enumerated_domain key_t REFERENCES enumerated_domain,
    attribute_stuff ...
)

CREATE TABLE enumerated_domain (
    pk_enumerated_domain key_t PRIMARY KEY,
    fk_attribute         key_t REFERENCES attribute,
    enumerated_domain_stuff ...
)

where key_t is a type for storing unique identifiers (e.g., Informix's SERIAL).
The tricky part, of course, is getting the information back OUT again. It's true, you can't write a query in standard SQL-92 that will traverse the tree implicit in the above example (i.e., will ping-pong between fk_enumerated_domain and fk_attribute until fk_attribute is NULL.)
However, most (all?) DBMS vendors support procedural extensions (e.g., looping) to SQL, which make the query possible. Additionally, some vendors have extended SQL to directly support tree-structured information (e.g., Oracle's CONNECT BY.)
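As a sketch of how such a traversal can work, the following Python example uses SQLite's WITH RECURSIVE (a recursive common table expression, absent from SQL-92 but now widely supported) against a simplified version of the schema above; the label and value columns are stand-ins for the "..._stuff" placeholders:

```python
# Sketch: walk the attribute / enumerated_domain tree with a recursive CTE.
# Simplified schema for illustration; integer keys stand in for key_t.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE attribute (
    pk_attribute INTEGER PRIMARY KEY,
    fk_enumerated_domain INTEGER REFERENCES enumerated_domain,
    label TEXT);
CREATE TABLE enumerated_domain (
    pk_enumerated_domain INTEGER PRIMARY KEY,
    fk_attribute INTEGER REFERENCES attribute,
    value TEXT);
-- Measurement1 has a domain value 'q' qualified by attribute Quality1
INSERT INTO attribute VALUES (1, NULL, 'Measurement1');
INSERT INTO enumerated_domain VALUES (10, 1, 'q');
INSERT INTO attribute VALUES (2, 10, 'Quality1');
""")

rows = con.execute("""
WITH RECURSIVE tree(pk, label, depth) AS (
    SELECT pk_attribute, label, 0 FROM attribute
     WHERE fk_enumerated_domain IS NULL
    UNION ALL
    SELECT a.pk_attribute, a.label, t.depth + 1
      FROM tree t
      JOIN enumerated_domain d ON d.fk_attribute = t.pk
      JOIN attribute a ON a.fk_enumerated_domain = d.pk_enumerated_domain
)
SELECT label, depth FROM tree ORDER BY depth
""").fetchall()
print(rows)  # [('Measurement1', 0), ('Quality1', 1)]
```

The recursion ping-pongs between the two tables exactly as described: attribute to enumerated_domain via fk_attribute, back to attribute via fk_enumerated_domain, stopping when no deeper rows match.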
Ultimately, you have to consider why you're storing FGDC metadata in a relational database. As we learned on the Alexandria Project:
I think how you handle it depends on what you do with the data:
Use their metadata. No real need to change it, but if you do some non-destructive change like reprojection, just add a Process_Step to the metadata indicating what you did. You can even add a Process_Contact with your info so that anyone who has questions about that particular operation can ask questions.
Start with their metadata. Take the Contact_Information in Point_of_Contact, and move it to all of the Process_Steps that don't already have a Process_Contact. Replace Point_of_Contact with yourself. Take Metadata_Contact, move it into a new Process_Step whose description is "create initial metadata", where Process_Date is the previous value of Metadata_Date. Modify other parts of the metadata to reflect your changes to the data (document these in your own Process_Step, too), then make yourself the Metadata_Contact. Tag--you're IT!
Use the existing metadata record to create a Source_Information which you will annotate (Source_Contribution) to describe how you incorporated this layer in your own work. Put this Source_Information into a new metadata record that describes your data; it will thus properly attribute the work of the people who created the source data.
It depends on what sort of errors they are. ArcCatalog, like Tkme, must allow you to create metadata with errors such as missing elements and empty elements. If I'm using a metadata editor, I don't want it to refuse to work if I merely leave something out--I might want to work in stages, adding some information now and more information later.
What's more important, of course, is that ArcCatalog has no way to know whether what people type into it is actually correct (meaning what you say about the data--is it right?). So we don't want people to rely on mp alone to judge the correctness of metadata. We should instead use mp to help us find out what we've left out or done wrong in the structure of the metadata, and then we have to read the metadata itself to figure out whether it actually describes the data well.
There is one way that valid metadata from ArcCatalog might be judged incorrect by mp, however. If I create metadata in ArcCatalog, then read it with mp but without telling mp that the metadata record uses ESRI extensions, then mp will complain that some of the elements aren't recognized. For example, ESRI includes in the metadata an element called Attribute_Type that tells whether a given attribute is an integer, character, or floating-point variable. This isn't in the FGDC standard, so mp will complain when it sees this element in the metadata. The fix is to tell mp you're using the ESRI extensions. A config file can be used for this purpose.