Keywords in metadata

Keywords in metadata: what to expect from them, what not to expect; how to choose them; how to get the most value from them.

Keywords is one of the most useful sections of formal metadata, yet it is often misunderstood and consequently is less effective than it could be. This note is intended to help you understand keywords so that you can use them to make your data easier for users to find and understand. I wrote this about CSDGM ("FGDC") metadata, but the ideas apply equally well to ISO metadata.

Background:

Keywords is a section within Identification_Information that contains any number of subsections for Theme, Place, Stratum, and Temporal keywords. All of the subsections have a similar structure: a thesaurus identifier indicating the source from which the keywords were drawn, followed by one or more keywords drawn from that source:
Identification_Information:
  Keywords:
    Theme:
      Theme_Keyword_Thesaurus: (official name for a vocabulary)
      Theme_Keyword: a term
      Theme_Keyword: another term
      Theme_Keyword: yet another term
    Place:
      Place_Keyword_Thesaurus:
      Place_Keyword:
      Place_Keyword:
      Place_Keyword:
    Stratum:
      Stratum_Keyword_Thesaurus:
      Stratum_Keyword:
      Stratum_Keyword:
      Stratum_Keyword:
    Temporal:
      Temporal_Keyword_Thesaurus:
      Temporal_Keyword:
      Temporal_Keyword:
      Temporal_Keyword:
In this note I'm going to focus on theme keywords because the CSDGM requires at least one Theme_Keyword and because theme keywords are more likely than the other types to be used effectively for finding your data.
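As a rough illustration of the structure above, the following sketch reads the keyword sections from a record and checks that at least one theme keyword is present. It assumes the record is stored in the XML encoding of the CSDGM with the conventional short tag names (idinfo, keywords, theme, themekt, themekey, and so on). The sample record, its keywords, and the function names read_keywords and has_theme_keyword are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical record fragment; thesaurus names and keywords are examples only.
SAMPLE = """
<metadata>
  <idinfo>
    <keywords>
      <theme>
        <themekt>USGS Thesaurus</themekt>
        <themekey>mine drainage</themekey>
        <themekey>water quality</themekey>
      </theme>
      <theme>
        <themekt>ISO 19115 Topic Category</themekt>
        <themekey>environment</themekey>
      </theme>
      <place>
        <placekt>Common Geographic Areas</placekt>
        <placekey>Colorado</placekey>
      </place>
    </keywords>
  </idinfo>
</metadata>
"""

# (group tag, thesaurus tag, keyword tag) for each keyword type
KEYWORD_TAGS = {
    "Theme": ("theme", "themekt", "themekey"),
    "Place": ("place", "placekt", "placekey"),
    "Stratum": ("stratum", "stratkt", "stratkey"),
    "Temporal": ("temporal", "tempkt", "tempkey"),
}


def read_keywords(xml_text):
    """Return {kind: [(thesaurus, [keyword, ...]), ...]} for one record."""
    root = ET.fromstring(xml_text)
    result = {}
    for kind, (group, thesaurus_tag, keyword_tag) in KEYWORD_TAGS.items():
        groups = []
        for node in root.findall(f"./idinfo/keywords/{group}"):
            thesaurus = (node.findtext(thesaurus_tag) or "").strip()
            terms = [(k.text or "").strip() for k in node.findall(keyword_tag)]
            groups.append((thesaurus, terms))
        result[kind] = groups
    return result


def has_theme_keyword(keywords):
    """True if any Theme group holds at least one non-empty keyword."""
    return any(term for _, terms in keywords["Theme"] for term in terms)


keywords = read_keywords(SAMPLE)
for kind, groups in keywords.items():
    for thesaurus, terms in groups:
        print(f"{kind} [{thesaurus}]: {', '.join(terms)}")
print("Has a theme keyword:", has_theme_keyword(keywords))
```

Keeping each list of keywords paired with its thesaurus matters, because a term only has a defined meaning relative to the vocabulary it was drawn from.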

What are keywords for?

Keywords are intended to categorize your data. That allows people and computers to compose lists of datasets that share some important characteristics that are meaningful to other people; your dataset will be included in some of those lists and not others. It is tempting to add keywords that have other purposes such as identification or description, but those keywords do not improve the categorization of the data and those purposes are better served by improving other sections of the metadata record.

You use keyword categories often, without thinking about it. When you go to the grocery store, you look for signs indicating which aisle contains soup, for example. Those words indicate types of things, and they are used because they help you separate things that are soup from things that are not soup. You might even use a store directory to tell you where to find the categories of things you need to buy.

Note what they are not for. Keywords are not necessary to support full-text search, because full-text search should match text anywhere in the metadata record. For example, if you want to search using the name of the project that generated the data, write the name of the project into Purpose or Data_Set_Credit; you don't need to put that name into a keyword.

Keywords will not specify all of the scientific problems to which your data might be applied. You only know what you wanted to do with the data (that's what you wrote in the Purpose, right?), and in the future, problems will arise for which your data are helpful, but you don't yet know what those problems will be.

While it's possible to use identifiers effectively in keyword sections, there are usually better places in the metadata to put those identifiers. Remember that an identifier doesn't say what the data are about, it says which data they are.

While they do usually describe your data, description is not the purpose of keywords. You've written a lot of text that describes the data; some of that should be in the title, some in the Abstract and Purpose, Process_Step, and especially in the Attribute_Definition elements. So you should not depend on keywords to describe the data.

What keywords should I use?

Keywords are provided by controlled vocabularies. A controlled vocabulary is a collection of terms chosen for a specific purpose with clearly indicated meanings and relationships. Those relationships can be important, as in a strict hierarchy where each narrower term is a type of or part of its broader term, or those relationships can be unimportant, as in an alphabetical list where each item in the list is no more strongly related to any term than to any other. For example, the ISO 19115 Topic Category vocabulary is a simple list of 19 values that should be used to fill in the TopicCategory field of ISO metadata. Its terms may be related, as climatologyMeteorologyAtmosphere is related to oceans, but the vocabulary itself does not specify how the terms are related. In contrast, the USGS Thesaurus is arranged in a strict hierarchy, so that an information search system can return related resources that were categorized using terms more specific than the one you asked for. For example, data in the category mine drainage are included if you search for pollution, because mine drainage is a type of pollution.
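To make the hierarchical case concrete, here is a minimal sketch of a search that also returns data categorized with narrower terms. The vocabulary is a toy: only the pollution and mine drainage relationship comes from the text above, and the other terms, the dataset titles, and the function names are hypothetical. A real system would load the hierarchy from the vocabulary's own source.

```python
# Toy vocabulary: each term maps to its broader term (None for a top term).
BROADER = {
    "pollution": None,
    "mine drainage": "pollution",
    "acid mine drainage": "mine drainage",  # hypothetical narrower term
    "oil spills": "pollution",              # hypothetical
    "geology": None,                        # hypothetical
}

# Hypothetical datasets and the keywords assigned to them.
DATASETS = {
    "Stream chemistry below abandoned mines": ["mine drainage"],
    "Coastal spill response survey": ["oil spills"],
    "Bedrock map": ["geology"],
}


def narrower_closure(term):
    """Return the term plus every term below it in the hierarchy."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in BROADER.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def search(term):
    """Return titles of datasets tagged with the term or any narrower term."""
    wanted = narrower_closure(term)
    return sorted(
        title for title, assigned in DATASETS.items()
        if wanted.intersection(assigned)
    )


print(search("pollution"))
print(search("mine drainage"))
```

A flat list such as the ISO topic categories supports only exact matches, so the expansion step would have nothing to expand.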

Many controlled vocabularies are available. You should use a vocabulary if its scope overlaps the ideas, methods, or other characteristics of your data, and if you might expect that vocabulary to be exploited well in the discovery interfaces provided by the organization.

Generally you should choose the most specific keywords that apply to your data. Ideally the web interfaces through which people discover your data will know about the vocabularies and can provide browse interfaces or search aids that take advantage of the hierarchical nature of the vocabularies.
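One reading of this advice is that a broader term adds little when a narrower term from the same hierarchy already applies, provided the discovery interface understands the hierarchy. The sketch below drops candidate keywords that are ancestors of other candidates. The hierarchy and the function names are hypothetical, and whether to keep broader terms anyway depends on the interfaces your organization provides.

```python
# Toy hierarchy (hypothetical): term -> broader term, None for a top term.
BROADER = {
    "pollution": None,
    "mine drainage": "pollution",
    "geology": None,
}


def ancestors(term):
    """Return every term above the given term in the hierarchy."""
    found = set()
    parent = BROADER.get(term)
    while parent is not None:
        found.add(parent)
        parent = BROADER.get(parent)
    return found


def most_specific(candidates):
    """Keep only candidates that are not broader terms of another candidate."""
    redundant = set()
    for term in candidates:
        redundant |= ancestors(term)
    return [term for term in candidates if term not in redundant]


print(most_specific(["pollution", "mine drainage", "geology"]))
```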

This interactive section shows controlled vocabularies that are available through some web services. Use it to explore some of those vocabularies.

How can I assign keywords to my metadata?

When you're writing metadata, you should try to choose keywords that categorize your data well relative to other data produced by the organization. The keywords should be reviewed and possibly revised by another person who understands the data but who also has a broad knowledge of the larger collection of data produced by the organization. This is similar to books in libraries: book authors don't write the catalog records that people use to find their books in the library's catalog. That's what library catalogers do, both because they understand the library system software and because they understand the library's collection.

That means you should try to assign good category terms, but don't agonize about them; instead, help the reviewer understand the data. Many authors will acquire the broader perspective needed to assign good keywords because they're curious and careful people, but this step should not pose a major obstacle to the release of the data.

The mechanics of assigning keywords to metadata vary with the software tools you use to edit metadata. The simplest method is to type the elements and values using a text editor. More sophisticated tools can include keyword-selection software that relies on web services to help you choose keywords. For example, Tkme has a Keywords menu that is intended to help. The Metadata Wizard Toolbox draws on the same web services and has a somewhat different user interface. The Online Metadata Editor has (or will acquire) a specific interface to carry out this function as well.

Where else in the metadata elements should I use controlled vocabularies?

Controlled vocabularies should be used wherever it's practical in the metadata. Examples:

Series_Name element within Citation_Information
  Should be drawn from a standard list of USGS publication series names if the publications came from the USGS.
Format_Name within Digital_Transfer_Information
  Should be written consistently as well; this would benefit from a controlled vocabulary.
Metadata_Standard_Name and Metadata_Standard_Version
  These are controlled vocabularies but their values are spread across several different documents.

Some of these can be evaluated using the secondary validation process of the Geospatial Metadata Validation Service.

How can I see the value of keywords?

The best reason to use controlled vocabularies when choosing keywords is so that the meaning and spelling of the terms you chose will match those that other people use in their metadata. When many metadata records use the same set of controlled terms, those terms can be shown as links to people who are looking for data, so that the data seekers don't have to guess what terms we used and they don't have to guess how we spelled them. If the controlled vocabulary is hierarchical, we can also give the users options to choose broader terms or narrower terms, so they can drill down to topics that interest them and see what data we have pertaining to those topics.
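The spelling problem can be illustrated with a small check of proposed keywords against a vocabulary. This is only a sketch: the vocabulary excerpt is hypothetical, check_keywords is a made-up name, and the similarity cutoff of 0.8 is an arbitrary assumption. It uses difflib from the Python standard library to suggest near matches.

```python
import difflib

# Hypothetical excerpt of a controlled vocabulary.
VOCABULARY = ["mine drainage", "water quality", "groundwater", "earthquakes"]


def check_keywords(keywords, vocabulary):
    """Split keywords into exact matches and problems.

    Returns (accepted, problems), where problems maps each keyword that
    is not an exact match to a list of suggested vocabulary terms.
    """
    by_folded = {term.casefold(): term for term in vocabulary}
    accepted, problems = [], {}
    for keyword in keywords:
        if keyword in vocabulary:
            accepted.append(keyword)
            continue
        folded = keyword.strip().casefold()
        if folded in by_folded:
            # Differs only in case or surrounding whitespace.
            problems[keyword] = [by_folded[folded]]
        else:
            close = difflib.get_close_matches(
                folded, list(by_folded), n=3, cutoff=0.8
            )
            problems[keyword] = [by_folded[c] for c in close]
    return accepted, problems


accepted, problems = check_keywords(
    ["mine drainage", "Water Quality", "ground water", "seismicity"],
    VOCABULARY,
)
print("Accepted:", accepted)
for keyword, suggestions in problems.items():
    print(f"Not in vocabulary: {keyword!r}; suggestions: {suggestions}")
```

Note that string similarity finds spelling variants but not synonyms: a keyword that names the same idea with different words gets no suggestion here, which is one reason review by a person who knows the vocabulary is still useful.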

A category browse web interface

Here is an example in which a variety of category terms are shown as entry points from which users can choose metadata records or, if they wish, can navigate to more appropriate category terms:

How this example works:
  1. Metadata records are assigned keywords using a few consistent controlled vocabularies (here the USGS Thesaurus, Alexandria Digital Library Feature Type Thesaurus, Common Geographic Areas, and USGS Publication Series).
  2. A small database (here SQLite) is created to hold the title, URL, description, and keywords of each metadata record.
  3. A special index is created that specifies, for any category term, which metadata records have been assigned that term or any of its narrower terms.
  4. An assortment of terms is chosen as entry links; these are grouped heuristically for display on the web index page.
  5. A category browse page is created that shows, for a given category, the broader, narrower, and related categories, along with a list of metadata records pertinent to the category, for example contamination and pollution.
  6. An alphabetical listing of terms is also given to enable people to find categories alphabetically. In this case the alphabetical listing is designed as "keyword in context", so an entry appears for every word in each category term, not just the first word.
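Steps 2 and 3 above might look something like the following sketch, which uses the sqlite3 module from the Python standard library. The table and column names, the toy hierarchy, and the sample records are all assumptions made for illustration; this is not the schema of the actual example. The index is filled by registering each record under its assigned term and under every broader term above it, so a lookup for a category also finds records assigned to narrower terms.

```python
import sqlite3

# Toy hierarchy (hypothetical): narrower term -> broader term.
BROADER = {"mine drainage": "pollution", "oil spills": "pollution"}

# Hypothetical records: (title, url, description, assigned keywords).
RECORDS = [
    ("Stream chemistry below abandoned mines", "https://example.org/a",
     "Water samples.", ["mine drainage"]),
    ("Coastal spill response survey", "https://example.org/b",
     "Shoreline observations.", ["oil spills"]),
    ("Regional contaminant summary", "https://example.org/c",
     "Compilation.", ["pollution"]),
]

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE record (id INTEGER PRIMARY KEY, title TEXT, url TEXT, description TEXT);
CREATE TABLE keyword (record_id INTEGER, term TEXT);
CREATE TABLE category_index (term TEXT, record_id INTEGER, UNIQUE (term, record_id));
""")

for title, url, description, terms in RECORDS:
    cursor = db.execute(
        "INSERT INTO record (title, url, description) VALUES (?, ?, ?)",
        (title, url, description),
    )
    record_id = cursor.lastrowid
    for term in terms:
        db.execute("INSERT INTO keyword VALUES (?, ?)", (record_id, term))
        # Index the record under the assigned term and each broader term.
        current = term
        while current is not None:
            db.execute(
                "INSERT OR IGNORE INTO category_index VALUES (?, ?)",
                (current, record_id),
            )
            current = BROADER.get(current)
db.commit()


def records_for(term):
    """Return titles of records assigned the term or any narrower term."""
    rows = db.execute(
        "SELECT r.title FROM category_index c "
        "JOIN record r ON r.id = c.record_id "
        "WHERE c.term = ? ORDER BY r.title",
        (term,),
    )
    return [title for (title,) in rows]


print(records_for("pollution"))
print(records_for("mine drainage"))
```

Precomputing the index trades some storage for simple, fast lookups when a category browse page is generated.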
Note that this web interface is rich with well-identified categorical links and substantive textual information. This makes an interface like this an excellent target for text-based search engines. So the metadata presented there is available through this browse interface, is commonly indexed and available through external search engines, and is also available through the interfaces provided by the USGS Science Data Catalog and, in principle, the DOI data catalog and data.gov.
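The "keyword in context" listing described in step 6 can be sketched in a few lines. The terms here are illustrative and kwic_index is a made-up name. This version makes an entry for literally every word; a production index might skip words such as "and".

```python
def kwic_index(terms):
    """Return sorted (word, term) pairs, one for every word in each term."""
    entries = set()
    for term in terms:
        for word in term.split():
            entries.add((word.casefold(), term))
    return sorted(entries)


# Illustrative category terms.
TERMS = ["mine drainage", "water quality", "drinking water"]

for word, term in kwic_index(TERMS):
    print(f"{word:10} {term}")
```

With this listing, someone scanning the entries under "water" finds both water quality and drinking water, even though only one of them begins with that word.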