Editorial: Keywords and thesauri

A user outside USGS asked:

Can you comment on the need to use a thesaurus and resources for appropriate thesauri?

Reply by Peter Schweitzer on 18 Feb 2011:

This has been an intriguing but elusive subject for most of the past 16 years, since the FGDC metadata standard came into being. Here are the basic issues:

Everybody agrees that when we label things consistently, it's easier to find them later.
Everybody agrees that there are different types of labels, corresponding to different questions:
- What topic does this data set describe?
- What geographic area do the data cover or occur in?
- What time period do the data represent?

Beyond these simple ideas, agreement begins to fall away. Specifically, people do not agree on how consistent labels are best used. Most people assume that type-in-the-box search is the only way anybody finds anything. I believe this assumption arises from the limited perspective offered by the most common web search engines, and that people remember only the times that they typed something into Google and got what they expected right away. They forget all the times (or blame themselves) when they couldn't figure out what to type because they didn't know what to call the things they were looking for, or they simply didn't have the knowledge they needed to begin to address, systematically, the problem they were faced with. Consequently, most of the discussion about consistent keywords is fraught with anxiety--we know we want them, but we don't see how, in practical terms, they actually help, because we're thinking through the tunnel of the search interface.

The solution, in my opinion, is to back away from the problem and ask some more fundamental questions. What do people do when they don't actually have a well-formulated statement of their own needs for information? How do people find something that they don't know the name of? How do people find information about a physical or biological phenomenon when they don't know where it occurs? In the Google world, what actually happens, I believe, is that people do much more thinking, reading, trial-and-error thrashing and floundering on the web than they want to admit to.

One of my own experiences serves as a good example. My car is an Isuzu Rodeo, about 10 years old. The rear window opens upward from the tailgate, which swings outward to the left as you stand behind the car. A few years ago I noticed that the window was falling down, especially on cold nights, because the little black things that hold the window up had lost their holding power. Being a person who likes to fix things myself if I can, I thought maybe I could replace those things that hold the window up. So I go to the web. What am I looking for? I don't know what they're called. At the Google search page, I type "window lifter" or "car window lifter" or something like that. You can readily imagine this doesn't get me anything useful. I try a variety of terms without success. Finally I go to some online auto parts store's web site where they have a functional, faceted classification of parts. There I navigate a series of pages to discover the term "tailgate strut", which is what the auto parts manufacturers call this thing. Now if I had started out typing "strut" into Google, I would have drowned in suspension parts, not what I wanted. "tailgate strut" works, but who knew that term? Not me. The bottom line is that people need interfaces that help them learn the concepts and the terminology that are used by people who solve the problems they have.

Type-in-the-box search is like a big game of "go fish". But "go fish" is a playable game because there are only 13 different card ranks in the deck. On the web, and in life, there are millions of different subjects one might be interested in, and many different ways of labeling them. And someone new to the data or the field will not generally know the terms that are commonly used. Who, without specific training in GIS, would know that the term "coverage" is loaded with special meaning?

An important qualification arises at this point. This problem doesn't generally occur to people as an issue of concern until they are dealing with lots of metadata records, either as a provider (that is, as a person charged with managing a collection of metadata) or as a potential user ("surely the data you want is in geodata.gov somewhere--just look there"). For most metadata writers, creating and editing one or a few records, this isn't a compelling problem. Yet the standard tantalizes them with a place to indicate the thesaurus of terms they use, and guidelines admonish them to use controlled vocabularies where possible, while at the same time there isn't a clear way to see how any of that helps when you have only one or a few records.

In library science, there's a concept called the reference interview. A patron enters the library and goes to the reference desk, saying something, perhaps in the form of a question, to the reference librarian. The reference librarian listens to the question, thinks about the terms, the collection of resources available, and then restates the question back to the patron. The patron replies by agreeing or by refining the new statement, the librarian asks additional questions to clarify or specify, and in the end the patron has a better statement of the problem, one that uses terms that are commonly used in the field of interest and that matches resources available to the reference librarian. It works.

On the web, we don't often have a person available to help us like this. But I believe we often go through a similar process by reading what we do find and refining our question through learning. We're learning terms that represent concepts and relationships among the concepts.

So in order to exploit controlled keywords (some advocates now like to refer to these as "shared" rather than "controlled" vocabularies), you need three things: a reasonably large collection of documents, large enough that you can't get what you need just by scanning them visually; a set of controlled terms applied to those documents; and a software system that shows potential users the vocabulary--terms and relationships--along with resources matching the terms. What you get from this is a web-based system that emulates the reference interview a little better than the hunt-and-peck process facilitated by text-matching search engines.

So what does it look like?

In USGS I've tried to build two systems that use controlled vocabularies effectively:

USGS Science Topics: This is a catalog of web-accessible scientific information resources, describing mostly web sites and a few publications. Its purpose is to give end-users some (I chose this word carefully) resources that are relevant to topics they select, and to arm the users with enough contextual information that they can select the right topics. The catalog records themselves are very thin metadata records stored in a relational database, containing citation information and keywords.
Mineral Resources Online Data Catalog: This is a catalog of scientific data resources emphasizing earth science information relevant to mineral resource studies, a broad commission that includes geophysics, geochemistry, geologic maps, and mineral deposit locations and characteristics. Most of the resources are publications and thus have well-defined authorship, so in addition to the scientific topic categories, one can find related information by the authors, with some interesting results (see for example https://mrdata.usgs.gov/catalog/author.php?author=Kucks).

In each case, the primary user interface is a term browse page like https://www2.usgs.gov/science/science.php?term=810 in which the term you're on (Ocean characteristics) is clarified with a short scope note, "The attributes and process of seas and oceans.", and related to other concepts that are broader ("Earth characteristics"), narrower ("Ocean temperature"), or related but not the same type of thing ("Marine geology"). Everything above the blue line is intended to help you decide whether this is the subject that you want, and if not, to help you decide where else to look, more specifically, more generally, or tangentially. Below the blue line are resources that we have that are relevant to the topic you're on. In each case, you see a title linking to the resource, with a short description below it (not an abstract--the only purpose of that short description is to help you decide whether or not you want to click on the link). For digging a bit deeper, a link "[More info]" appears next to the primary link; this takes you to the catalog record itself, with the option to see the record as formal metadata as well.
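A minimal sketch of such a term browse page, in Python, may make the layout easier to picture. The term labels and scope note are the ones quoted above; the class names, the function render_term_page, and the sample resource with its URL are hypothetical and are not taken from the USGS software.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Term:
    """One vocabulary term with its scope note and relationships."""
    label: str
    scope_note: str = ""
    broader: List[str] = field(default_factory=list)
    narrower: List[str] = field(default_factory=list)
    related: List[str] = field(default_factory=list)


@dataclass
class Resource:
    """A thin catalog record: title, short description, link."""
    title: str
    blurb: str
    url: str


def render_term_page(term: Term, resources: List[Resource]) -> str:
    """Return a plain-text browse page: context above the rule, resources below."""
    lines = [
        term.label,
        term.scope_note,
        "Broader: " + ", ".join(term.broader),
        "Narrower: " + ", ".join(term.narrower),
        "Related: " + ", ".join(term.related),
        "-" * 40,  # stands in for the blue line on the web page
    ]
    for res in resources:
        lines.append(f"{res.title} <{res.url}>")
        lines.append(f"  {res.blurb}")
    return "\n".join(lines)


ocean = Term(
    label="Ocean characteristics",
    scope_note="The attributes and process of seas and oceans.",
    broader=["Earth characteristics"],
    narrower=["Ocean temperature"],
    related=["Marine geology"],
)

# Hypothetical resource, invented for illustration only.
sample = Resource(
    title="Example ocean data set",
    blurb="Placeholder description to help decide whether to click.",
    url="https://example.org/ocean",
)

page = render_term_page(ocean, [sample])
print(page)
```

The point of the sketch is the ordering: the context that helps a reader choose a term comes first, and the matching resources come only after it.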

It should be noted that this is not a classification. Any resource will show up under whatever categories it has been assigned, not just one. Moreover, because our primary vocabulary was constructed carefully to preserve meaningful hierarchical relationships, a resource will appear under any term broader than the term that was assigned to it, so the interface tends to collect resources upward and reduce them downward in the term hierarchy. This is often called "drill-down", implying that you see fewer, more specific things as you choose narrower topics.
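The upward collection described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three term labels come from the browse page example, while the BROADER and ASSIGNED tables, the two resource titles, and the function names are hypothetical.

```python
# Each term maps to its broader terms (hypothetical fragment of a hierarchy).
BROADER = {
    "Ocean temperature": ["Ocean characteristics"],
    "Ocean characteristics": ["Earth characteristics"],
    "Earth characteristics": [],
}

# Each resource maps to the terms assigned to it (invented titles).
ASSIGNED = {
    "Sea surface temperature maps (hypothetical)": ["Ocean temperature"],
    "Ocean currents overview (hypothetical)": ["Ocean characteristics"],
}


def ancestors(term, broader=BROADER):
    """Return every term broader than `term`, at any distance."""
    seen = set()
    stack = list(broader.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, []))
    return seen


def resources_under(term, assigned=ASSIGNED):
    """Resources tagged with `term` itself or with any narrower term."""
    hits = []
    for title, terms in assigned.items():
        if any(t == term or term in ancestors(t) for t in terms):
            hits.append(title)
    return sorted(hits)


for label in ("Earth characteristics", "Ocean characteristics", "Ocean temperature"):
    print(label, "->", resources_under(label))
```

Moving from the broadest term to the narrowest, the list shrinks from two resources to one, which is the drill-down behavior.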

One of the things that the Science Topics catalog accomplished for our organization was to enable a wide variety of scientific resources to be presented in a web space that is frequently subject to contention--as Steve Krug writes, everybody wants a piece of the organization's home page. And it accomplished this without making end-users learn the organizational structure of the USGS; normally web development tends to produce sites that focus on the work each organizational unit does.

Back to metadata generally

I think that for many people it isn't going to be practical to fully exploit these ideas in their own organizations or web sites, because the collection of information they manage is too small or their ability to make the descriptions (metadata) consistent is limited. So I don't think it helps metadata writers to worry too much about the keywords. It's not a bad thing to put some in, and to look for relevant thesauri to use. But the choices you make matter most when the collection you manage is large enough to benefit from some categorical organization.

Thesauri

A lot of people have never heard of formal thesauri in the sense that term is used by librarians; they assume that the word "thesaurus" means Roget's list of synonyms. Formal thesauri are more interesting because they include hierarchical relationships as well as "use-for" terms. We developed a formal thesaurus that is broad and shallow and focuses on natural sciences yet includes both geological and biological concepts. It can be browsed by itself at https://www2.usgs.gov/science/about/tab-term.html. We also use the Alexandria Digital Library's Feature Type thesaurus and a thesaurus I built from a few common types of geographic areas (countries, states, counties, map quadrangles at three different scales, hydrologic units at three levels of detail, all cross-linked).
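As a rough illustration of what "use-for" terms do, the Python sketch below maps the words a newcomer might type onto the preferred term. The vocabulary is invented, borrowing the tailgate strut story from earlier in this editorial; the USE_FOR table and the function names are hypothetical and do not come from any real thesaurus.

```python
# Preferred term -> non-preferred entry terms it should be "used for".
# Invented entries, for illustration only.
USE_FOR = {
    "tailgate strut": ["window lifter", "liftgate support"],
}


def build_entry_index(use_for):
    """Map every entry term (preferred or not) to its preferred term."""
    index = {}
    for preferred, variants in use_for.items():
        index[preferred.lower()] = preferred
        for variant in variants:
            index[variant.lower()] = preferred
    return index


def resolve(query, index):
    """Return the preferred term for a query, or None if it is unknown."""
    return index.get(query.strip().lower())


INDEX = build_entry_index(USE_FOR)
print(resolve("Window Lifter", INDEX))
print(resolve("strut", INDEX))
```

The first lookup leads to the preferred term; the second finds nothing, because this sketch matches whole entry terms only and does no partial matching.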