How was the data set created?

  1. From what previous works were the data drawn?
  2. How were the data generated, processed, and modified?
    • Does it matter when these modifications were made?
    • Did someone other than the formal authors do the processing?
  3. What similar or related data should the user be aware of?

Describing data gathering and processing

The overall genesis of a data set is described as a series of process steps, each of which uses previously created data sources and may build new ones. In the diagram below, data sources A and B were used in process step 1 to create data source C. Data sources C and D were then used in process step 2 to create the subject data set. The Lineage in the metadata for the subject data set therefore contains four Source_Information elements (A, B, C, and D) and two Process_Step elements.
[Figure: Process step diagram]
So the metadata for the subject data set would contain the following elements:
Data_Quality_Information:
  (other data quality elements ...)
  Lineage:
    Source_Information:
      Source_Citation:
        Citation_Information:
          (more details about the source's citation ...)
      Source_Citation_Abbreviation: A
      (more details about the source ...)
    Source_Information:
      Source_Citation:
        Citation_Information:
          (more details about the source's citation ...)
      Source_Citation_Abbreviation: B
      (more details about the source ...)
    Source_Information:
      Source_Citation:
        Citation_Information:
          (more details about the source's citation ...)
      Source_Citation_Abbreviation: C
      (more details about the source ...)
    Source_Information:
      Source_Citation:
        Citation_Information:
          (more details about the source's citation ...)
      Source_Citation_Abbreviation: D
      (more details about the source ...)
    Process_Step:
      Process_Description: Process 1 ...
      Source_Used_Citation_Abbreviation: A
      Source_Used_Citation_Abbreviation: B
      Source_Produced_Citation_Abbreviation: C
      Process_Date: 1998
    Process_Step:
      Process_Description: Process 2 ...
      Source_Used_Citation_Abbreviation: C
      Source_Used_Citation_Abbreviation: D
      Process_Date: 1998
Note that Source C is produced in one process step and used in another. Process step 2 has no Source_Produced_Citation_Abbreviation because its output is the subject data set itself. Note also that the elements Source_Used_Citation_Abbreviation and Source_Produced_Citation_Abbreviation appear only in a Process_Step, while the element Source_Citation_Abbreviation appears only in a Source_Information.
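
The relationship between the abbreviations can be checked mechanically. The following is a minimal Python sketch, not part of any FGDC tool: the `ProcessStep` class and the `undeclared_abbreviations` function are hypothetical names introduced here for illustration. It assumes that every abbreviation used or produced by a Process_Step should match the Source_Citation_Abbreviation of some Source_Information element, and it reports any that do not.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessStep:
    """Hypothetical stand-in for one Process_Step element."""
    description: str
    # Values of Source_Used_Citation_Abbreviation
    sources_used: List[str] = field(default_factory=list)
    # Values of Source_Produced_Citation_Abbreviation
    sources_produced: List[str] = field(default_factory=list)


def undeclared_abbreviations(source_abbreviations, process_steps):
    """Return abbreviations that process steps reference but that no
    Source_Information element declares, in sorted order."""
    declared = set(source_abbreviations)
    referenced = set()
    for step in process_steps:
        referenced.update(step.sources_used)
        referenced.update(step.sources_produced)
    return sorted(referenced - declared)


# The lineage from the example above: four sources and two process steps.
sources = ["A", "B", "C", "D"]
steps = [
    ProcessStep("Process 1 ...", sources_used=["A", "B"], sources_produced=["C"]),
    ProcessStep("Process 2 ...", sources_used=["C", "D"]),
]

print(undeclared_abbreviations(sources, steps))
```

For the example lineage the function returns an empty list, since all four abbreviations are declared. If the Source_Information for C were left out, the same call would return a list containing only C, flagging that process steps 1 and 2 refer to a source the metadata never describes.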
