Editorial: How to convert mp's HTML file back to parseable text

A user outside USGS asked:
I wish to import metadata into ArcGIS using the metadata importer, however, the metadata is in the HTML output from 'mp'. Is there any way to convert this html output back to parsed text?

Reply by Peter Schweitzer on 29 August 2003:

This situation sometimes occurs when someone has already created the metadata and run it through mp to generate HTML but has then discarded, lost, or simply not passed on to you the original text file. There are a few different ways to cope with this. The cleanest, and the one most likely to produce good results, is also the strangest, because it involves running a very small specialized program and then using cns with a config file. It's important to note that this procedure works only on HTML written just the way mp writes it. I see that it also works on metadata exported from ArcCatalog 8.3 using the stylesheet FGDC CSDGM (HTML). It won't work on the FAQ-style HTML.

  1. Save a copy of the HTML file (I'll call it catfish.html).
  2. Open catfish.html with Notepad. You'll see HTML code.
  3. Delete everything before the first <dl> tag.
  4. Move to the end of the file.
  5. Delete everything after the last </dl> tag.
  6. Save the file as catfish.txt.
  7. Get this file: https://geology.usgs.gov/tools/metadata/tools/bin/detag.exe and save it in the directory where you stored catfish.txt.
  8. Open a command-prompt window, and change to that directory using the cd command. In the examples below I'm pretending that this directory is c:\data.
  9. That strange little program is called detag. Run it:
    c:\data> detag catfish.txt
  10. The output is in catfish.txt.out. You can use Notepad to compare this to catfish.txt, but don't make any changes to it.

    Mostly what detag does is remove the HTML markup tags. But it also finds the element names in the metadata and puts @@ before each one to make it easier for cns to recognize them. So the next task is to tell cns that it should look for those @@ markings.

  11. Open Notepad and create a new file containing only these two lines:
        input
            prefix @@
    
    Save this file in the same directory using a name like fix.cfg.
  12. Now run cns:
    c:\data> cns -c fix.cfg catfish.txt.out -o catfish.met -e leftover.txt
    
  13. Using Notepad, inspect the file leftover.txt. It should be empty. If it isn't, then the situation is more complicated and you need to understand the more general function of cns. Consult a tutorial on cns or perhaps contact someone who understands cns for help.
  14. With no leftovers, the output catfish.met should be reliably parseable. Check this by running mp on it:
    c:\data> mp catfish.met -e catfish.err
    
    This puts error messages into catfish.err. Using Notepad, look at catfish.err. The last line in this file should be the summary of errors. It should not show any unrecognized or misplaced element errors. If it does, then the metadata will probably not import properly into ArcCatalog.
  15. If all went well, you can now import catfish.met into ArcCatalog, and you can delete
        catfish.err
        catfish.txt
        catfish.txt.out
        leftover.txt
    
    If you have any other files to process in this way, keep detag.exe and fix.cfg; otherwise you can delete these too.
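
If you have many files to process, the hand editing in steps 3 through 6 can be scripted. What follows is only a minimal sketch in Python, not part of mp, cns, or detag. The function names trim_to_outer_dl and pipeline_commands are made up for illustration. The first keeps everything from the first <dl> tag through the last </dl> tag; the second only lists the three command lines from steps 9, 12, and 14 for a given file name and does not run them.

```python
import re


def trim_to_outer_dl(html):
    """Return the text from the first <dl> tag through the last </dl> tag.

    This mirrors steps 3-5: delete everything before the first <dl>
    and everything after the last </dl>.
    """
    first = re.search(r"<dl[\s>]", html, re.IGNORECASE)
    closes = list(re.finditer(r"</dl\s*>", html, re.IGNORECASE))
    if first is None or not closes or closes[-1].end() <= first.start():
        raise ValueError("no <dl> ... </dl> block found")
    return html[first.start():closes[-1].end()]


def pipeline_commands(stem):
    """List the command lines from steps 9, 12, and 14 for one file.

    'stem' is the file name without extension, e.g. 'catfish'.
    The commands are returned as argument lists; nothing is executed.
    fix.cfg is assumed to exist already (step 11).
    """
    txt = stem + ".txt"
    met = stem + ".met"
    return [
        ["detag", txt],
        ["cns", "-c", "fix.cfg", txt + ".out", "-o", met, "-e", "leftover.txt"],
        ["mp", met, "-e", stem + ".err"],
    ]
```

To use it, read catfish.html, pass its contents to trim_to_outer_dl, and write the result to catfish.txt. You still need to inspect leftover.txt and catfish.err yourself, as described in steps 13 and 14.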

It is also possible to convert the file by saving the HTML as plain text and running cns on that. But you'll have to understand how cns works, and you'll need to check the input and output of cns carefully to clean up any confusing elements.
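
As a rough illustration of what "saving the HTML as plain text" amounts to, here is a sketch in Python that drops the markup and indents each line according to how deeply it is nested in <dl> lists. It is an approximation of what a web browser does when it saves a page as text, not a description of any particular browser, and the class and function names are invented for this example. The output still has to go through cns and be checked by hand as described above.

```python
from html.parser import HTMLParser


class DlTextExtractor(HTMLParser):
    """Collect text from HTML, indenting by <dl> nesting depth."""

    def __init__(self, indent="  "):
        super().__init__()
        self.indent = indent
        self.depth = 0      # number of <dl> lists currently open
        self.lines = []     # finished output lines
        self._buf = []      # text pieces of the line being built

    def _flush(self):
        # Collapse runs of whitespace and emit the pending line, if any.
        text = " ".join("".join(self._buf).split())
        if text:
            self.lines.append(self.indent * max(self.depth - 1, 0) + text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dl", "dt", "dd", "br", "p"):
            self._flush()           # these tags start a new line
        if tag == "dl":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("dl", "dt", "dd", "p"):
            self._flush()
        if tag == "dl":
            self.depth = max(self.depth - 1, 0)

    def handle_data(self, data):
        self._buf.append(data)


def html_to_indented_text(html):
    """Return the text content of html, one indented line per item."""
    parser = DlTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()                 # emit any trailing text
    return "\n".join(parser.lines)
```

The indentation matters because, without the @@ markers that detag adds, the nesting of the lines is one of the few clues left about which element contains which.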

Still another approach is to open a metadata editor such as ArcCatalog or Tkme and carefully copy the metadata elements from the web browser to the appropriate fields in the metadata editor. While this affords you the opportunity to examine the metadata in great detail, it is also the slowest method and the one most fraught with the possibility of error.