National Center for Ecological Analysis and Synthesis

Aaron Ellison's essay is a thoughtful examination of challenges and opportunities for archiving ecological data. I applaud his strong community infrastructure perspective. Ecologists need to engineer and sustain a community informatics architecture for describing, indexing, curating and accessing data sets. Not only will expensive and potentially reusable information be harnessed, but the ecological community as a whole will benefit from greater visibility in an increasingly networked world where data sets, models, software, simulations, analysis and visualization engines, abstracts, reviews, and discussion groups are joining research publications in a multidimensional matrix of hyperlinked intellectual products.

As Ellison emphasizes, metadata are the key to describing, archiving and retrieving data sets. Standards for description of the contents of ecological data files are needed to make metadata records consistent and meaningful. In addition to other useful scientific functions, metadata solve several problems in the area of information discovery in a networked environment. For example, they eliminate the problem of facing 20,000 hits from a Yahoo or AltaVista query by making searches more precise and structured.
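As a rough illustration of the point about search precision, the sketch below compares a free-text query with a query against structured metadata fields. It assumes a made-up record layout: the field names (`taxa`, `keywords`, `start_year`, `end_year`), the sample records, and both search functions are hypothetical and do not represent any existing metadata standard.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record for one ecological data set."""
    title: str
    abstract: str
    start_year: int
    end_year: int
    taxa: set = field(default_factory=set)
    keywords: set = field(default_factory=set)


def free_text_search(records, term):
    """Return every record whose title or abstract contains the term."""
    term = term.lower()
    return [r for r in records
            if term in r.title.lower() or term in r.abstract.lower()]


def structured_search(records, taxon=None, keyword=None, year=None):
    """Return records matching every metadata field that was supplied."""
    hits = []
    for r in records:
        if taxon is not None and taxon not in r.taxa:
            continue
        if keyword is not None and keyword not in r.keywords:
            continue
        if year is not None and not (r.start_year <= year <= r.end_year):
            continue
        hits.append(r)
    return hits


# Invented sample records, for illustration only.
RECORDS = [
    DatasetRecord("Oak seedling survival",
                  "Survival of Quercus seedlings under canopy gaps.",
                  1990, 1995, {"Quercus"}, {"seedling survival"}),
    DatasetRecord("Ant diversity on oak savanna",
                  "Pitfall trap inventory of ants in savanna plots.",
                  1996, 1997, {"Formicidae"}, {"species inventory"}),
    DatasetRecord("Software notes",
                  "Notes on the OAK statistics package.",
                  1998, 1998, set(), {"software"}),
]

broad = free_text_search(RECORDS, "oak")
precise = structured_search(RECORDS, taxon="Quercus", year=1992)
print(len(broad), [r.title for r in precise])
```

In this toy collection the free-text query for "oak" matches all three records, including one about ants and one about software, while the fielded query returns only the oak seedling study. The same effect, scaled up, is what separates a structured metadata search from thousands of undifferentiated hits.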

Many data sets age and decompose on obsolescent media. In my office, I have research data archived on rolls of teletype paper tape, 80-column punch card decks, 9-track tapes, paper printouts, 3.5" and 5.25" floppies, and 3.5" magneto-optical disk cartridges, as well as in collectors' notebooks, file drawers of data sheets, reprints, and a few hundred little yellow-orange boxes of color transparencies. New classes of storage media always lurk around the corner -- lately the generation time of permanent storage media feels like it is approaching 2-3 years. Merchandisers are now gleefully preparing to convince you to upgrade your PC to 5GB DVD disks, while the long-term archival community throws up its hands about the permanence of any digital medium.

Yet, I do not think that the lack of metadata standards, accelerating changes in storage media, or even the absence of an archive office are responsible for the 50 years the ecological community has punted on the issue of community data archives. The basis for this procrastination is probably more personal. Once we publish research papers, we do not like to go back and dredge data sets from the obscurity and safe haven of the lab and prepare them to be paraded publicly.

In some cases this is probably just a consequence of a lack of time in pressing schedules. Or it might be that there is little bliss in resurrecting studies and data that once deserved our finest intellectual effort, but which now seem dated or passé after a research paradigm has moved on. For ecological paradigm pioneers who sally into the intellectual wilderness, going back to spent data is as stimulating as decaffeinated camp coffee (in Santa Barbara that would be a double decaf latté). Most ecologists are not data hoarders or particularly proprietary about their data -- they just have little incentive to take on the burden of putting yesterday's data from yesterday's paradigm into a public data space where it can be freely dissected.

That's where the coercion comes in. The ESA and other ecological societies should make the submission of data sets a prerequisite to the review of research publications. This is not a new idea and not as heavy-handed as it sounds. As Ellison points out, it is a strategy that has proven successful in other communities. Beyond the obvious advantage of making data accessible, preparing data sets for access at the time of publication would have additional benefits. One is that the timing would obviate the problems associated with old data, because all of the data sets would be current. Another is that authoring metadata would require researchers to clearly define their research protocols, semantics and data handling practices. That would undoubtedly lead to discussions about data collection and analysis techniques -- and that would mean better science.

A metadata-based ecological community data archive is a necessary step. No reasonable person would object to archives of data of a monitoring or time-series nature, such as those from long-term climate observations, periodic assessments of population sizes or from species diversity inventories. The utility of historical baseline data is intuitive to even the least scientific segments of society and archives of such information are frequently maintained as a public good by government agencies or through long-term research funding.

However, in a broader context of what lies ahead for ecology in a networked world of essentially instantaneous access to richly documented and linked intellectual products, it would be useful to focus community informatics efforts on other ubiquitous and more personal classes of data. For lack of a better label, we might call them career data sets. By that I mean the entire data stream that an ecologist generates during the course of a career -- from student projects, to dissertation work, faculty research, books, and synthetic reviews. This corpus would include data that are the basis of research reports as well as those from unpublished negative results. It would include research observations with a time course of a few hours, days, weeks or field seasons, and data from preliminary studies of the student kind (e.g., field projects like those that are the signature of Organization for Tropical Studies courses). It would also include qualitative descriptions, such as collectors' field notebooks, as well as non-text data like images and sound.

The ecological community would greatly benefit from simple, effective and easy-to-use software tools aimed at researchers and students for authoring metadata descriptions. The best strategy might be to centralize coordination for standards and tool development but personalize the actual curation of metadata. Aggregation of data into a single, but perhaps distributed, archive makes sense from an administrative and archival perspective, but the intellectual cataloging tasks might best stay with the data authors as they advance through their careers. Corrections or updates to data sets, metadata authoring, maintenance of links to reviews and follow-on studies, and new terminology for indexing or description are all examples of functions best handled by researchers themselves. Without software tools for collaborative curation of a community data archive to enable crosslinking to new research findings, data sets will fossilize by becoming less relevant.

Finally, it will be difficult to justify the cost of the informatics infrastructure needed to accomplish this mission if data sets and their metadata are not maintained as part of the knowledge stream. Compare the budget of your university library to that of the university archive. The significant difference lies not in the breadth of the two kinds of collections, but in their contrasting objectives: circulation and engagement on the one hand, preservation and protection on the other.