Data Science Tools
Good data stewardship leads to better science.
Our data science tools support researchers, data managers, and computing infrastructure developers in their management and analysis of ecological and environmental data.
Because we value and promote open science, our tools are free to use or download, typically run on several popular computing platforms, and use best-in-class open source software frameworks when appropriate.
General Environmental Data
DataONE
Search for environmental data within the federation of DataONE, an international network of environmental data repositories.
KNB Data Repository
Access thousands of environmental datasets through the Knowledge Network for Biocomplexity (KNB), a national network that facilitates ecological and environmental research.
Topic-Specific Data
Arctic Data Center
A data and software repository for Arctic research, especially that associated with the National Science Foundation's Polar Program.
Botanical Information and Ecology Network (BIEN)
Datasets and cyberinfrastructure for botanical research across North and South America.
Global Population Dynamics Database
An extensive collection of time series data from plant and animal populations.
Interaction Web Database
Data concerning ecological interactions, particularly pollination/pollinator relationships.
Paleobiology Database
Fossil information that includes 52,000 collection records and 511,889 taxonomic occurrences from 13,962 published references.
VegBank
The vegetation plot database of the Ecological Society of America's Panel on Vegetation Classification.
Site-Specific Data
GulfWatch Alaska
Datasets from 25 years of research following the Exxon Valdez oil spill in Prince William Sound, Alaska.
OBFS Data Registry (Organization of Biological Field Stations)
The primary source for comprehensive information about scientific and research datasets collected within or under the auspices of the Organization of Biological Field Stations.
SANParks Data Repository (South African National Parks)
The primary source for comprehensive information about scientific and research datasets collected throughout the South African National Parks system.
UC Natural Reserve System Data Registry
The primary source for comprehensive information about scientific and research datasets collected within or under the auspices of the University of California's Natural Reserve System (NRS).
Community Dynamics Metrics (CODYN)
An R package for analyzing long-term ecological community datasets.
DataONE R Client
This tool provides seamless access, from within the R statistical computing environment, to data and metadata held in the DataONE federation of data repositories. Researchers can read data from any DataONE Member repository using the dataset's globally unique identifier, so R scripts that load the data remain portable across computers. Derived data can also be documented and uploaded to the KNB and other DataONE repositories that support data upload.
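The identifier-based access described above can be illustrated outside of R. The following Python sketch is not the R client; it only shows why a globally unique identifier makes a script portable: the same identifier always maps to the same lookup URL, whatever machine runs the script. It assumes that the DataONE coordinating node exposes a REST endpoint of the form `https://cn.dataone.org/cn/v2/resolve/<identifier>`; that base URL, the helper name `resolve_url`, and the example identifier are assumptions made for illustration.

```python
from urllib.parse import quote

# Assumed base URL of the DataONE coordinating node REST API
# (an assumption for this sketch, not stated on this page).
CN_BASE = "https://cn.dataone.org/cn/v2"


def resolve_url(pid: str, base: str = CN_BASE) -> str:
    """Return the URL that asks the coordinating node where `pid` is held.

    The identifier is percent-encoded in full (including '/' and ':'),
    so identifiers such as DOIs stay a single path segment.
    """
    return f"{base}/resolve/{quote(pid, safe='')}"


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    print(resolve_url("doi:10.5063/EXAMPLE"))
```

Because the URL depends only on the identifier, no local file path appears anywhere in the script. The sketch builds the URL but does not contact the network.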
Kepler
This reliable, open-source scientific workflow system enables scientists to design workflows and execute them efficiently. Researchers can combine analysis and modeling steps that use a wide variety of computing engines, such as R, MATLAB, and Python. Kepler facilitates access to a broad range of ecologically relevant data that are housed in the KNB (Knowledge Network for Biocomplexity), while also providing a basis for sharing analyses through a library of executable components and workflows.
R Packages
The following tools can all be accessed via GitHub.
arcticdatautils
Utility functions in R for processing data for the Arctic Data Center.
datapack
The datapack R package provides an abstraction for collating heterogeneous collections of data objects and metadata into a bundle that can be transported and loaded as a single composite file. The methods in this package provide a convenient way to load data from common repositories such as DataONE into the R environment, and to document, serialize, and save data from R to data repositories worldwide.
datamgmt
The datamgmt R package supports management of data packages on the Arctic Data Center and State of Alaska's Salmon and People (SASAP) data portals.
dataone
Provides read and write access to data and metadata from the DataONE network of data repositories, including the KNB Data Repository, Dryad, and the NSF Arctic Data Center.
recordr
Provides an automated way to capture data provenance for R scripts and console commands without the need to modify existing R code.
Metadata
Morpho
Create, manage, and share Ecological Metadata Language (EML) documents and their associated data. This is an easy-to-use, cross-platform application for accessing and manipulating metadata and data locally and on the network through powerful connections with Metacat.
To enable open science, we make our data infrastructure available for use by other computing infrastructure developers.
Repository Development
Metacat
Store and version data and associated metadata using a wide variety of standards. Specialized features guarantee local autonomy and access control, while also affording the possibility of broad-scale replication and information sharing as a Member Node in the DataONE network. Metacat servers are used as the basis of the KNB and DataONE networks, as well as many other repositories, including PISCO, GulfWatch Alaska, Taiwan Ecological Research Network, and others.
Metacat UI
An open-source user interface for data repositories, currently used for the KNB repository, Arctic Data Center, DataONE federation, and other repositories.
Metadata Specifications
Ecological Metadata Language (EML)
A metadata specification for describing tabular (relational) datasets that are common in ecology and the environmental sciences. EML can be used in a modular and extensible manner to document ecological data, including a description of the purpose and contents of a dataset, methods used to collect it, people responsible for the data, and details of how to interpret data tables properly.
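To make the modular structure described above concrete, the following Python sketch assembles a small EML-style record with one module each for the dataset title, the responsible person, the collection method, and the table columns. It is a simplified illustration, not schema-valid EML: the element names follow common EML usage, but required elements, namespaces, and attributes are omitted, and the function name `build_metadata` and all sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_metadata(title, creator_surname, method_text, columns):
    """Build a simplified, EML-style metadata tree (not schema-valid EML).

    `columns` is a list of (column name, definition) pairs that describe
    how to interpret each column of the data table.
    """
    root = ET.Element("eml")
    dataset = ET.SubElement(root, "dataset")

    # Purpose and contents of the dataset
    ET.SubElement(dataset, "title").text = title

    # People responsible for the data
    creator = ET.SubElement(dataset, "creator")
    name = ET.SubElement(creator, "individualName")
    ET.SubElement(name, "surName").text = creator_surname

    # Methods used to collect the data (structure simplified)
    methods = ET.SubElement(dataset, "methods")
    step = ET.SubElement(methods, "methodStep")
    ET.SubElement(step, "description").text = method_text

    # Details needed to interpret the data table
    table = ET.SubElement(dataset, "dataTable")
    attributes = ET.SubElement(table, "attributeList")
    for col_name, definition in columns:
        attribute = ET.SubElement(attributes, "attribute")
        ET.SubElement(attribute, "attributeName").text = col_name
        ET.SubElement(attribute, "attributeDefinition").text = definition

    return root


if __name__ == "__main__":
    # Sample values are invented for illustration.
    record = build_metadata(
        "Example plot survey",
        "Doe",
        "Plots were surveyed once per season.",
        [("plot_id", "Identifier of the survey plot"),
         ("cover", "Percent vegetation cover")],
    )
    print(ET.tostring(record, encoding="unicode"))
```

Each part of the tree can be filled in or extended independently, which mirrors the modular use of EML described above.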
OBOE (Extensible Observation Ontology)
A semantic model designed to accurately describe observational data in sufficient detail to enable logic-based machine reasoning to help scientists with common research tasks, such as finding and merging datasets.