BIG DATA as an engine for aquatic information creation
Conservation needs will always outpace our ability to meet them, so maximizing returns on investments is crucial. Doing so requires good information to guide investment choices, and data are fundamental to creating that information. Good thing we live in a world that veritably drips with data (graphic 1). Everywhere you look, once you know how to look, it lurks, not at all unlike the classic scene in the Matrix when Neo first perceives Agent Smith’s true nature and the digital artifice that surrounds him (click here for the YouTube refresher). As described last time, eDNA will be an important contributor to, and accelerant of, that data drip (blog #72), as will easy-to-use protocols for collecting stream flow and temperature data (blog #60) in what Porter & colleagues call the “sensor data deluge” (study hyperlinked here). But before collecting yet more data, it’s also important to realize that we’re oftentimes sitting on mountains of the stuff that is seriously underutilized & could be mined to produce high-quality information at low cost. Collectively, there are 100s (1000s?) of natural resource agencies and organizations that collect data in, & about, streams and lakes globally, & we have spent many billions of US$ doing so. A significant impediment, though, is that much of those data are not in a usable form. In fact, it’s frequently the case that no one other than the few people involved in specific collection activities even knows the data exist, because they live in disparate file cabinets (the data, not the people) or hard drives outside of real databases. Heidorn calls these “dark data” (study hyperlinked here) and discusses both their great potential utility to science and the great risk of their being lost forever as we transition from the age of paper to a digital one.
We all know examples of, or have, dark data that could & should be brought into the light. Unfortunately, there are no easy ways of doing so other than rolling up our collective sleeves and diving in. In the case of stream temperature data in the western U.S., it’s taken our NorWeST temperature team the past 4 years to clean & organize data from >100 agencies into a functional database, & we still have 1 more year & several states to go. And that’s just one type of data in one part of the country, so there’s a huge amount of work yet to be done to get our legacy datasets up to snuff. But as different groups work their way through turning dark data into databases, it’s rather impressive to see what’s out there. Two prime aquatic examples are MARIS (Multistate Aquatic Resource Information System), which hosts >1,000,000 fish sample records for >1,000 species in 25 states; and MapIT (Mapping Application for Freshwater Invertebrate Taxa), which hosts >1,500,000 records at >15,000 sites nationally for >5,000 aquatic macroinvertebrate species (graphic 2). Also fair game to consider here are databases of environmental descriptors, like the national StreamCat database recently developed by EPA, or a similar precursor developed by the National Fish Habitat Partnership (graphic 3). Those descriptors are typically remotely sensed or calculated from digital elevation models (DEMs) and serve as very useful predictor variables when developing models to analyze or predict patterns in streams and lakes (a simple sketch of that kind of data join follows this paragraph). McManamay & Utz (study hyperlinked here) discuss other notable aquatic databases and how open access to large amounts of data presents both opportunities and challenges. Then there are also the seriously large databases like GBIF (Global Biodiversity Information Facility), which contains >600,000,000 species occurrence locations across all taxonomic groups globally, and the U.S.-centric version of that species database recently started by USGS through its BISON initiative (Biodiversity Information Serving Our Nation).
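To make that predictor-variable idea concrete, here’s a minimal sketch (in Python with pandas) of the join that turns a fish occurrence extract and an environmental descriptor extract into a model-ready table. The file names, column names, and shared reach identifier are hypothetical placeholders for illustration, not the actual MARIS or StreamCat schemas.

```python
# Minimal sketch: attach reach-scale environmental descriptors to fish
# occurrence records so they can serve as predictor variables in a model.
# File and column names below are hypothetical, not real database schemas.
import pandas as pd

occurrences = pd.read_csv("fish_occurrences.csv")   # e.g., site_id, reach_id, species, count
descriptors = pd.read_csv("reach_descriptors.csv")  # e.g., reach_id, elevation_m, mean_aug_temp_c, pct_forest

# One row per occurrence record, now carrying its reach's descriptors.
model_table = occurrences.merge(descriptors, on="reach_id", how="left")

# Quick check: what fraction of records picked up a full set of predictors?
print(model_table["elevation_m"].notna().mean())
```

From there, the merged table can feed whatever species distribution or abundance model a given question calls for.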
In addition to archiving disparate datasets for posterity, building high-quality databases provides significant, tangible value. In the case of NorWeST, it has taken $1,000,000 in salaries to maintain a database team for 4 years, but that team has created an open-access database from contributions by hundreds of people that would require $10,000,000 & decades of field work to replicate. The true value of any database, however, isn’t what it costs to collect the data, but rather that it allows data to be handled & summarized efficiently to provide useful information for decision making. A decision might be as simple as forgoing new data collections (& the associated costs) because the desired data already exist in the database. Or, if new data are collected, deciding where to collect them so as to avoid redundancy with existing data locations, which is a straightforward proposition once data sites are easily queried and mapped (graphic 4; a small sketch of that kind of screening appears below). But more exciting is that, as more databases are developed, it becomes possible to merge them in interesting ways and conduct novel analyses that yield previously impossible insights. As this process proceeds, we’ll increasingly find ourselves limited not so much by data as by the quality of the questions we ask of them. Poisot & colleagues (study hyperlinked here) argue that, if guided by sound hypotheses, such “synthetic datasets” will lead to rapid scientific advances and information creation (graphic 5). That dynamic couldn’t come at a better time given the era of rapid climate change we live in, and I’m confident that the collective creativity & ingenuity of fish people will find many ways to capitalize on the world of new opportunities opening to us.
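As a small illustration of that “query the sites first” logic, here’s a sketch that screens candidate sampling sites against sites already in a database and flags the near-duplicates. The column names and the 1 km redundancy threshold are assumptions for illustration only, not anything baked into NorWeST or the other databases mentioned above.

```python
# Minimal sketch: flag candidate sampling sites that sit close to sites
# already in a database, so new field effort isn't spent on near-duplicates.
# Column names and the 1 km threshold are illustrative assumptions.
import numpy as np
import pandas as pd

existing = pd.read_csv("existing_sites.csv")     # e.g., site_id, lat, lon
candidates = pd.read_csv("candidate_sites.csv")  # e.g., site_id, lat, lon

def nearest_existing_km(lat, lon, existing):
    """Great-circle (haversine) distance from a candidate to the closest existing site, in km."""
    r = 6371.0  # Earth radius, km
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(existing["lat"].values), np.radians(existing["lon"].values)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return (2 * r * np.arcsin(np.sqrt(a))).min()

candidates["nearest_km"] = [
    nearest_existing_km(row.lat, row.lon, existing) for row in candidates.itertuples()
]
# Anything within ~1 km of an existing site is flagged as redundant.
candidates["redundant"] = candidates["nearest_km"] < 1.0
print(candidates[["site_id", "nearest_km", "redundant"]])
```

Swap in whatever distance threshold, or definition of redundancy, fits the survey design at hand.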