Biodiversity needs better data archiving

Aug 18, 2021

Missing metadata — data that provides information about other data — might not sound like a big deal, but it’s a costly problem that’s hindering humanity’s plans to protect the planet’s biodiversity.

“The way I see it, it’s pretty simple,” said Rachel Toczydlowski, a postdoctoral researcher in Michigan State University’s College of Natural Science. “If we want to monitor and conserve global genetic diversity — the most fundamental level of biodiversity — we need to improve our data archiving practices ASAP.”

MSU postdoctoral researcher Rachel Toczydlowski.

Put another way, if humans want to be better stewards of their planet, they need to be better stewards of their data when cataloging organisms.

Toczydlowski is the lead author of a new study published August 16 in the journal Proceedings of the National Academy of Sciences, which features researchers from 14 institutions in three countries. The team audited the largest global repository for storing genetic sequence data to see if the entries included basic metadata needed to make them useful for monitoring genetic diversity. More than half of the datasets they examined were missing that metadata, such as when and where a sample was collected.

When properly archived, these datasets allow researchers to track genetic diversity, which is a barometer of how well organisms are equipped to deal with a changing planet.

“Just as an ecosystem can be made up of thousands of species, every individual plant or animal has thousands of genes in its genome that help it to adapt and survive in its unique environment,” Toczydlowski wrote with Eric Crandall, the senior author on the study and an assistant research professor of biology at Pennsylvania State University.

Organisms with lots of genetic diversity are, thus, very adaptable. Those lacking genetic diversity are more vulnerable to threats such as warming and drying environments, the appearance of an invasive species and poor health resulting from inbreeding. Genetic diversity affects the health of species, which in turn affects the health of ecosystems. Having diversity across all these levels is critical for a healthy planet, Toczydlowski said.

Genetic diversity is invisible to the naked eye, unlike the diversity seen in the number of different species swimming in this photo of a coral reef, but monitoring it can help protect the planet’s biodiversity. Credit: Hiroko Yoshii/Unsplash

Researchers therefore want to know how much genetic diversity is in a given place at a given time to understand the health of those organisms and their environment. Tracking changes in genetic diversity over time would also let ecologists forecast how ecosystems will fare in the future and prepare accordingly. Conservationists, for example, could use the information to determine which organisms would be best suited to launch successful restoration efforts in disrupted ecosystems. But that goal can be met only if the available data are complete.

To get an idea of how much metadata was missing, the team surveyed thousands of datasets from the International Nucleotide Sequence Database Collaboration, the largest data repository of its kind, representing more than 325,000 individual organisms from nearly 17,000 different species. The researchers found that 86% of these samples were missing important metadata.
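For readers curious what such an audit might look like in practice, here is a minimal sketch in Python. It is not the team's actual pipeline: the field names `collection_date` and `lat_lon`, the list of placeholder strings treated as missing, and the sample records are all assumptions made for illustration. The idea is simply to count how many records lack a usable collection date or location.

```python
# Placeholder strings treated as "no value". This list is an assumption,
# not a rule taken from the study or from any repository's documentation.
MISSING_TOKENS = {"", "missing", "not collected", "not applicable", "na", "n/a"}


def has_value(record, field):
    """Return True if the record holds a usable value for the field."""
    value = record.get(field)
    if value is None:
        return False
    return str(value).strip().lower() not in MISSING_TOKENS


def audit(records, required=("collection_date", "lat_lon")):
    """Return the fraction of records missing at least one required field."""
    if not records:
        return 0.0
    incomplete = sum(
        1 for r in records if not all(has_value(r, f) for f in required)
    )
    return incomplete / len(records)


# Made-up example records, not data from the study.
samples = [
    {"organism": "A", "collection_date": "2015-06-01", "lat_lon": "42.7 N 84.5 W"},
    {"organism": "B", "collection_date": "missing", "lat_lon": "42.7 N 84.5 W"},
    {"organism": "C", "collection_date": "2018", "lat_lon": None},
    {"organism": "D"},
]

print(f"{audit(samples):.0%} of samples lack a date or location")
```

A real audit would likely need more care than this, for example checking that dates and coordinates are well formed rather than merely present.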

“Researchers spend incredible amounts of time and money to generate genomic sequence data, and these data can provide novel insights into basically every field of biology, from conservation to ecology to behavior to evolution,” said Gideon Bradburd, an assistant professor in the Department of Integrative Biology and a co-author on the study. Toczydlowski works in Bradburd’s lab, which is also part of MSU’s Ecology, Evolution and Behavior Program.

“But, if the context of the data — the location and time at which individuals are sampled — is dissociated from these genetic resources, they become much less useful,” Bradburd said. “Especially for conservation monitoring.”