In the era of Big Data, biological data is being created at unprecedented volume, complexity, and speed. Yet scientific discovery still depends heavily on individual intuition. Biologists, once overwhelmed by the sheer complexity of the biological phenomena they seek to explain, now juggle vast volumes of (often unstandardized) data. Further, all data needs context, provided in a way that supports computational biologists rather than burdening us.
The context here is the metadata that accompanies every omics dataset. Metadata is often described as data about data, or information about information. It is frequently overlooked when it comes to standardization, despite offering significant benefits for understanding biological data in new ways.
Making Life for Fellow Biologists Easier
While conducting comparative omics studies, I have often wasted considerable time untangling discrepancies, not only in the type of biological information at hand but also on the technical front: the variety of file types and the missing information in metadata files have proved a challenge to integrate. Once our datasets are through the publication phase, we hardly stop to ponder whether the science we do is reusable. What we want to leave behind as the legacy of our research is something better than the status quo.
Now that machine learning approaches are making their way into the biological sciences, it is important to note that the omics datasets we generate and publish might not contribute as data points to future research if they are unreadable by machines. Further, future machines might not be able to find the data that we painstakingly generate and analyze if our metadata is inconsistent. That is a dismal thought, given the considerable number of hours we spend working towards solutions to the biological problems each of us is passionate about. However, there is hope: consistent, curated metadata that is findable and machine-readable is within reach. Here are some ways to make your data more ‘Findable’, the first of the FAIR (Findable, Accessible, Interoperable, Reusable) principles.
1. Assign Metadata with a Globally Unique and Persistent Identifier
This principle is arguably the most important because it will be hard to achieve other aspects of FAIR without globally unique and persistent identifiers. Such identifiers remove ambiguity in the meaning of your published data by assigning a unique identifier to every element of metadata and every concept or measurement in your dataset. In this context, an identifier typically takes the form of an internet link (e.g., a URL that resolves to a web page defining the concept, such as a particular human protein).
Many data repositories will automatically generate globally unique and persistent identifiers for deposited datasets. Identifiers help other people understand exactly what you mean, and they allow computers to interpret your data in a meaningful way (for instance, computers that are searching for your data or trying to integrate it automatically). In addition, identifiers help others to properly cite your work when reusing your data.
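As a minimal sketch of what such an identifier can look like, the Python snippet below checks whether a string has the shape of a DOI written as a resolvable URL. The function name and the example DOI are hypothetical, and the pattern is only a loose approximation of DOI syntax, not an official validator. It checks the form of the identifier, not whether it actually resolves.

```python
import re

# Loose pattern for a DOI written as a URL: https://doi.org/10.<registrant>/<suffix>
# This is an illustrative approximation, not the official DOI grammar.
DOI_URL_PATTERN = re.compile(r"^https://doi\.org/10\.\d{4,9}/\S+$")


def looks_like_doi_url(identifier: str) -> bool:
    """Return True if the string has the general shape of a DOI URL."""
    return bool(DOI_URL_PATTERN.match(identifier))


# Hypothetical identifier, used only for illustration.
example_id = "https://doi.org/10.1234/example-dataset"

print(looks_like_doi_url(example_id))                 # shaped like a DOI URL
print(looks_like_doi_url("my_rnaseq_final_v2.xlsx"))  # a local file name is not
```

A local file name may be unique on your own laptop, but it is neither globally unique nor persistent, which is why it fails even this loose check.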
2. Describe Your Data with Rich Metadata
In creating FAIR digital resources, metadata can (and should) be generous and extensive, including descriptive information about the context, quality, condition, and characteristics of the data. Rich metadata allows a computer to automatically accomplish routine and tedious sorting and prioritizing tasks that currently demand a lot of attention from researchers.
The rationale behind this principle is that someone should be able to find data based on the information provided by their metadata, even without the data’s identifier. Rich metadata implies that you should not presume that you know who will want to use your data, or for what purpose. So, as a rule of thumb, you should never say ‘this metadata isn’t useful’; be generous and provide it anyway!
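To illustrate how rich metadata lets someone find a dataset without knowing its identifier, here is a small Python sketch. The records, field names, and example.org identifiers are all made up for illustration, and the exact-match search is deliberately simplistic.

```python
from typing import Dict, List

# Hypothetical metadata records; every identifier and value is invented.
records: List[Dict[str, str]] = [
    {"identifier": "https://example.org/dataset/001",
     "organism": "Homo sapiens", "tissue": "liver", "assay": "RNA-seq"},
    {"identifier": "https://example.org/dataset/002",
     "organism": "Mus musculus", "tissue": "liver", "assay": "RNA-seq"},
    {"identifier": "https://example.org/dataset/003",
     "organism": "Homo sapiens", "tissue": "kidney", "assay": "proteomics"},
]


def find_records(records: List[Dict[str, str]], **criteria: str) -> List[Dict[str, str]]:
    """Return records whose fields match every criterion (case-insensitive)."""
    matches = []
    for record in records:
        if all(record.get(field, "").lower() == value.lower()
               for field, value in criteria.items()):
            matches.append(record)
    return matches


# Search by descriptive fields only; the identifier is what we get back.
hits = find_records(records, organism="Homo sapiens", tissue="liver")
print([r["identifier"] for r in hits])
```

A search can only target fields that were recorded, so a field the author decided to leave out can never help anyone find the dataset.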
3. Explicitly Include the Identifier of the Data that is Described in the Metadata
This is a simple and obvious principle but of critical importance to FAIR. The metadata and the dataset they describe are usually separate files. The association between a metadata file and the dataset should be made explicit by mentioning a dataset’s globally unique and persistent identifier in the metadata. Many repositories will generate globally unique and persistent identifiers for deposited datasets that can be used for this purpose.
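A minimal sketch of this explicit link, in Python: the metadata record carries the identifier of the dataset it describes in a dedicated field. The field name `describes` and the example.org identifiers are assumptions made for illustration; real repositories and metadata schemas define their own field names.

```python
import json

# Hypothetical metadata record. The "describes" field holds the globally
# unique and persistent identifier of the dataset this metadata refers to.
metadata = {
    "metadata_id": "https://example.org/metadata/001",  # invented identifier
    "describes": "https://example.org/dataset/001",     # invented identifier
    "title": "Liver transcriptome, example study",
}


def describes_dataset(metadata: dict, dataset_id: str) -> bool:
    """Return True if the metadata explicitly names the given dataset."""
    return metadata.get("describes") == dataset_id


print(json.dumps(metadata, indent=2))
print(describes_dataset(metadata, "https://example.org/dataset/001"))
```

Because the link is stored inside the metadata itself, it survives even if the two files are later moved, renamed, or hosted in different places.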
4. Register and Index Your Metadata in a Searchable Resource
Identifiers and rich metadata descriptions alone will not ensure ‘findability’ on the internet. Perfectly good data resources may go unused simply because no one knows they exist. If the availability of a digital resource such as a dataset, service, or repository is not known, then nobody (and no machine) can discover it. There are many ways in which digital resources can be made discoverable, including indexing. For example, Google sends out spiders that ‘read’ web pages and automatically index them, so they then become findable in the Google search box. This is great for most ordinary searchers, but for scholarly research data, we need to be more explicit about indexing.
Metadata is extremely valuable as a search and retrieval mechanism because it enables users to target a query at a specific field. To be FAIR, metadata should be represented using a formal knowledge representation language and should use ontologies that follow the FAIR principles to standardize the metadata attributes and their values. These aspects help to ensure interoperability of the metadata and are crucial for finding online datasets based on their metadata. The tooling available to scientists who author metadata should impose appropriate restrictions on the metadata. For example, wherever a value should be a term from a specific ontology, the metadata author should be presented only with options that are valid terms from that ontology when filling in metadata.
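The kind of restriction described above can be sketched in a few lines of Python. The vocabulary here is a tiny, hypothetical stand-in; real tooling would load valid terms (and their identifiers) from an actual ontology rather than from a hard-coded dictionary.

```python
from typing import Dict, List, Set

# Hypothetical controlled vocabulary standing in for ontology terms.
ALLOWED_TERMS: Dict[str, Set[str]] = {
    "tissue": {"liver", "kidney", "heart"},
    "assay": {"RNA-seq", "proteomics"},
}


def validate_metadata(record: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field, allowed in ALLOWED_TERMS.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif value not in allowed:
            errors.append(f"{field}: '{value}' is not an allowed term")
    return errors


print(validate_metadata({"tissue": "liver", "assay": "RNA-seq"}))
print(validate_metadata({"tissue": "Liver tissue", "assay": "RNA-seq"}))
```

Rejecting free-text variants such as ‘Liver tissue’ at authoring time is what keeps later field-targeted queries reliable, since every dataset about the same tissue ends up with the same value.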