Generating data is expensive. The biopharma industry has woken up to this fact and is exploring ways to access more data to accelerate drug discovery efforts. For instance, major biopharma players like AstraZeneca, Bayer, Celgene, Janssen Research and Development, Memorial Sloan Kettering Cancer Center, and Sanofi have announced a data-sharing initiative, Project Data Sphere, whose goal is to share historical cancer research data and accelerate cancer research. Companies generally have large volumes of data, but it is not linked and contextualized in a way that enables the creation of reliable AI solutions.
Digital transformation in life science research involves using technology and algorithms to aid discovery processes. It is imperative to make data machine-readable and machine-actionable to derive valuable insights. The efficiency of this process and the scope for collaborative research depend heavily on how data is stored and organized on cloud platforms. However, some common pitfalls in data management make it difficult to find and reuse previously generated data, rendering the stored data practically useless. Collaboration - one of the main advantages of digital transformation - is also hampered by inconsistent data formats and storage. AI/ML applications work best when the digital transformation of data management is complete, i.e., when data generation, storage, and sharing are all streamlined. Here, we discuss ways to meet that end.
In the biopharma industry, data accumulates exponentially, and over time, those data become increasingly difficult to find, access, and reuse.
The FAIR (Findable, Accessible, Interoperable, and Reusable) principles provide a set of guidelines for publishing digital resources, such as datasets, code, workflows, and research objects, so that they are easy to find, access, integrate, and reuse.
Underlining the importance of following FAIR principles right from the point of data reporting, Mendeley has instituted the Mendeley Data FAIRest Datasets Award to recognize scientists who share data in a FAIR and reproducible manner! Following the framework makes it easy to find and access relevant data while ensuring consistency in structure and metadata, thus enhancing usability.
Let us look at how following these principles adds value to the digital transformation workflow.
To reuse previously generated data, a researcher first has to find data relevant to the research problem at hand. Ideally, every dataset should be associated with a unique identifier and sufficient searchable metadata. Managing this manually is extremely difficult, but it can be done relatively quickly with the right infrastructure and a set of automated checks.
Let’s see how this works in practice. On Elucidata’s cloud platform, Polly, every dataset entering the platform receives a unique identifier (called a dataset id). A defined schema specifies the bare-minimum metadata that must accompany each dataset; if that information is unavailable, the dataset is rejected by a set of automated checks. The metadata that passes these checks - the information that helps researchers find the right dataset - is then indexed and stored in a database that can be searched and queried.
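To make this concrete, here is a minimal sketch of such an automated check. The required fields and the `ingest_dataset` helper are hypothetical illustrations of the general pattern, not Polly’s actual schema or API:

```python
# A minimal sketch of an automated metadata check. The required fields
# and function names are hypothetical illustrations, not Polly's
# actual schema or API.
import uuid

REQUIRED_FIELDS = {"title", "organism", "disease", "data_type", "source"}

def ingest_dataset(metadata: dict) -> str:
    """Reject datasets missing the bare-minimum metadata; otherwise
    assign a unique dataset id and return it for indexing."""
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        raise ValueError(f"Dataset rejected; missing fields: {sorted(missing)}")
    dataset_id = str(uuid.uuid4())  # unique identifier for findability
    # In a real pipeline, the metadata would now be indexed in a search
    # database keyed by dataset_id so researchers can query it.
    return dataset_id

# Example: a dataset with complete metadata passes the check.
dataset_id = ingest_dataset({
    "title": "Lung tissue RNA-seq in COVID-19 patients",
    "organism": "Homo sapiens",
    "disease": "COVID-19",
    "data_type": "RNA-seq",
    "source": "GEO",
})
print(dataset_id)
```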
Once a relevant dataset is found, researchers should be able to access the underlying data. To achieve this, the data and the searchable metadata need to be linked together through a unique identifier.
On Polly, the dataset id is linked to both the indexed metadata and the underlying data file, so once a relevant dataset is found, its data is immediately accessible.
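As a sketch of this linkage, imagine a hypothetical index that maps each dataset id to both its searchable metadata and the location of the underlying data file (the ids, structure, and paths below are illustrative, not Polly’s internals):

```python
# A minimal sketch of linking searchable metadata and the underlying
# data file through one dataset id. The index structure, ids, and paths
# are hypothetical illustrations.
index = {
    "DS-0001": {
        "metadata": {"disease": "COVID-19", "data_type": "RNA-seq"},
        "data_uri": "s3://example-bucket/datasets/DS-0001.h5ad",
    },
}

def find_datasets(disease: str) -> list[str]:
    """Search the metadata index and return matching dataset ids."""
    return [ds_id for ds_id, entry in index.items()
            if entry["metadata"].get("disease") == disease]

def get_data_uri(dataset_id: str) -> str:
    """Resolve a dataset id to the location of its underlying data."""
    return index[dataset_id]["data_uri"]

# Finding a dataset by metadata immediately yields access to its data.
for ds_id in find_datasets("COVID-19"):
    print(ds_id, "->", get_data_uri(ds_id))
```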
FAIR states that once a dataset has been accessed, it should be readily usable in tools, pipelines, or ML models. In other words, data (and metadata) should be machine-readable. Data becomes machine-readable when all stored data follows a consistent format (all your tools, pipelines, and algorithms will be built with this consistent structure in mind). To make the metadata machine-readable, it has to be normalized using a controlled vocabulary. A machine won’t understand that Coronavirus disease, COVID-19, and infection by SARS-CoV-2 all refer to the same disease, so all three terms need to be normalized to the same name or id. Multiple publicly available ontologies provide such controlled vocabularies.
To illustrate with Polly again: we maintain one fixed format for every data type and have built NLP-based models to normalize the metadata against a fixed set of ontologies. Polly’s disease normalization model maps all COVID-19-related disease terms to a unique id (D000086382) and a unique name (COVID-19).
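A toy version of this normalization step might look like the sketch below. Polly uses NLP-based models for this; here we assume a simple hand-written synonym table that maps free-text disease terms to the single id and name mentioned above:

```python
# A toy metadata-normalization step. Polly uses NLP-based models; this
# sketch uses a hand-written synonym table mapping free-text disease
# terms to one controlled id and name (the id for COVID-19 cited above).
SYNONYMS = {
    "coronavirus disease": ("D000086382", "COVID-19"),
    "covid-19": ("D000086382", "COVID-19"),
    "infection by sars-cov-2": ("D000086382", "COVID-19"),
}

def normalize_disease(term: str) -> tuple[str, str]:
    """Map a free-text disease term to an (id, canonical name) pair."""
    key = term.strip().lower()
    if key not in SYNONYMS:
        raise KeyError(f"Unrecognized disease term: {term!r}")
    return SYNONYMS[key]

# All three source terms now resolve to the same machine-readable id.
for term in ["Coronavirus disease", "COVID-19", "infection by SARS-CoV-2"]:
    print(term, "->", normalize_disease(term))
```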
The biopharma and life science industries are inherently data-rich. The race to improve quality and manage costs has nudged the industry from being data-rich to being data-driven. Following the FAIR principles in data management can make or break your digital transformation journey!
“There is no alternative to digital transformation. Visionary companies will carve out new strategic options for themselves - those that don't adapt will fail.” - Jeff Bezos
Hooked? Of course, you must be!
Come join us at DataFAIR’22 to hear industry leaders talk about their success stories, pain points, and the practical aspects of FAIRification. We are sure you'll relate to much of it, network with peers, and learn a lot about FAIRifying data and accelerating your research.