Generating data is expensive. The biopharma industry has woken up to this fact and is exploring ways to access more data to accelerate drug discovery efforts. For instance, major biopharma players like AstraZeneca, Bayer, Celgene, Janssen Research and Development, Memorial Sloan Kettering Cancer Center, and Sanofi have announced a data-sharing initiative, Project Data Sphere, whose goal is to share historical cancer research data and accelerate cancer research. Companies generally have large volumes of data, but it is not linked and contextualized in a way that enables the creation of reliable AI solutions.
Digital transformation in life science research involves using technology and algorithms to aid discovery processes. It is imperative to make data machine-readable and machine-actionable to derive valuable insights. The efficiency of this process and the scope for collaborative research depend heavily on how data is stored and organized on cloud platforms. However, some common pitfalls in data management make it difficult to find and reuse previously generated data, rendering the stored data practically useless. Collaboration - one of the main advantages of digital transformation - is also hampered by inconsistent data formats and storage. AI/ML applications work best when the digital transformation of data management is complete, i.e., when data generation, storage, and sharing are all streamlined. Here, we discuss ways to meet that end.
In the biopharma industry, data accumulates exponentially, and over time, those data become increasingly difficult to find, access, and reuse.
The FAIR (Findable, Accessible, Interoperable, and Reusable) principles provide a set of guidelines for publishing digital resources, such as datasets, code, workflows, and research objects, so that they are easy to find, access, integrate, and reuse.
Underlining the importance of following FAIR principles right from the point of data reporting, Mendeley has instituted the Mendeley Data FAIRest Datasets Award to recognize scientists who share data in a FAIR and reproducible manner! Following the framework makes it easy to find and access relevant data while ensuring consistency in structure and metadata, thus enhancing usability.
Let us look at how following these principles adds value to the digital transformation workflow.
To reuse previously generated data, a researcher first has to find data relevant to the research problem at hand. Ideally, every dataset should be associated with a unique identifier and sufficient searchable metadata. Managing this manually is extremely difficult, but it can be done relatively quickly with the right infrastructure and a set of automated checks.
Let’s see how this works in practice. On Elucidata’s cloud platform, Polly, every dataset entering the platform receives a unique identifier (called a dataset id). A defined schema specifies the bare-minimum metadata that must accompany each dataset; if that information is unavailable, the dataset is rejected by a set of automated checks. The metadata that passes these checks - the information that helps researchers find the right dataset - is then indexed and stored in a database that can be searched and queried.
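To make this concrete, here is a minimal sketch of such an automated check. The required fields and the `ingest_dataset` helper are hypothetical illustrations of the general pattern, not Polly’s actual schema or API:

```python
# A minimal sketch of an automated metadata check. The required fields
# and function names are hypothetical illustrations, not Polly's
# actual schema or API.
import uuid

REQUIRED_FIELDS = {"title", "organism", "disease", "data_type", "source"}

def ingest_dataset(metadata: dict) -> str:
    """Reject datasets missing the bare-minimum metadata; otherwise
    assign a unique dataset id and return it for indexing."""
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        raise ValueError(f"Dataset rejected; missing fields: {sorted(missing)}")
    dataset_id = str(uuid.uuid4())  # unique identifier for findability
    # In a real pipeline, the metadata would now be indexed in a search
    # database keyed by dataset_id so researchers can query it.
    return dataset_id

# Example: a dataset with complete metadata passes the check.
dataset_id = ingest_dataset({
    "title": "Lung tissue RNA-seq in COVID-19 patients",
    "organism": "Homo sapiens",
    "disease": "COVID-19",
    "data_type": "RNA-seq",
    "source": "GEO",
})
print(dataset_id)
```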
Once a relevant dataset is found, researchers should be able to access the underlying data. To achieve this, the data and the searchable metadata need to be linked together through a unique identifier.
On Polly, the dataset id is linked to both the indexed metadata and the underlying data file, so once a relevant dataset is found, its data is immediately accessible.
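As a sketch of this linkage, imagine a hypothetical index that maps each dataset id to both its searchable metadata and the location of the underlying data file (the ids, structure, and paths below are illustrative, not Polly’s internals):

```python
# A minimal sketch of linking searchable metadata and the underlying
# data file through one dataset id. The index structure, ids, and paths
# are hypothetical illustrations.
index = {
    "DS-0001": {
        "metadata": {"disease": "COVID-19", "data_type": "RNA-seq"},
        "data_uri": "s3://example-bucket/datasets/DS-0001.h5ad",
    },
}

def find_datasets(disease: str) -> list[str]:
    """Search the metadata index and return matching dataset ids."""
    return [ds_id for ds_id, entry in index.items()
            if entry["metadata"].get("disease") == disease]

def get_data_uri(dataset_id: str) -> str:
    """Resolve a dataset id to the location of its underlying data."""
    return index[dataset_id]["data_uri"]

# Finding a dataset by metadata immediately yields access to its data.
for ds_id in find_datasets("COVID-19"):
    print(ds_id, "->", get_data_uri(ds_id))
```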
FAIR states that once a dataset has been accessed, it should be readily usable in tools, pipelines, or ML models. In other words, data (and metadata) should be machine-readable. Data becomes machine-readable when all stored data follows a consistent format (all your tools, pipelines, and algorithms will be built with this consistent structure in mind). To make the metadata machine-readable, it has to be normalized using a controlled vocabulary. A machine won’t understand that Coronavirus disease, COVID-19, and infection by SARS-CoV-2 all refer to the same disease, so all three terms need to be normalized to the same name or id. Multiple publicly available ontologies provide such controlled vocabularies.
To illustrate with Polly again: we maintain one fixed format for every data type and have built NLP-based models to normalize the metadata against a fixed set of ontologies. Polly’s disease normalization model maps all COVID-19-related disease terms to a unique id (D000086382) and a unique name (COVID-19).
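A toy version of this normalization step might look like the sketch below. Polly uses NLP-based models for this; here we assume a simple hand-written synonym table that maps free-text disease terms to the single id and name mentioned above:

```python
# A toy metadata-normalization step. Polly uses NLP-based models; this
# sketch uses a hand-written synonym table mapping free-text disease
# terms to one controlled id and name (the id for COVID-19 cited above).
SYNONYMS = {
    "coronavirus disease": ("D000086382", "COVID-19"),
    "covid-19": ("D000086382", "COVID-19"),
    "infection by sars-cov-2": ("D000086382", "COVID-19"),
}

def normalize_disease(term: str) -> tuple[str, str]:
    """Map a free-text disease term to an (id, canonical name) pair."""
    key = term.strip().lower()
    if key not in SYNONYMS:
        raise KeyError(f"Unrecognized disease term: {term!r}")
    return SYNONYMS[key]

# All three source terms now resolve to the same machine-readable id.
for term in ["Coronavirus disease", "COVID-19", "infection by SARS-CoV-2"]:
    print(term, "->", normalize_disease(term))
```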
The biopharma and life science industries are inherently data-rich. The race to improve quality and manage costs has nudged the industry from being data-rich to being data-driven. Following the FAIR principles in data management can make or break your digital transformation journey!
“There is no alternative to digital transformation. Visionary companies will carve out new strategic options for themselves - those that don't adapt will fail.” - Jeff Bezos
Hooked? Of course, you must be!
Come join us at DataFAIR’22 to hear industry leaders talk about their success stories, pain points, and the practical aspects of FAIRification. We are sure you'll relate to much of it, network with peers, and learn a lot about FAIRifying data and accelerating your research.