This blog post is the second in a four-part series called ‘Current Trends in Open Data’. In our last post, we discussed the importance of data management planning. Upcoming posts will focus on data repositories and the effects of the COVID-19 pandemic on open data.
Biomedical research is not exempt from the big data revolution. Next-generation sequencing technologies produce vast amounts of biological data and are now as commonly used as wet-lab techniques. This has helped accelerate the process of drug discovery, among other applications. More importantly, data obtained from high-throughput technologies can potentially serve as a resource for the scientific community as a whole, if properly stored and shared.
Several open data repositories, such as the Gene Expression Omnibus (GEO) and MetaboLights, host various types of omics data and can be freely accessed. Yet, the practice of data sharing has not been readily adopted by researchers. What are the main barriers that hinder data sharing?
Barrier #1: Researcher attitudes
What if my ideas are stolen? What if my data is used and I don’t get credit? These are some common questions on the minds of researchers that make them wary of sharing their data in public repositories. Their concerns are fair. A lot of time and effort goes into conducting scientific experiments and credit should be given where it is due.
An easy solution would be to incentivize data sharing. Similar to citations for articles/publications, journals could provide citation metrics for datasets as well. However, this would barely scratch the surface. To ingrain data sharing as a part of the research culture, policy-level changes would have to be implemented. Major stakeholders such as funding agencies and universities should require data management plans to be put in place and followed. Journals should mandate the submission of datasets along with the research manuscript. Conducting outreach activities and awareness programs at labs would help spread the word about the importance of data sharing in an increasingly data-intensive research environment.
Barrier #2: Patient confidentiality and privacy concerns
This barrier applies specifically to research carried out in clinical settings. Human subject data may contain sensitive health information. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) provides protections to conceal patient identity and ensure privacy. However, sharing patient data in public repositories could still raise concerns about possible re-identification. Patient consent would also be required to share this type of data and may not be granted in all cases.
De-identifying or anonymizing clinical data could help prevent patient re-identification. The trust model of data sharing was devised to disseminate clinical data for research purposes and makes use of a person’s electronic health records. The intent is to protect patient privacy by eliminating information that could be traced back to the patient, while maximizing the content that can be useful to researchers.
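To make the idea of de-identification more concrete, here is a minimal Python sketch. It is an illustration only, not a HIPAA-compliant implementation: the field names (`patient_id`, `name`, `date_of_birth` and so on), the `deidentify` helper, and the choice of a salted hash are all assumptions made for this example, and real clinical data would need a far more thorough review.

```python
import hashlib

# Hypothetical field names, chosen for illustration only.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}


def deidentify(record, salt):
    """Return a copy of `record` with direct identifiers removed.

    Assumes `record` is a flat dict and `salt` is a secret string kept
    separately from the shared data.
    """
    # Drop fields that point directly at a person.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Replace the patient ID with a salted hash so that records from the
    # same patient can still be linked without exposing the original ID.
    if "patient_id" in cleaned:
        raw = (salt + str(cleaned.pop("patient_id"))).encode("utf-8")
        cleaned["pseudonym"] = hashlib.sha256(raw).hexdigest()[:16]

    # Generalize the date of birth (assumed to be YYYY-MM-DD) to a year.
    if "date_of_birth" in cleaned:
        cleaned["birth_year"] = int(str(cleaned.pop("date_of_birth"))[:4])

    return cleaned


example = {
    "patient_id": "P001",
    "name": "Jane Doe",
    "date_of_birth": "1980-05-17",
    "diagnosis": "type 2 diabetes",
}
print(deidentify(example, salt="keep-this-secret"))
```

The sketch keeps the research-relevant field (`diagnosis`) while removing or coarsening the fields that could be traced back to the patient, which mirrors the trade-off described above.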
Barrier #3: Lack of infrastructure
Next-generation sequencing technologies such as whole genome sequencing produce hundreds of gigabytes of data. Datasets of this size have to be curated and annotated with metadata in a standardized format to ensure that they can be of future use. As mentioned earlier, numerous open access repositories allow datasets to be stored and shared. However, there are currently no quality control checks or standardization protocols in place.
Data harmonization and machine-actionability should be emphasized. The FAIR (Findable, Accessible, Interoperable, Reusable) framework was drawn up with the pressing need for improved infrastructure to facilitate data reuse in mind. Data archives and repositories should promote the implementation of the FAIR principles to make datasets easier to discover.
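As a rough illustration of what a machine-actionable quality check could look like, the Python sketch below flags missing metadata before a dataset is submitted. The list of required fields and the `missing_metadata` helper are hypothetical and not taken from any particular repository or from the FAIR principles themselves; an actual repository would define its own schema.

```python
# Hypothetical minimum metadata fields, chosen for illustration only.
REQUIRED_FIELDS = {
    "title": str,
    "organism": str,
    "assay_type": str,
    "license": str,
    "identifier": str,
}


def missing_metadata(metadata):
    """Return a sorted list of required fields that are absent or empty."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = metadata.get(field)
        # A field counts as missing if it is absent, has the wrong type,
        # or contains only whitespace.
        if not isinstance(value, expected_type) or not value.strip():
            problems.append(field)
    return sorted(problems)


submission = {"title": "RNA-seq of liver tissue", "organism": "Homo sapiens"}
print(missing_metadata(submission))
# prints ['assay_type', 'identifier', 'license']
```

Even a simple check like this, run at submission time, would give researchers immediate feedback and help keep records consistent across a repository.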