The Gene Expression Omnibus (GEO), maintained by NCBI, is a public repository for the free distribution of next-generation sequencing and other high-throughput functional genomics data submitted by researchers across the world. Around 60,000 microarray and high-throughput sequencing studies are available on GEO; however, there is no effective way to find datasets of interest.
GEO's search functionality is based on keywords supplied by the researcher. The results are so varied and numerous that going through them manually is nearly impossible. Moreover, matches are based only on keywords present in the experiment design or the title of a study, which does not convey the study's full complexity.
Researchers may have a gene signature of interest, obtained through an experiment or heavily cited in the literature. We created a system, AskGEO, that takes this gene signature and finds studies across the complete GEO database in which a similar set of genes is co-expressed. In this way, AskGEO's search for relevant datasets takes into account the expression data within each dataset rather than relying only on the descriptive information provided by GEO.
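To make the idea of a signature-based search concrete, here is a minimal sketch in Python. It is not AskGEO's actual implementation: it assumes that "co-expressed" can be approximated by the mean pairwise Pearson correlation among the signature genes within a dataset, and the function names, dataset IDs, gene names, and expression values are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0


def coexpression_score(expression, signature):
    """Mean pairwise correlation among the signature genes measured in one dataset.

    `expression` maps gene symbol -> list of expression values across samples.
    Returns None when fewer than two signature genes are present.
    (Assumed scoring rule, chosen for illustration only.)
    """
    present = [g for g in signature if g in expression]
    if len(present) < 2:
        return None
    pairs = list(combinations(present, 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


def rank_datasets(datasets, signature):
    """Return (dataset_id, score) pairs sorted from most to least co-expressed."""
    scored = []
    for dataset_id, expression in datasets.items():
        score = coexpression_score(expression, signature)
        if score is not None:
            scored.append((dataset_id, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: hypothetical dataset IDs and made-up expression values.
datasets = {
    "DATASET_A": {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8], "G3": [1, 3, 5, 7]},
    "DATASET_B": {"G1": [1, 2, 3, 4], "G2": [4, 3, 2, 1], "G3": [1, 2, 1, 2]},
}
ranking = rank_datasets(datasets, ["G1", "G2", "G3"])
```

In this toy example the three genes rise and fall together in DATASET_A but not in DATASET_B, so DATASET_A is ranked first. A real system would also need the normalization across data sources that the text mentions, which this sketch leaves out.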
The user can further refine the recommendations by supplying keywords, which are searched for in the publications linked to these datasets. Combining the actual gene expression values from a dataset with the textual information in its linked publication yields a powerful tool that generates relevant suggestions. This is shown by the results obtained with two signatures, representative of two different biological conditions, that were used to validate the tool.
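The keyword refinement step could look roughly like the following Python sketch. It assumes a simple rule that is not taken from AskGEO itself: keep only datasets whose linked publication text mentions at least one single-word keyword, then order them by number of keyword hits and, within ties, by the expression-based score. All names and data below are hypothetical.

```python
import re


def keyword_hits(text, keywords):
    """Count how many keywords occur as whole words in the text (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(1 for k in keywords if k.lower() in words)


def refine(ranking, publications, keywords):
    """Filter and re-order signature-based recommendations using publication text.

    `ranking` is a list of (dataset_id, score) pairs; `publications` maps
    dataset_id -> text of the linked publication (for example, its abstract).
    Datasets with no keyword hit are dropped; the rest are sorted by hits,
    then by score. (Assumed combination rule, for illustration only.)
    """
    refined = []
    for dataset_id, score in ranking:
        hits = keyword_hits(publications.get(dataset_id, ""), keywords)
        if hits:
            refined.append((dataset_id, score, hits))
    return sorted(refined, key=lambda r: (r[2], r[1]), reverse=True)


# Toy inputs: hypothetical IDs, scores, and publication snippets.
ranking = [("DATASET_A", 0.9), ("DATASET_B", 0.7), ("DATASET_C", 0.5)]
publications = {
    "DATASET_A": "Hypoxia response in tumor cells",
    "DATASET_B": "Liver fibrosis in mice",
    "DATASET_C": "Tumor hypoxia and inflammation",
}
refined = refine(ranking, publications, ["hypoxia", "inflammation"])
```

Here DATASET_B is dropped because its publication mentions neither keyword, and DATASET_C moves ahead of DATASET_A because it matches both. Whether text matches should outrank expression scores, or be blended with them, is a design choice this sketch does not settle.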
Solution – AskGEO
Polly provides a query and search engine, AskGEO, that allows users to find the right datasets for their analysis in its Data Lakes and helps them run analyses on top of those datasets. The methodology behind it facilitates the systematic curation and processing of publicly available gene expression datasets from GEO. Here we present AskGEO, an engine that runs a signature-based and keyword-based search to help the user identify studies related to a biological phenomenon across the entire GEO repository.
Building AskGEO involved:
- Standardizing the processing of 40,000 datasets
- Building a gene co-expression database
- Using gene signatures to recommend datasets
- Validating the recommendations for two gene signatures and a random gene signature
- and more…
Get in Touch
Learn the methods used to create AskGEO, a system that generates recommendations for GEO studies based on a gene signature of interest while overcoming the bottleneck of normalizing datasets that come from different sources.
Polly manages the technology so that you can do high-level research. Book a session today to make the most of your work.