The statement that follows is harsh, but it is one I have routinely heard about single cell expression data from different scientists over three years of working with the data: single cell differential expression is a sham! That might sound severe for a method that is instrumental even for the basic cell type identification step, yet at times I have felt the same way. My rational mind, however, prompted me to see the problem either as a data-method misfit or as a lack of understanding of how to use the methods properly.
Before talking about the single cell expression methods, I would first like to discuss the issues that underlie the above feeling:
As we can see from the conflicting results in the above figures, genes identified as differentially expressed (by splitting CD8+ T cells into two groups based on high and low expression of a gene of interest) didn’t show a strong Pearson correlation with each other or with the gene of interest. This is one of several odd cases that I have observed while working with single cell DE.
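The kind of check described above can be sketched in a few lines of plain Python. This is only an illustration, not the analysis behind the figures: the median cutoff, the function names `split_by_expression` and `pearson`, and the expression values are all assumptions made up for the example.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / (sx * sy)


def split_by_expression(values):
    """Label each cell 'high' or 'low' relative to the median (an assumed cutoff)."""
    cutoff = median(values)
    return ["high" if v > cutoff else "low" for v in values]


# Hypothetical normalized expression for six cells (made-up numbers).
gene_of_interest = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0]
candidate_de_gene = [1.0, 0.0, 2.0, 1.0, 3.0, 2.0]

groups = split_by_expression(gene_of_interest)
r = pearson(gene_of_interest, candidate_de_gene)
print(groups)
print(round(r, 3))
```

A gene reported as differentially expressed between the "high" and "low" groups would be expected to track the gene of interest, so a weak correlation for such a gene is a warning sign worth investigating.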
Let’s see whether others have faced similar issues with single cell differential expression. Multiple studies (Wang et al, BMC Bioinformatics, 2019; Dal Molin et al, Frontiers in Genetics, 2017; Das et al, Genes, 2021; Squair et al, Nature Communications, 2021) have highlighted that the performance of single cell DE methods is subjective and data dependent. Additionally, the agreement between popular DE methods can vary widely, as shown by Wang et al, BMC Bioinformatics, 2019.
From these results, it is evident that there is very low concordance between different methods.
As I started exploring methods, their comparative analysis, and correctness, I realized that there are essential inherent properties of single cell data that give rise to challenges in getting accurate differential expression results:
Squair et al, Nature Communications, 2021 looked for ways to address the high number of false positives in single cell differential expression. They reasoned that datasets in which the same population of purified cells has been profiled by both bulk and single cell sequencing can be used to understand the discrepancies. They found 18 such datasets in the public domain and treated them as gold standard datasets for comparison. Collectively, they identified that:
Pseudo-bulk methods aggregate the expression of cells from each group within a biological replicate. Doing so reduces the overall zero inflation. Additionally, Murphy et al, Nature Communications, 2022 showed that pseudo-bulk methods tend to perform better than other methods for single cell DE.
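To make the aggregation step concrete, here is a minimal sketch in plain Python. It assumes that aggregation means summing raw counts over all cells that share a replicate and a group; the helper names `pseudobulk` and `zero_fraction` and the tiny count matrix are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def pseudobulk(counts, replicate, group):
    """Sum per-cell counts into one profile per (replicate, group) pair.

    counts    : list of per-cell count vectors (one number per gene)
    replicate : replicate label of each cell
    group     : group label of each cell (e.g. condition or cell type)
    """
    n_genes = len(counts[0])
    profiles = defaultdict(lambda: [0] * n_genes)
    for cell, rep, grp in zip(counts, replicate, group):
        profile = profiles[(rep, grp)]
        for g, value in enumerate(cell):
            profile[g] += value
    return dict(profiles)


def zero_fraction(vectors):
    """Fraction of entries equal to zero across a collection of count vectors."""
    values = [v for vec in vectors for v in vec]
    return sum(v == 0 for v in values) / len(values)


# Made-up counts: 6 cells x 3 genes.
counts = [
    [0, 3, 0],
    [1, 0, 0],
    [0, 2, 0],
    [0, 0, 1],
    [2, 0, 0],
    [0, 1, 0],
]
replicate = ["r1", "r1", "r1", "r2", "r2", "r2"]
group = ["A", "A", "B", "A", "B", "B"]

bulk = pseudobulk(counts, replicate, group)
print(bulk)
print(zero_fraction(counts), zero_fraction(bulk.values()))
```

The aggregated profiles contain proportionally fewer zeros than the per-cell vectors, and they can then be passed to a bulk-style DE test with one sample per replicate and group.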
However, one can also argue that pseudo-bulk methods are only comparatively better than the others and still perform poorly in absolute terms, as is evident from the low area under the concordance curve (AUCC) even for the pseudo-bulk based methods.
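For readers unfamiliar with the AUCC, the sketch below shows one common way such a concordance score is computed. It is written from my understanding of the metric rather than taken from the authors' code: two gene lists are ranked from most to least significant, the overlap between their top-i lists is counted for every i up to k, and the summed overlap is divided by the maximum possible value. The function name `aucc` and the default `k=500` are assumptions.

```python
def aucc(ranking_a, ranking_b, k=500):
    """Area under the concordance curve between two gene rankings.

    Each ranking lists unique gene names from most to least significant.
    For every cutoff i = 1..k, count the genes shared by the two top-i lists,
    sum those counts, and divide by the best possible total k * (k + 1) / 2.
    """
    k = min(k, len(ranking_a), len(ranking_b))
    if k == 0:
        raise ValueError("both rankings must contain at least one gene")
    area = 0
    for i in range(1, k + 1):
        area += len(set(ranking_a[:i]) & set(ranking_b[:i]))
    return area / (k * (k + 1) / 2)


# Hypothetical rankings from a bulk analysis and a single cell analysis.
bulk_rank = ["g1", "g2", "g3", "g4"]
single_cell_rank = ["g2", "g1", "g4", "g3"]
print(aucc(bulk_rank, single_cell_rank))
```

A score of 1 means the two rankings agree at every cutoff, and a score of 0 means they share no genes among their top k.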
Comparing the non-aggregation methods with the pseudo-bulk methods, the authors found that the false positives reported by the former are usually highly expressed genes. Thus, even when the actual difference between two groups of cells is minimal for a highly expressed gene, it can still be falsely called differentially expressed. Conversely, the false negatives missed by non-aggregation methods are usually lowly expressed genes. To test whether aggregation plays a role, the authors skipped the aggregation step in the pseudo-bulk methods and treated each cell as a sample when computing differential expression. This produced more false positives, again among highly expressed genes, which validated their hypothesis.
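A toy calculation can give some intuition for why treating every cell as a sample inflates significance. The sketch below is not the authors' experiment: it uses made-up values for a single gene, Welch's t statistic as a stand-in for a DE test, and replicate means as a stand-in for aggregation. The helper name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic for the difference in means (y minus x)."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(y) - mean(x)) / se


# Made-up expression of one gene; each inner list is one biological replicate.
group_a = [[9.0, 11.0, 9.0, 11.0], [13.0, 15.0, 13.0, 15.0]]
group_b = [[11.0, 13.0, 11.0, 13.0], [15.0, 17.0, 15.0, 17.0]]

# (1) Every cell treated as an independent sample: 8 "samples" per group.
cells_a = [v for rep in group_a for v in rep]
cells_b = [v for rep in group_b for v in rep]
t_cells = welch_t(cells_a, cells_b)

# (2) One value per replicate: 2 samples per group.
t_reps = welch_t([mean(r) for r in group_a], [mean(r) for r in group_b])

print(round(t_cells, 3), round(t_reps, 3))
```

The difference in means is the same in both cases, but the cell-level test divides by a much smaller standard error because it counts cells from the same replicate as independent observations. The gap widens as more cells are sequenced per replicate.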
Single cell DE lets us compare groups of cells drawn from different biological replicates and groups. However, if the groups of cells are formed without regard to the sample and replicate each cell comes from, even pseudo-bulk methods perform worse. When the authors mixed cells from different replicates into different groups, the accuracy of the pseudo-bulk methods was lost.
From all the studies that I read through, it is evident that no single method fits all datasets well. Even the best methods carry tradeoffs for some set of genes. Based on these studies, the best practices that can currently be adopted for single cell DE are:
More than 100 DE methods are available for single cell data right now, but only a handful of them are used across most studies. The inherent properties of single cell data require us to look at the statistical basis of these methods in a new light. Doing so will help us repurpose existing methods or identify new ones that compute single cell differential expression accurately.
P.S.: This post was originally part of a blog series on DecodeBox, written by Ayush Praveen, a Bioinformatics Scientist at Elucidata.
References