-
Notifications
You must be signed in to change notification settings - Fork 2
Expand file tree
/
Copy pathREADME.Rmd
More file actions
236 lines (166 loc) · 7.89 KB
/
README.Rmd
File metadata and controls
236 lines (166 loc) · 7.89 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# FLASH-MM
<!-- badges: start -->
<!-- badges: end -->
FLASH-MM is a method (package name: FLASHMM) for analysis of single-cell differential expression using a linear mixed-effects model (LMM) [^1]. The mixed-effects model is a powerful tool in single-cell studies due to their ability to model intra-subject correlation and inter-subject variability.
The FLASHMM package provides three functions: *lmm*, *lmmfit*, and *flashmm*, for fitting LMMs. The *lmm* function takes summary statistics as input, whereas *lmmfit* and *flashmm* are a wrapper around *lmm* that directly processes cell-level data and computes the summary statistics internally. The *lmmfit* takes the model design matrices as input. Instead the *flashmm* takes the model formulas as input, where the design matrices are generated internally. While *lmmfit* and *flashmm* are easier to use, they have the limitation of higher memory consumption. For extremely large scale data, we can precompute the summary-level data, and then use *lmm* function to fit LMMs. FLASHMM provides *lmmtest* function to perform statistical test on the fixed effects and the contrasts of the fixed effects.
In summary, FLASHMM package provides the following main functions.
* *lmm*: fit LMM using summary-level data.
* *lmmfit*: fit LMM using model design matrices based on cell-level data.
* *flashmm*: fit LMM using model formulas based on cell-level data.
* *lmmtest*: perform statistical tests on the fixed effects and their contrasts.
* *contrast.matrix*: construct a contrast matrix of the fixed effects for various comparisons.
* *simuRNAseq*: simulate multi-sample multi-cell-type scRNA-seq data.
## Installation
You can install FLASHMM package from CRAN:
```{r echo = TRUE, results = "hide", message = FALSE}
install.packages("FLASHMM")
```
Or the development version from GitHub:
```{r echo = TRUE, results = "hide", message = FALSE}
devtools::install_github("https://github.com/Baderlab/FLASHMM")
```
## Example
This basic example shows how to use FLASHMM for analyzing single-cell differential expression. See [the package vignette](https://cran.r-project.org/package=FLASHMM) for details.
```{r}
library(FLASHMM)
```
### Simulating scRNA-seq dataset with *simuRNAseq*
Simulate a multi-sample multi-cell-cluster scRNA-seq dataset that contains 25 samples and 4 clusters (cell types) with 2 treatments.
```{r dataset}
set.seed(2412)
dat <- simuRNAseq(nGenes = 50, nCells = 1000, nsam = 25, ncls = 4, ntrt = 2, nDEgenes = 6)
names(dat)
```
The simulated data contains
* *counts*: a genes-by-cells matrix of expression counts
* *metadata*: consisting of samples (sam), cell-types (cls) and treatments (trt).
* *DEgenes*: differetially expressed (DE) genes.
### Differential expression analysis using LMM
The analyses involve following steps: LMM design, LMM fitting, and hypothesis testing.
**1. Model design**
```{r}
##Gene expression profile: log-transformed counts
Y <- log2(dat$counts + 1)
##Fixed effects
fixed <- ~ 0 + log(libsize) + cls + cls:trt
X <- model.matrix(fixed, data = dat$metadata)
##Random effects
randomS <- ~ 0 + sam
Z <- model.matrix(randomS, data = dat$metadata)
d <- ncol(Z)
```
**2. LMM fitting**
**Option 1**: Fit LMMs using model formulas based on cell-level data.
```{r}
fit1 <- flashmm(dat$counts, fixed, random = randomS, metadata = dat$meta, is.counts = TRUE)
```
**Option 2**: Fit LMMs using model design matrices based on cell-level data.
```{r}
fit2 <- lmmfit(Y, X, Z, d = d)
identical(fit1, fit2)
```
**Option 3**: Fit LMMs based on summary-level data.
```{r}
##Step 1: computing summary statistics
n <- nrow(X); d <- ncol(Z)
XX <- t(X)%*%X; XY <- t(Y%*%X)
ZX <- t(Z)%*%X; ZY <- t(Y%*%Z); ZZ <- t(Z)%*%Z
Ynorm <- rowSums(Y*Y)
##Step 2: fitting LMMs
fit3 <- lmm(XX, XY, ZX, ZY, ZZ, Ynorm = Ynorm, n = n, d = d)
identical(fit2, fit3)
```
**3. Hypothesis testing**
```{r, echo = TRUE, message = FALSE, tidy = TRUE, tidy.opts = list(width.cutoff = 80)}
##Testing coefficients (fixed effects)
fit <- fit1
test <- lmmtest(fit)
#head(test)
##The t-value and p-values are identical with those provided in the LMM fit.
range(test - cbind(t(fit$coef), t(fit$t), t(fit$p)))
fit$p[, 1:4]
#fit$coef[, 1:4]; fit$t[, 1:4]
```
**Differentially expressed (DE) genes**: The coefficients of the interactions, cls*i*
:trtB, represent the effects of treatment B versus A in a cell type, cls*i*.
```{r}
##Coefficients, t-values, and p-values for the genes specific to a cell-type.
index <- grep(":", rownames(fit$coef))
ce <- fit$coef[index, ]
tv <- fit$t[index, ]
pv <- fit$p[index, ]
out <- data.frame(
gene = rep(colnames(ce), nrow(ce)),
cluster = rep(rownames(ce), each = ncol(ce)),
coef = c(t(ce)), t = c(t(tv)), p = c(t(pv)))
##FDR.
out$FDR <- p.adjust(out$p, method = "fdr")
##The DE genes with FDR < 0.05
out[out$FDR < 0.05, ]
```
**Using contrasts**: We can make comparisons using contrasts. For example, the effects of treatment B vs A in all clusters can be tested using the contrast constructed as follows.
```{r}
ct <- numeric(ncol(X))
index <- grep("B", colnames(X))
ct[index] <- 1/length(index)
test <- lmmtest(fit, contrast = ct)
head(test)
```
## And More
### Using ML method
To use the maximum likelihood (ML) method to fit the LMM, set method = ‘ML’.
```{r LMM_ML, echo = TRUE, message = FALSE, warning = FALSE}
##Fitting LMM using ML method
fit1 <- lmmfit(Y, X, Z, d = d, method = "ML")
```
### LMM with two-component random effects
If appropriate, for example, we also take account of the measurement locations as a random effect, we may fit data using the LMM with two-component random effects.
```{r, echo = TRUE, message = FALSE, tidy = TRUE, tidy.opts = list(width.cutoff = 80)}
##Two-component random effects: Suppose the data contains different measurement locations within a sample, denoted as 'location', which are randomly generated.
set.seed(2508)
dat$metadata$location <- sample(LETTERS[1:25], nrow(X), replace = TRUE)
randomL <- ~ 0 + location
Zl <- model.matrix(randomL, data = dat$metadata)
d <- c(ncol(Z), ncol(Zl))
##Fit the LMM with two-component random effects.
fit2 <- lmmfit(Y, X, Z = cbind(Z, Zl), d = d, method = "ML")
fit3 <- flashmm(Y, fixed, random=list(randomS, randomL), metadata = dat$metadata, method = "ML")
identical(fit2, fit3)
```
### Testing variance components
We use both z-test and likelihood ratio test (LRT) to test the second variance component in the LMM with two-component random effects. Since the simulated data was generated by the LMM with single-component random effects, the second variance component should be zero. For the LRT test, the two nested models must be fitted using the same method, either REML or ML, and use the same design matrix, $X$, when using REML method.
```{r}
##(1) z-test for testing the second variance component
##Z-statistics for the second variance component
i <- grep("var2", rownames(fit2$theta))
z <- fit2$theta[i, ]/fit2$se.theta[i, ]
##One-sided z-test p-values for hypotheses:
##H0: theta <= 0 vs H1: theta > 0
p <- pnorm(z, lower.tail = FALSE)
##(2) LRT for testing the second variance component
LRT <- 2*(fit2$logLik - fit1$logLik)
pLRT <- pchisq(LRT, df = 1, lower.tail = FALSE)
##QQ-plot
qqplot(runif(length(p)), p, xlab = "Uniform quantile", ylab = "Z-test p-value", col = "blue")
abline(0, 1, col = "gray")
qqplot(runif(length(pLRT)), pLRT, xlab = "Uniform quantile", ylab = "LRT p-value", col = "blue")
abline(0, 1, col = "gray")
```
```{r}
sessionInfo()
```
# Citation
If you find FLASH-MM useful for your publication, please cite:
[^1]: Xu, C., Pouyabahar, D., Voisin, V. et al. FLASH-MM: fast and scalable single-cell differential expression analysis using linear mixed-effects models. Nat Commun 17, 2384 (2026). https://doi.org/10.1038/s41467-026-69063-2