## read in data
load("../../_data/attie.core.GRCm39.v1.Rdata")
<- readRDS(url("https://churchilllab.jax.org/qtlviewer/attie/DO500HFD/dl?fileName=LiverLipids_Phenotypes_V11.rds"))
liver_lipids
## save annotations (metadata) and normalized counts
<- as.data.frame(liver_lipids$annot.samples)
sample_meta <- sample_meta |>
sample_meta ::select(ID = MouseID, sex = Sex)
dplyr
<- liver_lipids$data$norm
norm_counts
## clean up
rm(ensembl_version, K, liver_lipids, map, markers)
QTLretrievR Attie DO500 example
Introduction
This is a vignette to demonstrate using QTLretrievR with the Attie DO500 dataset that is available at QTLviewer and published in Nature Communications. We will be using the liver lipids dataset with the mice genotyped using the Neogen GigaMUGA 143k SNP array.
First we will set up our environment by loading QTLretrievR
and tidyverse
.
Next we will load in our data. Then we will do some cleanup so only the objects we need for QTLretrievR are preserved in the environment.
The markers associated with these genoprobs are from the GigaMUGA grid map, so from the .Rdata
file we only need to retain the genoprobs object, since the GigaMUGA gridfile is available through QTLretrievR
. From the liver_lipids
object, we want to retain: - annot.samples
- $data$norm
The sample annotations, and the normalized lipid counts.
Running QTLretrievR
Now that we have our data set up and R environment cleaned up, we’re ready to start running QTLretrievR. We already have our genoprobs in the correct format (founder allele probabilities). To see the general overview of how to impute genotype probabilities from GigaMUGA SNP-level calls, please see our overview vignette.
This demo will show running each of the peak calling, effects calculation, and mediation analysis steps separately. If you wish to run all three steps together, you can use the runQTL
function. If you run this option, keep in mind that the effects and mediation steps require input of a suggestive LOD score threshold, so you should filter again after identifying your suggestive and significant LOD scores.
Mapping & Peak Calling
The following mapping will be performed by passing objects located in the working environment; paths to objects could also be passed if you have them saved as individual rds files. Please note - QTLretrievR
is set up to work with data from multiple tissues. This means that even if you are only working with a single tissue, the genoprobs and phenotype matrices (expr_mats
) need to be passed in list format with the tissue type. In this example, the normalized lipid counts are set up with the lipid phenotypes as columns and the samples as rows, so we have to transpose the count matrix to pass it to mapQTL
. If your counts have already been rankZ transformed and each column is a different phenotype, you can specify rz = T
in the submission and it will skip the transformation.
## Downsample to 200 liver lipids
<- sample(colnames(norm_counts), 200, replace = FALSE)
phenotypes
## Perform QTL Mapping
<- Sys.time()
start_time
<- mapQTL(outdir = "../../_data/",
map_peaks peaks_out = "liver_lipids_GM_peaks_sub.rds",
map_out = "liver_lipids_GM_map_sub.rds",
genoprobs = list("liver" = genoprobs),
samp_meta = sample_meta,
expr_mats = list("liver" = t(norm_counts[,phenotypes])),
covar_factors = c("sex"),
gridFile = gridfile_GM,
save = "ro")
checking probabilities
min_cores not provided using 4 cores as minimum
liver
gmap complete
pmap complete
expression loaded
Working with 369 samples and 200 genes in liver.
rankZ normalized
covariates calculated
calculating peaks
SLURM job detected, using 16 cores (16 tasks, 1 cores per task).
Registering 8 cores and passing 8 cores per tissue to 1 tissue(s). Does that look right? If not please
set total_cores parameter to the number of available cores.
<- Sys.time()
end_time
- start_time end_time
Time difference of 26.05502 mins
You may notice that there were 16 cores found available on our Linux system, but only 8 were registered and used for peak calling. This is because with < 1000 phenotypes, we have found that using more than 8 cores does not appreciably decrease the amount of time used to do the peak calling.
To now briefly look at what the map_peaks
object looks like. QTLretrievR
returns everything in list format separating everything based on tissue. Since we are only using liver lipids, our tissue is liver. If we were using lipid measurements from multiple tissues, each of the lists below would have length > 1 and be named for each tissue.
Each list element of the map_peaks$maps_list
object is a list of the objects for different tissues.
names(map_peaks$maps_list)
[1] "qtlprobs" "covar_list" "expr_list" "exprZ_list" "kinship_loco"
[6] "gmap" "map_dat2" "pmap" "tissue_samp"
str(map_peaks$maps_list$qtlprobs$liver[[1]])
num [1:483, 1:8, 1:9449] 1.26e-03 8.19e-03 8.20e-03 8.20e-03 8.47e-08 ...
- attr(*, "dimnames")=List of 3
..$ : chr [1:483] "DO022" "DO023" "DO024" "DO025" ...
..$ : chr [1:8] "A" "B" "C" "D" ...
..$ : chr [1:9449] "UNC6" "UNCHS000001" "JAX00000010" "JAX00000010r" ...
str(map_peaks$maps_list$kinship_loco$liver[[1]])
num [1:483, 1:483] 0.553 0.13 0.149 0.13 0.156 ...
- attr(*, "n_pos")= int 114433
- attr(*, "dimnames")=List of 2
..$ : chr [1:483] "DO022" "DO023" "DO024" "DO025" ...
..$ : chr [1:483] "DO022" "DO023" "DO024" "DO025" ...
maps_list
preserves the following for future use in either plotting functions, or if you want to re-run a specific tissue or phenotype.
qtlprobs
: Genotype probabilities (founder allele probabilities) for each tissue (separated by chromosome within tissues, samples x allele x marker)covar_list
: Covariate dataframes for each tissue. Covariates may include sex, diet, age, batch number, etc.expr_list
: Copy of normalized expression matrices for each tissue (phenotype x samples)exprZ_list
: Copy of rankZ transformed expression matrices for each tissue (samples x phenotype)kinship_loco
: Sample kinship matrix for each tissue, calculated from the sample genoprobs using the leave one chromosome out (loco) method (separated by chromosome within tissues, samples x samples)gmap
: Genetic map of markers (centiMorgans cM, separated by chromosome)map_dat2
: Dataframe of markers, genetic and physical positions, and the order that markers appear in the genome.pmap
: Physical map of markers (basepairs bp, separated by chromosome)tissue_samp
: Copy of sample metadata separated out by tissue
str(map_peaks$peaks_list)
List of 1
$ liver:'data.frame': 906 obs. of 6 variables:
..$ phenotype: chr [1:906] "UNK_10.352_914.61176_minus" "UNK_10.352_914.61176_minus" "UNK_10.352_914.61176_minus" "UNK_10.352_914.61176_minus" ...
..$ peak_chr : Factor w/ 20 levels "1","2","3","4",..: 5 11 12 15 20 3 5 8 10 11 ...
..$ peak_bp : num [1:906] 7.68e+07 6.19e+06 1.19e+07 6.79e+07 1.59e+08 ...
..$ lod : num [1:906] 5.85 5.19 5.02 6.86 5.21 ...
..$ ci_lo : num [1:906] 72517079 3070782 3082700 65182973 12309718 ...
..$ ci_hi : num [1:906] 9.73e+07 1.16e+08 8.70e+07 7.03e+07 1.63e+08 ...
If you would like the genetic map distances (cM) rather than physical map distances (bp) for the QTL peaks, you can use interp_cM
, or you can use phys = F
when you call mapQTL
or runQTL
. By default QTLretrievR
does not return LOD scores < 5; if you would like to change that threshold, you can use the thrA
and thrX
for the autosomes and X chromosome, respectively.
summary(map_peaks$peaks_list$liver)
phenotype peak_chr peak_bp lod
Length:906 15 : 67 Min. : 3309356 Min. : 5.001
Class :character 6 : 59 1st Qu.: 43624520 1st Qu.: 5.239
Mode :character 18 : 59 Median : 76768036 Median : 5.517
11 : 57 Mean : 77905714 Mean : 5.872
5 : 56 3rd Qu.:111810346 3rd Qu.: 5.981
X : 56 Max. :193382813 Max. :28.765
(Other):552
ci_lo ci_hi
Min. : 3050998 Min. : 6461844
1st Qu.: 9575738 1st Qu.: 83157539
Median : 26815537 Median :114629680
Mean : 38118599 Mean :108771467
3rd Qu.: 57496244 3rd Qu.:140278885
Max. :189248419 Max. :195014334
We can also visualize what the peaks look like overall.
## Use a phenotype that will be used later in the vignette
peak_plot(mapping = map_peaks$maps_list,
tissue = "liver",
pheno = "UNK_17.944_965.82587_plus",
pop = "do",
psave = F)
SLURM job detected, using 16 cores (16 tasks, 1 cores per task).
NULL
And visualize the allelic effects that are occurring at a specific peak.
## Chromosome 15 based on the peak location above
## Plot effects
peak_plot(mapping = map_peaks$maps_list,
tissue = "liver",
pheno = "UNK_17.944_965.82587_plus",
pop = "do",
chrom = 15,
effects = T,
psave = F)
SLURM job detected, using 16 cores (16 tasks, 1 cores per task).
calculating effects
NULL
LOD Thresholding
Once we have calculated our peaks, it is a good idea to determine our significant and suggestive LOD thresholds. This will help us to determine which phenotypes/QTLs to investigate further. If you are performing eQTL mapping, this can also influence which peaks are included in the characterization of a distal hotspot.
Here, we use the function LOD_thld
; we will choose 5 lipid phenotypes to run 500 permutations on, and then take the median values to determine our significant (alpha = 0.05
) and suggestive (alpha = 0.4
) LOD thresholds. The default for this function is to use 10 phenotypes and run 1000 permutations each. This number is appropriate for eQTL analyses where there are many thousands of gene expression phenotypes, but is unnecessary when you only have ~1500 liver lipids.
<- Sys.time()
start_time
<- LOD_thld(mapping = map_peaks$maps_list,
thlds tissue = "liver",
n.gene = 5,
n.perm = 500)
SLURM job detected, using 16 cores (16 tasks, 1 cores per task).
Using 5 phenotypes: UNK_9.634_726.93475_minus UNK_11.811_1117.70117_minus TG_18.1_18.1_20.3_1_17.719_926.80585_plus TG_18.0_18.0_20.4_19.256_933.77148_plus UNK_1.953_482.35989_plus
Running 500 permutations on 5 genes in 1 batches.
<- Sys.time()
end_time
- start_time end_time
Time difference of 4.259334 hours
str(thlds)
List of 2
$ perms_df: 'scan1perm' num [1:500, 1:5] 6.29 6.49 5.47 6.67 5.3 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:5] "UNK_9.634_726.93475_minus" "UNK_11.811_1117.70117_minus" "TG_18.1_18.1_20.3_1_17.719_926.80585_plus" "TG_18.0_18.0_20.4_19.256_933.77148_plus" ...
$ thlds :List of 2
..$ sigLOD : num 7.7
..$ suggLOD: num 6.3
We return the permutations object as well as the significant and suggestive LODs in a named list. Our significant and suggestive LOD thresholds 7.697 and 6.297 respectively, so we will use those values in our mediation analysis and haplotype effect identification.
Next Steps
There are a number of next steps that can be taken to follow up on the peak identification and thresholding. These include ascertaining the founder strain alleles responsible for driving the observed QTL effect, and performing mediation analysis to identify potential causal intermediates that are responsible for conferring a QTL.These analyses are described below.
Founder Strain Effects at a QTL peak
We can determine the founder strain coefficients at different QTL peaks.
## Set significant and suggestive thresholds
<- thlds$thlds$sigLOD
sig_LOD <- thlds$thlds$suggLOD
sugg_LOD
## Calculate Effects
<- qtl_effects(peaks = map_peaks$peaks_list,
eff_out mapping = map_peaks$maps_list,
suggLOD = sugg_LOD,
outdir = "../../_data/",
effects_out = "liver_lipids_mm39_effects.rds",
save = "sr")
data checked
peaks extracted, calculating effects now
SLURM job detected, using 16 cores (16 tasks, 1 cores per task).
SLURM job detected, using 16 cores (16 tasks, 1 cores per task).
Registering 8 cores and passing 8 cores per tissue to 1 tissue(s).
effects calculated. saving to RDS
## Combine significant peaks with effects
<- eff_out$peaks$liver |>
peak_effects left_join(eff_out$effects_blup$liver |>
as.data.frame() |>
rownames_to_column(var = "marker"), by = join_by(marker))
Now, we are going to look at the peaks that are mapping across the genome and identify a location that has multiple phenotypes associated with it. This results in 3 phenotypes that map to the same position on chromosome 15, which we will continue to use as an example for the rest of this vignette.
## Identify peaks to use
$peaks_list$liver |>
map_peaks::filter(lod > sig_LOD) |>
dplyr::group_by(peak_chr) |>
dplyr::count() dplyr
# A tibble: 14 × 2
# Groups: peak_chr [14]
peak_chr n
<fct> <int>
1 1 1
2 2 2
3 3 1
4 5 2
5 6 1
6 7 1
7 8 1
8 10 2
9 11 4
10 12 1
11 15 10
12 17 3
13 18 2
14 X 2
$peaks_list$liver |>
map_peaks::filter(lod > sig_LOD, peak_chr == 15) |>
dplyr::arrange(peak_bp) dplyr
phenotype peak_chr peak_bp lod ci_lo
1 UNK_17.944_965.82587_plus 15 82953003 9.466032 82319925
2 TG_18.2_18.1_20.3_17.162_924.79926_plus 15 82953003 9.229160 82288820
3 TG_18.1_18.1_22.4_18.068_952.83057_plus 15 82953003 10.047826 82319925
4 UNK_18.204_901.80334_plus 15 82977704 13.624220 82277678
5 UNK_16.716_949.80194_plus 15 82977704 12.481903 82288820
6 UNK_16.381_987.81036_plus 15 83968940 9.304922 82406369
7 UNK_16.905_1015.84003_plus 15 83968940 12.324044 82531636
8 UNK_17.448_1011.75751_plus 15 84016072 10.476459 82494131
9 UNK_18.104_1018.88293_minus 15 84085545 16.516295 82288820
10 UNK_14.308_596.59662_plus 15 84374485 8.581107 83950353
ci_hi
1 84576516
2 84576516
3 84576516
4 84576516
5 84576516
6 84576516
7 84576516
8 84576516
9 84374485
10 85513369
## Set the position of the peak
<- 82953003
chr15_peak_pos
## Subset the peak in order to get the features
<- map_peaks$peaks_list$liver |>
chr15_peak_sub ::filter(lod > sig_LOD,
dplyr== 15,
peak_chr == chr15_peak_pos)
peak_bp
## Look at effects for the peaks on chromosome 15
|> dplyr::filter(phenotype %in% chr15_peak_sub$phenotype, peak_chr == 15) |>
peak_effects ::select(phenotype, peak_chr, peak_bp, lod, marker, LETTERS[1:8]) dplyr
phenotype peak_chr peak_bp lod
1 UNK_17.944_965.82587_plus 15 82953003 9.466032
2 TG_18.2_18.1_20.3_17.162_924.79926_plus 15 82953003 9.229160
3 TG_18.1_18.1_22.4_18.068_952.83057_plus 15 82953003 10.047826
marker A B C D E F
1 UNC25989251 0.06179842 0.5062977 0.1507991 0.2932127 -0.173424 -0.1548653
2 UNC25989251 0.06179842 0.5062977 0.1507991 0.2932127 -0.173424 -0.1548653
3 UNC25989251 0.06179842 0.5062977 0.1507991 0.2932127 -0.173424 -0.1548653
G H
1 -0.8792315 0.1954129
2 -0.8792315 0.1954129
3 -0.8792315 0.1954129
For those unfamiliar with the Diversity Outbred (DO) mouse population, it is an outbred population derived from 8 founder inbred strains represented above as A-H. Each letter corresponds to a different founder, in order they are: AJ, B6, 129, NOD, NZO, CAST, PWK, and WSB.
For the Chromosome 15 lipid QTL, it appears that the PWK founder allele is driving the interindividual variation that we’re seeing in these three lipid phenotypes. The next step would be to perform mediation analysis (below).
Mediation Analysis
For mediation analysis, we want to look for an expressed gene or other molecular trait (e.g., region of open chromatin, methylated region) whose interindividual variation in expression/accessibility/etc aligns with the variation in liver lipids that we are seeing for a given distal QTL peak. Identification of these “causal intermediates” provide regulator-target hypotheses for validation with additional experiments. To perform mediation analysis with transcriptomic data, we take gene expression data from a relevant tissue, liver, then re-calculate the LOD scores for the lipid QTL peaks, taking the expression of each candidate mediator gene into account as an additive covariate in the QTL mapping model. Then we look for the genes that have the largest LOD drop for a given phenotype within its peak, indicating that variation in that candidate mediator gene best explains the distal QTL-driven variation in the lipid phenotype(s).
To run mediation, we will first have to load in the mm39-aligned liver expression data, specifically the rankZ normalized data.
## Read in the expression data
<- readRDS("../../_data/dataset.liver.GRCm39.v1.rds")
liver_expr
## Use the normalized and rankZ transformed gene expression values for mediation
<- liver_expr$data$rz
liver_rz
## clean up
rm("liver_expr")
## Run mediation
<- Sys.time()
start_time
<- multi_modiFinder(peaks = map_peaks$peaks_list,
med_out mapping = map_peaks$maps_list,
exprZ = liver_rz,
suggLOD = sugg_LOD,
annots = annot_105,
outdir = "../../_data/",
med_out = "liver_lipids_mm39_meds.rds",
save = "sr")
load annotations
checking peaks and mapping
data checked
Number of common samples for liver: 369
filtered peaks
liver
running mediation
SLURM job detected, using 16 cores (16 tasks, 1 cores per task).
Registering 8 cores and passing 8 cores per tissue to 1 tissue(s).
<- Sys.time()
end_time
- start_time end_time
Time difference of 7.897316 mins
Again using our 3 phenotypes mapping to chromosome 15, we’re going to look at the genes within +/- 2Mb of the identified peak location (82953003) and identify the gene(s) whose variation in expression best explain the variation we see in that phenotype. For each phenotype, we will identify the top 5 mediators and plot the percent LOD drop in a heatmap.
<- medPlot_single(meds = med_out$liver,
chr15_peak_med_rank range = 2,
position = chr15_peak_pos,
feats = chr15_peak_sub$phenotype,
chromosome = 15,
plot = "per_drop",
psave = F)
Joining with `by = join_by(target_id, qtl_lod, qtl_chr, mediator, mediator_id,
mediator_chr, mediator_midpoint, LOD, LOD_drop)`
Joining with `by = join_by(target_id, qtl_lod, qtl_chr, mediator, mediator_id,
mediator_chr, mediator_midpoint, LOD, LOD_drop)`
Joining with `by = join_by(target_id, qtl_lod, qtl_chr, mediator, mediator_id,
mediator_chr, mediator_midpoint, LOD, LOD_drop)`
From this plot, it appears that two predicted genes, a predicted lncRNA gene Gm49463
and predicted protein coding gene 5031439G07Rik
, are strong candidate mediators for all 3 phenotypes, both corresponding to a high percent LOD drop in the distal lipid QTL peaks. To verify that, we can look at the dataframe that has the target and mediator information, including rank and significance.
## Filter ranked mediation results to candidate mediators
$meds_ranked |>
chr15_peak_med_rank::filter(mediator %in% c("Gm49463", "5031439G07Rik")) |>
dplyr::select(target_id, qtl_lod, qtl_chr, mediator, LOD, LOD_drop, padj, per_drop, ranks) dplyr
# A tibble: 6 × 9
target_id qtl_lod qtl_chr mediator LOD LOD_drop padj per_drop ranks
<chr> <dbl> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 UNK_17.944_965.8… 9.47 15 Gm49463 2.81 6.65 0 0.703 2
2 UNK_17.944_965.8… 9.47 15 5031439… 2.49 6.98 0 0.737 1
3 TG_18.2_18.1_20.… 9.23 15 Gm49463 2.28 6.95 0 0.753 2
4 TG_18.2_18.1_20.… 9.23 15 5031439… 2.17 7.06 0 0.764 1
5 TG_18.1_18.1_22.… 10.0 15 Gm49463 4.67 5.37 0 0.535 2
6 TG_18.1_18.1_22.… 10.0 15 5031439… 4.32 5.73 0 0.570 1
It looks like both genes are the first and second ranked mediators for all 3 phenotypes, with 5031439G07Rik
being consistently the top ranked and significant for each one.