QTLretrievR Attie DO500 example

Author

Hannah Dewey

Published

September 17, 2025

Introduction

This is a vignette to demonstrate using QTLretrievR with the Attie DO500 dataset that is available at QTLviewer and published in Nature Communications. We will be using the liver lipids dataset with the mice genotyped using the Neogen GigaMUGA 143k SNP array.

First we will set up our environment by loading QTLretrievR and tidyverse.

Next we will load in our data. Then we will do some cleanup so only the objects we need for QTLretrievR are preserved in the environment.

The markers associated with these genoprobs are from the GigaMUGA grid map, so from the .Rdata file we only need to retain the genoprobs object, since the GigaMUGA gridfile is available through QTLretrievR. From the liver_lipids object, we want to retain: - annot.samples - $data$norm The sample annotations, and the normalized lipid counts.

## read in data
load("../../_data/attie.core.GRCm39.v1.Rdata")
liver_lipids <- readRDS(url("https://churchilllab.jax.org/qtlviewer/attie/DO500HFD/dl?fileName=LiverLipids_Phenotypes_V11.rds"))

## save annotations (metadata) and normalized counts
sample_meta <- as.data.frame(liver_lipids$annot.samples)
sample_meta <- sample_meta |> 
  dplyr::select(ID = MouseID, sex = Sex)

norm_counts <- liver_lipids$data$norm

## clean up
rm(ensembl_version, K, liver_lipids, map, markers)

Running QTLretrievR

Now that we have our data set up and R environment cleaned up, we’re ready to start running QTLretrievR. We already have our genoprobs in the correct format (founder allele probabilities). To see the general overview of how to impute genotype probabilities from GigaMUGA SNP-level calls, please see our overview vignette.

This demo will show running each of the peak calling, effects calculation, and mediation analysis steps separately. If you wish to run all three steps together, you can use the runQTL function. If you run this option, keep in mind that the effects and mediation steps require input of a suggestive LOD score threshold, so you should filter again after identifying your suggestive and significant LOD scores.

Mapping & Peak Calling

The following mapping will be performed by passing objects located in the working environment; paths to objects could also be passed if you have them saved as individual rds files. Please note - QTLretrievR is set up to work with data from multiple tissues. This means that even if you are only working with a single tissue, the genoprobs and phenotype matrices (expr_mats) need to be passed in list format with the tissue type. In this example, the normalized lipid counts are set up with the lipid phenotypes as columns and the samples as rows, so we have to transpose the count matrix to pass it to mapQTL. If your counts have already been rankZ transformed and each column is a different phenotype, you can specify rz = T in the submission and it will skip the transformation.

## Downsample to 200 liver lipids
phenotypes <- sample(colnames(norm_counts), 200, replace = FALSE)

## Perform QTL Mapping
start_time <- Sys.time()

map_peaks <- mapQTL(outdir        = "../../_data/",
                    peaks_out     = "liver_lipids_GM_peaks_sub.rds",
                    map_out       = "liver_lipids_GM_map_sub.rds",
                    genoprobs     = list("liver" = genoprobs),
                    samp_meta     = sample_meta,
                    expr_mats     = list("liver" = t(norm_counts[,phenotypes])),
                    covar_factors = c("sex"),
                    gridFile      = gridfile_GM,
                    save          = "ro")

checking probabilities

min_cores not provided using 4 cores as minimum

liver

gmap complete

pmap complete

expression loaded

Working with 369 samples and 200 genes in liver.

rankZ normalized

covariates calculated

calculating peaks

SLURM job detected, using 16 cores (16 tasks, 1 cores per task).

Registering 8 cores and passing 8 cores per tissue to 1 tissue(s). Does that look right? If not please
                 set total_cores parameter to the number of available cores.

end_time <- Sys.time()

end_time - start_time

Time difference of 26.05502 mins

You may notice that there were 16 cores found available on our Linux system, but only 8 were registered and used for peak calling. This is because with < 1000 phenotypes, we have found that using more than 8 cores does not appreciably decrease the amount of time used to do the peak calling.

To now briefly look at what the map_peaks object looks like. QTLretrievR returns everything in list format separating everything based on tissue. Since we are only using liver lipids, our tissue is liver. If we were using lipid measurements from multiple tissues, each of the lists below would have length > 1 and be named for each tissue.

Each list element of the map_peaks$maps_list object is a list of the objects for different tissues.

names(map_peaks$maps_list)

[1] "qtlprobs"     "covar_list"   "expr_list"    "exprZ_list"   "kinship_loco"
[6] "gmap"         "map_dat2"     "pmap"         "tissue_samp"

str(map_peaks$maps_list$qtlprobs$liver[[1]])

 num [1:483, 1:8, 1:9449] 1.26e-03 8.19e-03 8.20e-03 8.20e-03 8.47e-08 ...
 - attr(*, "dimnames")=List of 3
  ..$ : chr [1:483] "DO022" "DO023" "DO024" "DO025" ...
  ..$ : chr [1:8] "A" "B" "C" "D" ...
  ..$ : chr [1:9449] "UNC6" "UNCHS000001" "JAX00000010" "JAX00000010r" ...

str(map_peaks$maps_list$kinship_loco$liver[[1]])

 num [1:483, 1:483] 0.553 0.13 0.149 0.13 0.156 ...
 - attr(*, "n_pos")= int 114433
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:483] "DO022" "DO023" "DO024" "DO025" ...
  ..$ : chr [1:483] "DO022" "DO023" "DO024" "DO025" ...

maps_list preserves the following for future use in either plotting functions, or if you want to re-run a specific tissue or phenotype.

qtlprobs: Genotype probabilities (founder allele probabilities) for each tissue (separated by chromosome within tissues, samples x allele x marker)
covar_list: Covariate dataframes for each tissue. Covariates may include sex, diet, age, batch number, etc.
expr_list: Copy of normalized expression matrices for each tissue (phenotype x samples)
exprZ_list: Copy of rankZ transformed expression matrices for each tissue (samples x phenotype)
kinship_loco: Sample kinship matrix for each tissue, calculated from the sample genoprobs using the leave one chromosome out (loco) method (separated by chromosome within tissues, samples x samples)
gmap: Genetic map of markers (centiMorgans cM, separated by chromosome)
map_dat2: Dataframe of markers, genetic and physical positions, and the order that markers appear in the genome.
pmap: Physical map of markers (basepairs bp, separated by chromosome)
tissue_samp: Copy of sample metadata separated out by tissue

str(map_peaks$peaks_list)

List of 1
 $ liver:'data.frame':  906 obs. of  6 variables:
  ..$ phenotype: chr [1:906] "UNK_10.352_914.61176_minus" "UNK_10.352_914.61176_minus" "UNK_10.352_914.61176_minus" "UNK_10.352_914.61176_minus" ...
  ..$ peak_chr : Factor w/ 20 levels "1","2","3","4",..: 5 11 12 15 20 3 5 8 10 11 ...
  ..$ peak_bp  : num [1:906] 7.68e+07 6.19e+06 1.19e+07 6.79e+07 1.59e+08 ...
  ..$ lod      : num [1:906] 5.85 5.19 5.02 6.86 5.21 ...
  ..$ ci_lo    : num [1:906] 72517079 3070782 3082700 65182973 12309718 ...
  ..$ ci_hi    : num [1:906] 9.73e+07 1.16e+08 8.70e+07 7.03e+07 1.63e+08 ...

If you would like the genetic map distances (cM) rather than physical map distances (bp) for the QTL peaks, you can use interp_cM, or you can use phys = F when you call mapQTL or runQTL. By default QTLretrievR does not return LOD scores < 5; if you would like to change that threshold, you can use the thrA and thrX for the autosomes and X chromosome, respectively.

summary(map_peaks$peaks_list$liver)

  phenotype            peak_chr      peak_bp               lod        
 Length:906         15     : 67   Min.   :  3309356   Min.   : 5.001  
 Class :character   6      : 59   1st Qu.: 43624520   1st Qu.: 5.239  
 Mode  :character   18     : 59   Median : 76768036   Median : 5.517  
                    11     : 57   Mean   : 77905714   Mean   : 5.872  
                    5      : 56   3rd Qu.:111810346   3rd Qu.: 5.981  
                    X      : 56   Max.   :193382813   Max.   :28.765  
                    (Other):552                                       
     ci_lo               ci_hi          
 Min.   :  3050998   Min.   :  6461844  
 1st Qu.:  9575738   1st Qu.: 83157539  
 Median : 26815537   Median :114629680  
 Mean   : 38118599   Mean   :108771467  
 3rd Qu.: 57496244   3rd Qu.:140278885  
 Max.   :189248419   Max.   :195014334

We can also visualize what the peaks look like overall.

## Use a phenotype that will be used later in the vignette
peak_plot(mapping = map_peaks$maps_list,
          tissue  = "liver",
          pheno   = "UNK_17.944_965.82587_plus", 
          pop     = "do",
          psave   = F)

SLURM job detected, using 16 cores (16 tasks, 1 cores per task).

NULL

And visualize the allelic effects that are occurring at a specific peak.

## Chromosome 15 based on the peak location above
## Plot effects
peak_plot(mapping = map_peaks$maps_list,
          tissue  = "liver",
          pheno   = "UNK_17.944_965.82587_plus", 
          pop     = "do",
          chrom = 15, 
          effects = T,
          psave   = F)

SLURM job detected, using 16 cores (16 tasks, 1 cores per task).

calculating effects

NULL

LOD Thresholding

Once we have calculated our peaks, it is a good idea to determine our significant and suggestive LOD thresholds. This will help us to determine which phenotypes/QTLs to investigate further. If you are performing eQTL mapping, this can also influence which peaks are included in the characterization of a distal hotspot.

Here, we use the function LOD_thld; we will choose 5 lipid phenotypes to run 500 permutations on, and then take the median values to determine our significant (alpha = 0.05) and suggestive (alpha = 0.4) LOD thresholds. The default for this function is to use 10 phenotypes and run 1000 permutations each. This number is appropriate for eQTL analyses where there are many thousands of gene expression phenotypes, but is unnecessary when you only have ~1500 liver lipids.

start_time <- Sys.time()

thlds <- LOD_thld(mapping = map_peaks$maps_list,
                  tissue  = "liver", 
                  n.gene  = 5, 
                  n.perm  = 500)

SLURM job detected, using 16 cores (16 tasks, 1 cores per task).

Using 5 phenotypes: UNK_9.634_726.93475_minus  UNK_11.811_1117.70117_minus  TG_18.1_18.1_20.3_1_17.719_926.80585_plus  TG_18.0_18.0_20.4_19.256_933.77148_plus  UNK_1.953_482.35989_plus

Running 500 permutations on 5 genes in 1 batches.

end_time <- Sys.time()

end_time - start_time

Time difference of 4.259334 hours

str(thlds)

List of 2
 $ perms_df: 'scan1perm' num [1:500, 1:5] 6.29 6.49 5.47 6.67 5.3 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:5] "UNK_9.634_726.93475_minus" "UNK_11.811_1117.70117_minus" "TG_18.1_18.1_20.3_1_17.719_926.80585_plus" "TG_18.0_18.0_20.4_19.256_933.77148_plus" ...
 $ thlds   :List of 2
  ..$ sigLOD : num 7.7
  ..$ suggLOD: num 6.3

We return the permutations object as well as the significant and suggestive LODs in a named list. Our significant and suggestive LOD thresholds 7.697 and 6.297 respectively, so we will use those values in our mediation analysis and haplotype effect identification.

Next Steps

There are a number of next steps that can be taken to follow up on the peak identification and thresholding. These include ascertaining the founder strain alleles responsible for driving the observed QTL effect, and performing mediation analysis to identify potential causal intermediates that are responsible for conferring a QTL.These analyses are described below.

Founder Strain Effects at a QTL peak

We can determine the founder strain coefficients at different QTL peaks.

## Set significant and suggestive thresholds
sig_LOD <- thlds$thlds$sigLOD
sugg_LOD <- thlds$thlds$suggLOD

## Calculate Effects
eff_out <- qtl_effects(peaks       = map_peaks$peaks_list,
                       mapping     = map_peaks$maps_list,
                       suggLOD     = sugg_LOD, 
                       outdir      = "../../_data/", 
                       effects_out = "liver_lipids_mm39_effects.rds", 
                       save        = "sr")

data checked

peaks extracted, calculating effects now

SLURM job detected, using 16 cores (16 tasks, 1 cores per task).
SLURM job detected, using 16 cores (16 tasks, 1 cores per task).

Registering 8 cores and passing 8 cores per tissue to 1 tissue(s).

effects calculated. saving to RDS

## Combine significant peaks with effects
peak_effects <- eff_out$peaks$liver |> 
  left_join(eff_out$effects_blup$liver |> 
              as.data.frame() |> 
              rownames_to_column(var = "marker"), by = join_by(marker))

Now, we are going to look at the peaks that are mapping across the genome and identify a location that has multiple phenotypes associated with it. This results in 3 phenotypes that map to the same position on chromosome 15, which we will continue to use as an example for the rest of this vignette.

## Identify peaks to use
map_peaks$peaks_list$liver |> 
  dplyr::filter(lod > sig_LOD) |> 
  dplyr::group_by(peak_chr) |> 
  dplyr::count()

# A tibble: 14 × 2
# Groups:   peak_chr [14]
   peak_chr     n
   <fct>    <int>
 1 1            1
 2 2            2
 3 3            1
 4 5            2
 5 6            1
 6 7            1
 7 8            1
 8 10           2
 9 11           4
10 12           1
11 15          10
12 17           3
13 18           2
14 X            2

map_peaks$peaks_list$liver |> 
  dplyr::filter(lod > sig_LOD, peak_chr == 15) |> 
  dplyr::arrange(peak_bp)

                                 phenotype peak_chr  peak_bp       lod    ci_lo
1                UNK_17.944_965.82587_plus       15 82953003  9.466032 82319925
2  TG_18.2_18.1_20.3_17.162_924.79926_plus       15 82953003  9.229160 82288820
3  TG_18.1_18.1_22.4_18.068_952.83057_plus       15 82953003 10.047826 82319925
4                UNK_18.204_901.80334_plus       15 82977704 13.624220 82277678
5                UNK_16.716_949.80194_plus       15 82977704 12.481903 82288820
6                UNK_16.381_987.81036_plus       15 83968940  9.304922 82406369
7               UNK_16.905_1015.84003_plus       15 83968940 12.324044 82531636
8               UNK_17.448_1011.75751_plus       15 84016072 10.476459 82494131
9              UNK_18.104_1018.88293_minus       15 84085545 16.516295 82288820
10               UNK_14.308_596.59662_plus       15 84374485  8.581107 83950353
      ci_hi
1  84576516
2  84576516
3  84576516
4  84576516
5  84576516
6  84576516
7  84576516
8  84576516
9  84374485
10 85513369

## Set the position of the peak
chr15_peak_pos <- 82953003

## Subset the peak in order to get the features
chr15_peak_sub <- map_peaks$peaks_list$liver |> 
  dplyr::filter(lod      >  sig_LOD, 
                peak_chr == 15, 
                peak_bp  == chr15_peak_pos)

## Look at effects for the peaks on chromosome 15
peak_effects |> dplyr::filter(phenotype %in% chr15_peak_sub$phenotype, peak_chr == 15) |> 
  dplyr::select(phenotype, peak_chr, peak_bp, lod, marker, LETTERS[1:8])

                                phenotype peak_chr  peak_bp       lod
1               UNK_17.944_965.82587_plus       15 82953003  9.466032
2 TG_18.2_18.1_20.3_17.162_924.79926_plus       15 82953003  9.229160
3 TG_18.1_18.1_22.4_18.068_952.83057_plus       15 82953003 10.047826
       marker          A         B         C         D         E          F
1 UNC25989251 0.06179842 0.5062977 0.1507991 0.2932127 -0.173424 -0.1548653
2 UNC25989251 0.06179842 0.5062977 0.1507991 0.2932127 -0.173424 -0.1548653
3 UNC25989251 0.06179842 0.5062977 0.1507991 0.2932127 -0.173424 -0.1548653
           G         H
1 -0.8792315 0.1954129
2 -0.8792315 0.1954129
3 -0.8792315 0.1954129

For those unfamiliar with the Diversity Outbred (DO) mouse population, it is an outbred population derived from 8 founder inbred strains represented above as A-H. Each letter corresponds to a different founder, in order they are: AJ, B6, 129, NOD, NZO, CAST, PWK, and WSB.

For the Chromosome 15 lipid QTL, it appears that the PWK founder allele is driving the interindividual variation that we’re seeing in these three lipid phenotypes. The next step would be to perform mediation analysis (below).

Mediation Analysis

For mediation analysis, we want to look for an expressed gene or other molecular trait (e.g., region of open chromatin, methylated region) whose interindividual variation in expression/accessibility/etc aligns with the variation in liver lipids that we are seeing for a given distal QTL peak. Identification of these “causal intermediates” provide regulator-target hypotheses for validation with additional experiments. To perform mediation analysis with transcriptomic data, we take gene expression data from a relevant tissue, liver, then re-calculate the LOD scores for the lipid QTL peaks, taking the expression of each candidate mediator gene into account as an additive covariate in the QTL mapping model. Then we look for the genes that have the largest LOD drop for a given phenotype within its peak, indicating that variation in that candidate mediator gene best explains the distal QTL-driven variation in the lipid phenotype(s).

To run mediation, we will first have to load in the mm39-aligned liver expression data, specifically the rankZ normalized data.

## Read in the expression data
liver_expr <- readRDS("../../_data/dataset.liver.GRCm39.v1.rds")

## Use the normalized and rankZ transformed gene expression values for mediation
liver_rz <- liver_expr$data$rz

## clean up
rm("liver_expr")

## Run mediation
start_time <- Sys.time()

med_out <- multi_modiFinder(peaks   = map_peaks$peaks_list, 
                            mapping = map_peaks$maps_list, 
                            exprZ   = liver_rz, 
                            suggLOD = sugg_LOD, 
                            annots  = annot_105, 
                            outdir  = "../../_data/", 
                            med_out = "liver_lipids_mm39_meds.rds", 
                            save    = "sr")

load annotations

checking peaks and mapping

data checked

Number of common samples for liver: 369

filtered peaks

liver

running mediation

SLURM job detected, using 16 cores (16 tasks, 1 cores per task).

Registering 8 cores and passing 8 cores per tissue to 1 tissue(s).

end_time <- Sys.time()

end_time - start_time

Time difference of 7.897316 mins

Again using our 3 phenotypes mapping to chromosome 15, we’re going to look at the genes within +/- 2Mb of the identified peak location (82953003) and identify the gene(s) whose variation in expression best explain the variation we see in that phenotype. For each phenotype, we will identify the top 5 mediators and plot the percent LOD drop in a heatmap.

chr15_peak_med_rank <- medPlot_single(meds       = med_out$liver,
                                      range      = 2,
                                      position   = chr15_peak_pos,
                                      feats      = chr15_peak_sub$phenotype,
                                      chromosome = 15,
                                      plot       = "per_drop",
                                      psave      = F)

Joining with `by = join_by(target_id, qtl_lod, qtl_chr, mediator, mediator_id,
mediator_chr, mediator_midpoint, LOD, LOD_drop)`
Joining with `by = join_by(target_id, qtl_lod, qtl_chr, mediator, mediator_id,
mediator_chr, mediator_midpoint, LOD, LOD_drop)`
Joining with `by = join_by(target_id, qtl_lod, qtl_chr, mediator, mediator_id,
mediator_chr, mediator_midpoint, LOD, LOD_drop)`

From this plot, it appears that two predicted genes, a predicted lncRNA gene Gm49463 and predicted protein coding gene 5031439G07Rik, are strong candidate mediators for all 3 phenotypes, both corresponding to a high percent LOD drop in the distal lipid QTL peaks. To verify that, we can look at the dataframe that has the target and mediator information, including rank and significance.

## Filter ranked mediation results to candidate mediators
chr15_peak_med_rank$meds_ranked |> 
  dplyr::filter(mediator %in% c("Gm49463", "5031439G07Rik")) |> 
  dplyr::select(target_id, qtl_lod, qtl_chr, mediator, LOD, LOD_drop, padj, per_drop, ranks)

# A tibble: 6 × 9
  target_id         qtl_lod qtl_chr mediator   LOD LOD_drop  padj per_drop ranks
  <chr>               <dbl> <fct>   <chr>    <dbl>    <dbl> <dbl>    <dbl> <dbl>
1 UNK_17.944_965.8…    9.47 15      Gm49463   2.81     6.65     0    0.703     2
2 UNK_17.944_965.8…    9.47 15      5031439…  2.49     6.98     0    0.737     1
3 TG_18.2_18.1_20.…    9.23 15      Gm49463   2.28     6.95     0    0.753     2
4 TG_18.2_18.1_20.…    9.23 15      5031439…  2.17     7.06     0    0.764     1
5 TG_18.1_18.1_22.…   10.0  15      Gm49463   4.67     5.37     0    0.535     2
6 TG_18.1_18.1_22.…   10.0  15      5031439…  4.32     5.73     0    0.570     1

It looks like both genes are the first and second ranked mediators for all 3 phenotypes, with 5031439G07Rik being consistently the top ranked and significant for each one.