Islet Transcriptomics Example • QTLretrievR

Introduction

This is a vignette to demonstrate using QTLretrievR with the Attie DO500 dataset that is available at QTLviewer and published in Nature Communications. We will be using the islet transcriptomics dataset with the mice genotyped using the Neogen GigaMUGA 143k SNP array.

First we will set up our environment by loading QTLretrievR and tidyverse.

Next we will load in our data. Then we will do some cleanup so only the objects we need for QTLretrievR are preserved in the environment.

The markers associated with these genoprobs are from the GigaMUGA grid map, so from the .Rdata file we only need to retain the genoprobs object, since the GigaMUGA gridfile is available through QTLretrievR. From the liver_lipids object, we want to retain: - annot.samples - $data$norm The sample annotations, and the normalized lipid counts.

## read in data
load("../../_data/attie.core.GRCm39.v1.Rdata")
options(timeout = 600)
islet <- readRDS(url("https://churchilllab.jax.org/qtlviewer/attie/DO500HFD/dl?fileName=Islet.RDS"))

## save annotations (metadata) and normalized counts
sample_meta <- as.data.frame(islet$annot.samples)
sample_meta <- sample_meta |> 
  dplyr::select(ID = mouse.id, sex = Sex, ngen = Generation)

norm_counts <- islet$data$norm

## clean up
rm(ensembl_version, K, islet, map, markers)

Running QTLretrievR

Now that we have our data set up and R environment cleaned up, we’re ready to start running QTLretrievR. We already have our genoprobs in the correct format (founder allele probabilities). To see the general overview of how to impute genotype probabilities from GigaMUGA SNP-level calls, please see our overview vignette.

This demo will show running each of the peak calling, effects calculation, and mediation analysis steps separately. If you wish to run all three steps together, you can use the runQTL function. If you run this option, keep in mind that the effects and mediation steps require input of a LOD score threshold, so you should filter again after identifying your suggestive and significant LOD scores.

Mapping & Peak Calling

The following mapping will be performed by passing objects located in the working environment; paths to objects could also be passed if you have them saved as individual rds files. Please note - QTLretrievR is set up to work with data from multiple tissues. This means that even if you are only working with a single tissue, the genoprobs and phenotype matrices (expr_mats) need to be passed in list format with the tissue type. In this example, the normalized lipid counts are set up with the lipid phenotypes as columns and the samples as rows, so we have to transpose the count matrix to pass it to mapQTL. If your counts have already been rankZ transformed and each column is a different phenotype, you can specify rz = T in the submission and it will skip the transformation.

## Perform QTL Mapping

map_peaks <- mapQTL(outdir        = "../../_data/",
                    peaks_out     = "islet_mm39_GM_peaks.rds",
                    map_out       = "islet_mm39_GM_map.rds",
                    genoprobs     = list("islet" = genoprobs),
                    samp_meta     = sample_meta,
                    expr_mats     = list("islet" = t(norm_counts)),
                    covar_factors = c("sex"),
                    gridFile      = gridfile_GM,
                    save          = "so",
                    min_cores     = 8,
                    total_cores   = 32)

When running this, QTLretrievR takes the total number of genes (22180) and breaks them into groups of 1000 genes. It takes the available cores (32) divides that by the number of tissues (1), and then divides that by minimum number of cores (8) to determine how many scan1 processes will run simultaneously for each tissue (4). Based on our benchmarking results (shown below) this takes ~7.75 hours to run mapping and peak calling from start to finish.

LOD Thresholding

Once we have calculated our peaks we need to determine what is considered a significant or suggestive LOD score. Using the LOD_thld function we can parallelize this thresholding and run thousands of permutations on multiple autosomal genes simultaneously. By default LOD_thld runs 1000 permutations on 10 autosomal genes in 2 parallel processes. Through testing we have found that taking the median threshold from multiple genes yields a comparable significance threshold to running permutations on each individual gene to determine the thresholds. Taking multiple genes into account stabilizes the resulting LOD threshold over choosing a single gene and running >1000 permutations on it.

Code

## Load in the map and peaks objects so that they can be used downstream
maps_list <- readRDS("../../_data/islet_mm39_GM_map.rds")
peaks_list <- readRDS("../../_data/islet_mm39_GM_peaks.rds")

islet_thlds <- LOD_thld(mapping = maps_list,
                        tissue  = "islet",
                        annots  = annot_105)

saveRDS(islet_thlds, "../../_data/islet_thlds.rds")

Code

## load in the thresholds for use downstream
islet_thlds <- readRDS("../../_data/islet_thlds.rds")

## save suggestive and significant thresholds to variables

sigTHLD <- islet_thlds$thlds$sigLOD
suggTHLD <- islet_thlds$thlds$suggLOD

Initial Visualization

Now that the significant and suggestive LOD thresholds have been identified we can visualize the eQTL map and look at some individual peaks. The significance threshold used in the original study was 8, using only the islet expression data to calculate our significant and suggestive LOD thresholds we get 7.68 and 6.34 respectively.

plot_eqtlmap(map_dat = maps_list$map_dat2,
             peaks   = peaks_list,
             sigLOD  = round(sigTHLD, 2),
             psave   = F)

$islet

Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).

In addition we can look at the peak associated with Pparg which was one of the genes of interest in this study.

peak_plot(mapping = maps_list,
          tissue  = "islet",
          pheno   = "ENSMUSG00000000440",
          pop     = "do",
          psave   = F)

Working in Linux and there are 40 cores in total.

NULL

peak_plot(mapping    = maps_list,
          tissue     = "islet",
          pheno      = "ENSMUSG00000000440",
          chromosome = 6,
          pop        = "do",
          effects    = T, 
          psave      = F)

Working in Linux and there are 40 cores in total.

calculating effects

NULL

Next Steps

There are a number of next steps that can be taken to follow up on the peak identification and thresholding. These include ascertaining the founder strain alleles responsible for driving the observed QTL effect, and performing mediation analysis to identify potential causal intermediates that are responsible for conferring a QTL.These analyses are described below.

Transband Identification

We can identify the distal eQTL hotspots (transbands) from our significant and suggestive peaks using the transbands function. This uses sliding windows that are 50 markers wide and counts the number of significant and suggestive distal peaks within the windows as they slide across the chromosome. Windows that have the top 0.5% of peak density are selected and then collapsed into continuous bands.

tbands <- transbands(map_dat = maps_list$map_dat2, 
                     peaks   = peaks_list, 
                     sigLOD  = round(sigTHLD, 2), 
                     suggLOD = round(suggTHLD, 2), 
                     psave   = F)

identifying hotspots

generating plots

163819331.5 145579467.5 45590508 69642292 71233450.5 73377066.5 113106029.5 66018865.5 111866654

2 5 7 11 12 13

358833665.5 838197443.5 1039333294 1592508381 1594099539.5 1596243155.5 1757831005.5 1830725007.5 1876572796

tbands$bands.rna$islet

# A tibble: 9 × 8
  seqnames     start       end   width strand distant_rna distant_rna_sugg chr
  <fct>        <int>     <int>   <int> <fct>        <int>            <int> <fct>
1 2        162637473 165001190 2363718 *              139              210 2
2 5        143584435 147574500 3990066 *              125              344 5
3 7         44389033  46791983 2402951 *               71              203 7
4 11        68872949  70411635 1538687 *               32              137 11
5 11        70421517  72045384 1623868 *               69              170 11
6 11        72485997  74268136 1782140 *               36              129 11
7 12       112222853 113989206 1766354 *               28               45 12
8 13        64526964  67510767 2983804 *               24               25 13
9 13       110877549 112855759 1978211 *               67              151 13

tbands$trans_band_plot$islet

Note that there are hotspots on chromosomes 2, 5, 7, 11, 12, and 13.

Founder Strain Effects at a QTL peak

We can determine the founder strain coefficients at different QTL peaks.

## Calculate Effects
effects_list <- qtl_effects(peaks   = peaks_list,
                            mapping = maps_list,
                            suggLOD = islet_thlds$thlds$suggLOD,
                            outdir  = "../../_data/",
                            outfile = "islet_mm39_GM_effects.rds",
                            save    = "so")

Code

effects_list <- readRDS("../../_data/islet_mm39_GM_effects.rds")

Now we will look at the strain effects at the hotspot on chromosome 2.

hs_plots <- hsHapEffects(effects    = effects_list,
                         tbands     = tbands$bands.rna,
                         chromosome = 2,
                         tissue     = "islet", 
                         sigLOD     = 9, 
                         pop        = "do", 
                         vert       = TRUE,
                         showTarg   = FALSE,
                         psave      = FALSE)

hs_plots$ht_plot

Mediation Analysis

There are a couple of ways that we can do mediation, we can mediate within the islet expression data, or we can mediate between the islet expression data and other phenotypic data collected on the same samples.

Within Tissue Mediation

Within tissue mediation is straightforward with the modiFinder function. This builds on the mediation package r/intermediate to take the expression of other genes into account when mapping and peak calling.

meds_list <- modiFinder(peaks   = peaks_list,
                        mapping = maps_list,
                        suggLOD = round(suggTHLD, 2),
                        annots  = annot_105,
                        outdir  = "../../_data/",
                        med_out = "islet_mm39_GM_meds.rds",
                        save    = "so")

Code

meds_list <- readRDS("../../_data/islet_mm39_GM_meds.rds")

Now we can examine how introducing the expression of genes located within a hotspot influences the variation in expression of genes that map back to that same hotspot. Note that this mediation plot is showing the top 5 mediatiors for each target based on the LOD drop.

chr2_hs_med <- medPlot_hotSpot(peaks      = peaks_list$islet, 
                               meds       = meds_list$islet, 
                               tbands     = tbands$bands.rna$islet, 
                               sigLOD     = 9, 
                               chromosome = 2,
                               hsNum      = 1, 
                               top_n      = 5,
                               showTarg   = FALSE,
                               psave      = FALSE, 
                               plot       = "padj")

chr2_hs_med$meds_ranked |> 
  dplyr::filter(ranks <= 1) |> 
  dplyr::group_by(mediator) |> 
  dplyr::count() |> 
  dplyr::arrange(desc(n))

# A tibble: 15 × 2
# Groups:   mediator [15]
   mediator     n
   <chr>    <int>
 1 Hnf4a       77
 2 Mybl2        4
 3 Sgk2         3
 4 Ctsa         2
 5 Gdap1l1      1
 6 L3mbtl1      1
 7 Neurl2       1
 8 Pabpc1l      1
 9 Pkig         1
10 Rims4        1
11 Slc12a5      1
12 Slpi         1
13 Snx21        1
14 Stk4         1
15 Tomm34       1

Visually, it looks as if the top mediator for the targets of this hotspot is Hnf4a, and inspection of the ranked mediation results verifies that. Next we can look at PC1 of the hotspot targets and visualize where Hnf4a is located with respect to that peak.

## Identify targets of chromosome 2 hotspot
chr_2_targets <- unique(chr2_hs_med$meds_ranked$target_id)

## Dataframe of candidate mediator (Hnf4a)
cM <- annot_105 |> dplyr::filter(symbol == "Hnf4a")

## Run plotting/PC analysis
pc_effects <- hsPeakPlot(mapping      = maps_list,
                         feats        = chr_2_targets, 
                         tbands       = tbands$bands.rna,
                         chromosome   = 2,
                         tissue       = "islet", 
                         candidateMed = cM,
                         pc           = TRUE, 
                         wag          = TRUE, 
                         sigLOD       = round(sigTHLD, 2), 
                         psave        = FALSE)

Working in Linux and there are 40 cores in total.

Joining with `by = join_by(marker)`

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the QTLretrievR package.
  Please report the issue to the authors.

Warning in data.frame(..., check.names = FALSE): row names were found from a
short variable and have been discarded

pc_effects$pwork

Warning: Removed 6545 rows containing missing values or values outside the scale range
(`geom_line()`).

Then, we can run mediation using PC1 as the phenotype and taking the expression of the genes local to the hotspot into account.

## Make a mapping object for the PC and limit it to just the kidney mapping info is included
pc_map <- maps_list
pc_map$expr_list <- list("islet" = t(pc_effects$pca_rz))
pc_map$exprZ_list <- list("islet" = pc_effects$pca_rz)
pc_map$qtlprobs <- list("islet" = pc_map$qtlprobs$islet)
pc_map$kinship_loco <- list("islet" = pc_map$kinship_loco$islet)
pc_map$covar_list <- list("islet" = pc_map$covar_list$islet)
pc_map$tissue_samp <- list("islet" = pc_map$tissue_samp$islet)

chr2_meds <- unique(chr2_hs_med$meds_ranked$mediator_id)

pc_med <- multi_modiFinder(peaks   = list("islet" = pc_effects$peaks), 
                           mapping = pc_map, 
                           exprZ   = maps_list$exprZ_list$islet[, chr2_meds], 
                           suggLOD = round(suggTHLD, 2), 
                           annots  = annot_105, 
                           save    = "ro")

load annotations

checking peaks and mapping

data checked

Number of common samples for islet: 363

filtered peaks

islet

running mediation

Working in Linux and there are 40 cores in total.

Registering 8 cores and passing 8 cores per tissue to 1 tissue(s).

pc_med_plot <- medPlot_single(meds       = pc_med$islet,
                              range      = 5, 
                              position   = pc_effects$peaks$peak_bp[which(pc_effects$peaks$peak_chr == 2)], 
                              feats      = "PC1", 
                              chromosome = 2, 
                              psave      = FALSE, 
                              plot       = "per_drop")

Joining with `by = join_by(target_id, qtl_lod, qtl_chr, mediator, mediator_id,
mediator_chr, mediator_midpoint, LOD, LOD_drop)`

pc_med_plot$meds_ranked |> 
  dplyr::filter(ranks <= 5) |> 
  dplyr::arrange(ranks) |> 
  dplyr::select(mediator, ranks)

# A tibble: 5 × 2
  mediator ranks
  <chr>    <dbl>
1 Hnf4a        1
2 Fitm2        2
3 Slc12a5      3
4 Pabpc1l      4
5 Tomm34       5