library(QTLretrievR)
QTL Identification and Analysis with QTLretrievR
Advances in molecular phenotyping technologies, including quantitative transcriptomics and proteomics (“-omics”), have transformed biomedical research over the past two decades by providing a granular understanding of the gene regulatory networks that drive disease. Many studies have demonstrated the additional power of combining -omics profiling with genetic mapping, an integrative approach referred to as “systems genetics”, to link population variability in complex disease phenotypes back to genetic variants and their direct effects on gene regulation (e.g. gene expression). However, these quantitative trait locus (QTL) analyses are often computationally intensive and intimidating to those without a strong computational background — especially when samples originate from multiple cell/tissue types or involve thousands of molecular phenotypes. We have developed a new R package, QTLretrievR, that aims to simplify and streamline the identification and downstream analyses of molecular QTL from genetically diverse populations such as Diversity Outbred or Collaborative Cross mice. QTLretrievR leverages existing R packages used in complex trait analysis, including the popular qtl2
package for QTL peak calling and haplotype effect identification and the intermediate
package for peak mediation, and employs dynamic nested parallelism to enhance the efficiency of each analysis step. Future developments will enhance the visualization functionality and data analysis options, including adding network inference methods and expanding the software to work with other mapping populations.
Quick start
This is the easiest way to run QTLretrievR. There are several input files required for runQTL
as described in detail below. This function will return the list of all the objects used in QTL mapping, annotated QTL peaks, QTL effects, and mediation results. Note in the example below that the objects will be only be returned, not saved. If you wish to have the objects saved for later use you can change save
to “so” (save only), or “sr” (save and return).
## Passing a Genoprobs object - list of probabilities for each tissue
## Using internal data for expr, gridFile, metadata, annots
## Unevaluated Code Chunk
<- runQTL(outdir = "../../vignette/",
qtl_res gbrs_fileLoc = list("mESC" = demo_probs),
geno_out = "mESC_mm10_probs.rds",
peaks_out = "mESC_mm10_peaks.rds",
map_out = "mESC_mm10_map.rds",
med_out = "mESC_mm10_mediation.rds",
effects_out = "mESC_mm10_effects.rds",
expr_mats = list("mESC" = demo_counts),
tissues = "mESC",
gridFile = gridfile69k,
covar_factors = c("sex"),
metadata = demo_meta,
annots = demo_annot,
save_t = "ro")
This can be alternatively submitted as follows:
## Unevaluated Code Chunk
<- runQTL(outdir = <path/to/output/dir>,
qtl_res gbrs_fileLoc = <path/to/gbrs/files>,
geno_out = "Name_for_genotype_prob_file.rds",
peaks_out = "Name_for_QTL_peaks.rds",
map_out = "Name_for_mapping_objects.rds",
med_out = "Name_for_mediation_Results.rds",
effects_out = "Name_for_QTL_effects.rds",
expr_mats = c(<path/to/expr_mat1>, <path/to/expr_mat2>),
tissues = "Tissue_name",
gridFile = <path/to/gridfile>,
covar_factors = c("Covariates"),
metadata = <path/to/sample/metadata>,
annots = <path/to/annotations>,
save = "ro"
)
This returns the mapping object, original peaks object, mediation object, and effects object. The peaks, effects and mediation objects can be used for additional plotting and analysis.
Inputs Data
There are several types of data that QTLretrievR
takes as input, below we describe variations on the required and optional data inputs.
Genotype Probabilities
Genotype Probabilities (genoprobs) can be passed to QTLretreivR
in a few different ways.
Passed as a string to the location of the .tsv probability files from the GBRS software. This can be the folder where all the GBRS results are kept, the files should end with
.gbrs.interpolated.genoprobs.tsv
. This option would be used withgenoprobably
andrunQTL
. This is probably the easiest option if you are starting from scratch and only have GBRS genotypes.Passed as a two dimensional array of the genotype probabilities for all the samples. See the
genotype_probs.Rds
file here for an example of how this should look. This option is for passing torunQTL
.Passed as a
qtl2
formatted genoprobs object. This option is for passing torunQTL
andmapQTL
. An example of this is shown below.
## Structure and example of qtl2 fomatted genoprobs
names(demo_probs)
#> [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
#> [16] "16" "17" "18" "19" "X"
str(demo_probs[[1]])
#> num [1:10, 1:8, 1:4711] 1.48e-08 1.23e-08 1.04e-08 3.31e-08 2.94e-07 ...
#> - attr(*, "dimnames")=List of 3
#> ..$ : chr [1:10] "PB357.02_repA" "PB357.12_repA" "PB357.16_repA" "PB357.18_repA" ...
#> ..$ : chr [1:8] "A" "B" "C" "D" ...
#> ..$ : chr [1:4711] "1_3000000" "1_3041392" "1_3346528" "1_3651663" ...
1]][1:2,,1:2]
demo_probs[[#> , , 1_3000000
#>
#> A B C D E
#> PB357.02_repA 1.477899e-08 0.5 1.217364e-08 1.370868e-08 1.199720e-08
#> PB357.12_repA 1.232415e-08 0.5 5.000000e-01 1.604246e-08 3.499095e-09
#> F G H
#> PB357.02_repA 5.000000e-01 4.331041e-09 1.313560e-08
#> PB357.12_repA 1.478095e-08 4.742920e-09 1.118048e-08
#>
#> , , 1_3041392
#>
#> A B C D E
#> PB357.02_repA 1.477899e-08 0.5 1.217364e-08 1.370868e-08 1.199720e-08
#> PB357.12_repA 1.232415e-08 0.5 5.000000e-01 1.604246e-08 3.499095e-09
#> F G H
#> PB357.02_repA 5.000000e-01 4.331041e-09 1.313560e-08
#> PB357.12_repA 1.478095e-08 4.742920e-09 1.118048e-08
If you have genotyping from one of the Mouse Universal Genotyping Arrays (MUGA), then you can format your probabilities using our mugaprobs
function, a detailed example is below in the Genoprobs section of this document.
Quantified Phenotypes
Although the variable is named expr_mats
any quantified molecular data can be passed to this variable. They can be passed as a vector of paths to the data (in the same order as tissues are being passed), or as a named list of matrices (named with the tissue type associated with the counts). These should be filtered, batch corrected and normalized data with the phenotype in rows and the samples in columns, there is no need to rankZ transform your counts, that step occurs in mapQTL
. See the attached demo_counts
for an example of formatting.
## Information and partial example of quantified data
class(demo_counts)
#> [1] "matrix" "array"
head(demo_counts)
#> PB357.02_repA PB357.12_repA PB357.16_repA PB357.18_repA
#> ENSMUSG00000030402 90.27620 47.49881 65.92215 60.68472
#> ENSMUSG00000025473 55.33058 61.24847 59.68627 26.61611
#> ENSMUSG00000029246 2333.74400 5135.62454 2341.10901 2701.05069
#> ENSMUSG00000051413 218.41017 162.49593 209.34736 223.57530
#> ENSMUSG00000026356 3244.11907 11302.21694 2892.56671 4801.55973
#> ENSMUSG00000042505 317.42278 554.98610 316.24815 300.22969
#> PB357.19_repA PB357.20_repA PB357.21_repA PB357.22_repA
#> ENSMUSG00000030402 50.6948 43.98085 68.98128 57.27735
#> ENSMUSG00000025473 24.3725 45.02802 55.41119 37.55891
#> ENSMUSG00000029246 3481.3853 2611.54250 1373.65252 3451.91477
#> ENSMUSG00000051413 230.0764 158.12164 211.46720 217.84171
#> ENSMUSG00000026356 5616.3992 4653.59323 2312.56906 5754.02576
#> ENSMUSG00000042505 348.0393 347.65818 360.73816 294.83748
#> PB357.28_repA PB357.29_repA
#> ENSMUSG00000030402 42.11575 52.36949
#> ENSMUSG00000025473 14.03858 44.38093
#> ENSMUSG00000029246 3130.74325 2534.28661
#> ENSMUSG00000051413 161.91166 260.95984
#> ENSMUSG00000026356 6941.61141 4452.29448
#> ENSMUSG00000042505 440.81151 286.70078
Marker Grid
This should be a grid of markers and the associated genomic locations in bp and cM. The grid file can be passed as an object, or as a path to where the file is located. There are two grid files attached to QTLretrievR
, a grid with ~75K pseudo-markers with positions updated for mm38 and a grid with ~69K pseudo-markers with positions associated with mm10, and the GigaMUGA and MegaMUGA markers and positions. Or you can use a new marker grid to be used with your probabilities. If making your own grid file, see gridfile
to make sure you have the correct format.
## example format for 75K marker grid
head(gridfile)
#> chr pos cM
#> 1_3000000 1 3000000 0.00001
#> 1_3039563 1 3039563 0.02001
#> 1_3079126 1 3079126 0.04001
#> 1_3118689 1 3118689 0.06001
#> 1_3158252 1 3158252 0.08001
#> 1_3197814 1 3197814 0.10001
Metadata
At a minimum, the metadata file should contain a column for sample IDs (ID
) and columns for any covariates to be included in the peak and effect calling. Any other pieces of information that may be relevant to your analysis that you want to make sure is all in the same place should also be included. A minimal example, demo_meta
, is attached to QTLretrievR
.
## minimal metadata example
head(demo_meta[which(demo_meta$ID %in% colnames(demo_counts)),])
#> ID sex
#> 1 PB357.02_repA F
#> 2 PB357.12_repA F
#> 3 PB357.16_repA F
#> 4 PB357.18_repA F
#> 5 PB357.19_repA F
#> 6 PB357.20_repA F
Annotations
Currently the annotations needs to contain at a minimum id
, symbol
, chr
, start
, and end
columns. The id
column should contain the phenotype information that appears in the row names of the quantified data. In the attached example (demo_annot
) the id column contains ensembl gene IDs, these could be traded out for ensembl protein IDs or any other phenotype identifier. The demo_annot
file is the Ensemblv84 biomart annotations, if you are using newer expression data use annot_105
or upload your own annotations
## Example annotation file - Note this file is all the genes for the Ensemblv84 build
head(demo_annot)
#> id symbol chr start end
#> 1 ENSMUSG00000000001 Gnai3 3 108107280 108146146
#> 2 ENSMUSG00000000028 Cdc45 16 18780447 18811987
#> 3 ENSMUSG00000000031 H19 7 142575529 142578143
#> 4 ENSMUSG00000000037 Scml2 X 161117193 161258213
#> 5 ENSMUSG00000000056 Narf 11 121237253 121255856
#> 6 ENSMUSG00000000058 Cav2 6 17281185 17289115
Running as Separate Steps
If, instead of using the wrapper to run all steps at once, you wish to run each step individually (calculating/formatting the genotype probabilities, mapping/peak calling, mediation, effects) this can be done with relative ease.
Genoprobs
In order to convert the genoprobs from the GBRS output to the qtl2
format we run a function called genoprobably
. It pulls in the relevant files from the provided location and splits samples into the tissues they are derived from (the names that you pass to tissue
should be included in the file name in some form), then puts these into a 3 dimensional format based on the markers. Once the genotype probabilities are in the 3D format probs_3d_to_qtl2
is used to convert them to the qtl2 format. The below example shows how to run genoprobably with the 75k gridfile.
## Unevaluated Code Chunk
<- genoprobably(outfile = "../../vignette/gbrs_interpolated_probs.rds",
probs gbrsFileLoc = "<path/to/gbrs/results>",
tissues = c("tissue1","tissue2"),
gridFile = gridfile,
save = "ro")
Alternatively you can use genotyping from either GigaMUGA or MegaMUGA as your genome probabilities. In order to convert these reports to the qtl2
format we run a function called mugaprobs
. It loads the appropriate grid files, builds the control file, and processes the genotyping file(s) into the correct allele based format for processing. The below example shows how to run mugaprobs
from a sample GigaMUGA genotyping run.
## Unevaluated Code Chunk
<- mugaprobs(type = "GM", # Did your genotyping come from GigaMUGA or MegaMUGA?
probs covarLoc = "/path/to/covariate/file/",
covar_file = "name_of_covariate_file.tsv",
i.files = c("final_report_1", "final_report_2"), # If you already have chromosome specific genotype files you can pass that directory as a string here
genoPrefix = "gm4qtl2",
probsOut = "muga_interpolated_genoprobs.rds", # This is the file that your probabilites will be saved to
tissues = c("tissue1", "tissue2")) # All probabilities will be attributed to each tissue, this is just to put it in the format needed for `mapQTL`
Mapping and Peak Calling
The mapping and peak calling steps are combined in our pipeline. We use the genoprobs to calculate kinship, and create the genetic and pysical maps. Then using the quantified data and sample metadata we rankZ normalize the counts, calculate covariates and finally do the peak calling using qtl2::scan1
.
## Unevaluated Code Chunk
## Passing objects:
<- mapQTL(outdir = "../../vignette/",
map_peaks peaks_out = "mm39_peaks.rds",
map_out = "mm39_map.rds",
genoprobs = probs,
samp_meta = demo_meta, # Make sure that your ID column is named "ID"
expr_mats = list("mESC" = demo_counts), # Pass your counts as a named list
covar_factors = c("sex"),
gridFile = gridfile,
annots = demo_annot, # See the Annotations section above for notes on this
save = "ro") # return only - other options are "so" (save only) and "sr" (save and return)
## Passing file locations
<- mapQTL(outdir = "../../vignette/",
map_peaks peaks_out = "mm39_peaks.rds",
map_out = "mm39_map.rds",
genoprobs = "path/to/saved/probs.rds",
samp_meta = "path/to/sample/metadata",
expr_mats = c("path/to/counts1","path/to/coounts/2"),
covar_factors = c("sex"),
gridFile = "path/to/custom/grid",
annots = "path/to/annotations",
save = "ro")
Mediation
For mediation we use the peaks, mapping information (which includes the rankZ normalized counts), and annotations to identify targets which in turn allows us to identify phenotypes that act as mediators for peaks above a suggestive LOD. This is done using intermediate::mediation.scan
.
## Unevaluated Code Chunk
<- run_mediate(peaks = map_peaks$peaks_list,
med_res mapping = map_peaks$maps_list,
outdir = "../../vignette",
annots = demo_annot,
suggLOD = 7,
med_out = "mm39_mediation.rds",
save = "ro")
Effects
To identify the founder effects at peaks of interest, we use the peaks, and mapping information along with a suggestive LOD threshold to ensure that the reported effects are for relevant peaks only. This is done using qtl2::scan1blup
.
## Unevaluated Code Chunk
<- qtl_effects(mapping = map_peaks$maps_list,
effects peaks = map_peaks$peaks_list,
suggLOD = 7,
outdir = "../../vignette",
outfile = "mm39_effects.rds",
save = "ro")
Submitting to a Job Scheduler
Currently, QTLretrievR
is set up to identify available cores in a linux, mac, or windows environment. It is also able to identify the number of cores if you are submitting a job through a SLURM scheduler. Just make sure to include both of the following lines in your SBATCH
:
#SBATCH --ntasks=N
#SBATCH --cpus-per-task=X
When the number of cores is calculated for parallelization purposes, the total number of cores that are identified are N*X
. If you don’t include both, and are submitting with a SLURM scheduler, it may not identify the correct number of cores.
If you are using any other scheduler, make sure that you specify the total number of cpus for the whole job. If applicable the number of processes should be 1.