On this page we detail the quality control (QC) pipeline for the UK Biobank exomes before starting our analyses. Further plots and the underlying code can be found on the SAIGE gene munging github repository.
For our QC pipeline, we first perform a collection of careful QC steps. The
initial step in this process is to read in the .vcf
files,
split multiallelics and realign indels, and calculate a collection of
sample-level statistics.
Our next step (after conversion of the joint called .vcf
file to a hail matrix table) is to remove genotypes based on the
following collection of criteria:
Remove variants that either:
Following this initial curation we perform a series of further QC steps detailed in this repository.
We run the sample_qc function in hail and remove samples according to the following:
Thresholds used were based on plotting the distributions of these metrics. Here we show boxplots with overlaid scatterplots of the above metrics, split by UKBB centre, and coloured by sequencing batch. The threshold for exclusion is shown as a dashed line.
Filter | Samples | % |
---|---|---|
Samples after initial filter | 418,045 | 100.0 |
Sample call rate | 5,537 | 1.3 |
Mean DP | 0 | 0.0 |
Mean GQ | 0 | 0.0 |
Samples after sample QC filters | 412,508 | 98.7 |
Following this step, we export genotyped variants on the X chromosome
(allele frequency between 0.05 to 0.95 with high call rate (> 0.98))
to plink format and prune to pseudo-independent SNPs using
--indep 50 5 2
. This pruned set of SNPs feeds into the sex
imputation step.
We next perform a number of principal component analysis (PCA) steps to label with broad scale ancestry.
We first run PCA on the 1000 genomes samples (minus the small subset of related individuals within 1000 genomes). We then project in the UK Biobank samples, ensuring that we correctly account for shrinkage bias in the projection.
We then train a classifier to assign 1000G population labellings. To do this, we train a random forest on the super populations labels of 1000 genomes and predict the super population for each of the UK Biobank samples. We denote strictly defined European subset as those with probability \(>\) 0.99 of being European according to the classifier. UK Biobank samples are coloured by their assignment or unsure if none of the classifier probabilities exceeded 0.99 in the following plots.
Here is the loose classifier, with no probability restriction on classification:
And here are the resultant classifications after imposing a probability cutoff of 0.99.
Subsequent filters are applied stratified by 1000G population labelling.
We impute the sexes of the individuals with this pruned set of variants on the X chromosome, and create list of samples with incorrect or unknown sex as defined by:
Here we show the distribution of the F-statistic, with the 0.2 and 0.8 thresholds defining our sex impututation shown as a dashed lines.
Label | Before | Removed | After | % |
---|---|---|---|---|
AFR | 6,613 | 12 | 6,601 | 99.8 |
AMR | 496 | 7 | 489 | 98.6 |
EAS | 1,651 | 1 | 1,650 | 99.9 |
EUR | 396,901 | 219 | 396,682 | 99.9 |
SAS | 6,432 | 30 | 6,402 | 99.5 |
Total | 412,508 | 269 | 412,239 | 99.9 |
For our final variant filtering step, we first restrict to samples with strictly defined 1000G labellings, filter out samples with incorrectly defined sex or unknown sex, and run variant QC. Stratifying by 1000G labelling, we then evaluate a collection of variant metrics and remove variants that satisfy at least one of:
The following plots shows the 0.97 threshold for call rate.
Filter | Variants | % |
---|---|---|
Variants after initial filter | 18,992,636 | 100.0 |
Invariant sites after sample filters | 16,364,165 | 86.2 |
Overall variant call rate | 281,328 | 1.5 |
Variants failing HWE filter | 6,614 | 0.0 |
Variants after filters | 2,345,760 | 12.4 |
Filter | Variants | % |
---|---|---|
Variants after initial filter | 18,992,636 | 100.0 |
Invariant sites after sample filters | 18,336,374 | 96.5 |
Overall variant call rate | 78,151 | 0.4 |
Variants failing HWE filter | 1,100 | 0.0 |
Variants after filters | 576,716 | 3.0 |
Filter | Variants | % |
---|---|---|
Variants after initial filter | 18,992,636 | 100.0 |
Invariant sites after sample filters | 18,001,198 | 94.8 |
Overall variant call rate | 107,126 | 0.6 |
Variants failing HWE filter | 2,391 | 0.0 |
Variants after filters | 883,080 | 4.6 |
Filter | Variants | % |
---|---|---|
Variants after initial filter | 18,992,636 | 100.0 |
Invariant sites after sample filters | 3,216,967 | 16.9 |
Overall variant call rate | 1,485,220 | 7.8 |
Variants failing HWE filter | 23,574 | 0.1 |
Variants after filters | 14,284,267 | 75.2 |
Filter | Variants | % |
---|---|---|
Variants after initial filter | 18,992,636 | 100.0 |
Invariant sites after sample filters | 16,671,965 | 87.8 |
Overall variant call rate | 232,199 | 1.2 |
Variants failing HWE filter | 4,939 | 0.0 |
Variants after filters | 2,087,425 | 11.0 |
After these steps we plot the resulting changes in metrics across the samples in our data set. Each of this first set plots splits the data by 1000G label. The first collection of subplots in each figure shows the variant metrics before sample removal, with the lower collection of subplots showing the resultant change after our QC steps.
In this step we remove sample outliers after the variant cleaning in the previous step. Samples are removed if at least one of the following lies more than 4 standard deviations away from the mean (corresponding to a MAD threshold of 5.9304):
Filter | Samples | % |
---|---|---|
Samples after population filters | 6,601 | 100 |
Within batch Ti/Tv ratio outside 5.9304 standard deviations | 0 | 0 |
Within batch Het/HomVar ratio outside 5.9304 standard deviations | 0 | 0 |
Within batch Insertion/Deletion ratio outside 5.9304 standard deviations | 0 | 0 |
n singletons > 20 median absolute deviations | 0 | 0 |
Samples after final sample filters | 6,601 | 100 |
Filter | Samples | % |
---|---|---|
Samples after population filters | 489 | 100 |
Within batch Ti/Tv ratio outside 5.9304 standard deviations | 0 | 0 |
Within batch Het/HomVar ratio outside 5.9304 standard deviations | 0 | 0 |
Within batch Insertion/Deletion ratio outside 5.9304 standard deviations | 0 | 0 |
n singletons > 20 median absolute deviations | 0 | 0 |
Samples after final sample filters | 489 | 100 |
Filter | Samples | % |
---|---|---|
Samples after population filters | 1,650 | 100 |
Within batch Ti/Tv ratio outside 5.9304 standard deviations | 0 | 0 |
Within batch Het/HomVar ratio outside 5.9304 standard deviations | 0 | 0 |
Within batch Insertion/Deletion ratio outside 5.9304 standard deviations | 0 | 0 |
n singletons > 20 median absolute deviations | 0 | 0 |
Samples after final sample filters | 1,650 | 100 |
Filter | Samples | % |
---|---|---|
Samples after population filters | 396,682 | 100.0 |
Within batch Ti/Tv ratio outside 5.9304 standard deviations | 0 | 0.0 |
Within batch Het/HomVar ratio outside 5.9304 standard deviations | 0 | 0.0 |
Within batch Insertion/Deletion ratio outside 5.9304 standard deviations | 0 | 0.0 |
n singletons > 20 median absolute deviations | 277 | 0.1 |
Samples after final sample filters | 396,405 | 99.9 |
Filter | Samples | % |
---|---|---|
Samples after population filters | 6,402 | 100 |
Within batch Ti/Tv ratio outside 5.9304 standard deviations | 0 | 0 |
Within batch Het/HomVar ratio outside 5.9304 standard deviations | 0 | 0 |
Within batch Insertion/Deletion ratio outside 5.9304 standard deviations | 0 | 0 |
n singletons > 20 median absolute deviations | 0 | 0 |
Samples after final sample filters | 6,402 | 100 |
After all of this data cleaning, we save the resultant data for downstream analyses.
The composition of the samples according to 1000G labelling was as follows:
Label | Count |
---|---|
AFR | 6,601 |
AMR | 489 |
EAS | 1,650 |
EUR | 396,405 |
SAS | 6,402 |
Total | 411,547 |