On this page we detail the quality control (QC) pipeline for the UK Biobank exomes before starting our analyses. Further plots and the underlying code can be found on the SAIGE gene munging github repository.

For our QC pipeline, we first perform a collection of careful QC steps. The initial step in this process is to read in the .vcf files, split multiallelics and realign indels, and calculate a collection of sample-level statistics.


Initial sample filtering

  • Filter out samples based on MAD thresholds.


Initial genotype filtering

Our next step (after conversion of the joint called .vcf file to a hail matrix table) is to remove genotypes based on the following collection of criteria:

  • Remove if at least one of the following is true:
    • Genotype quality \(<\) 20
    • Depth \(<\) 10
  • If heterozygous and a SNP:
    • P-value from 1-sided binomial test of alt allele read depth relative to ref+alt read depth \(<\) 10-3
  • If heterozygous and an indel:
    • Alternative allele depth divided by total depth \(<\) 0.3


Initial variant filtering

Remove variants that either:

  • Are invariant after the initial GT filter
  • Fall in a low complexity region
  • Fall outside padded target intervals (50bp padding)
  • Have GATK ExcessHet > 54.69

Following this initial curation we perform a series of further QC steps detailed in this repository.

We run the sample_qc function in hail and remove samples according to the following:

  • Sample call rate \(<\) 0.95
  • Mean depth \(<\) 19.5
  • Mean genotype quality \(<\) 47.8

Thresholds used were based on plotting the distributions of these metrics. Here we show boxplots with overlaid scatterplots of the above metrics, split by UKBB centre, and coloured by sequencing batch. The threshold for exclusion is shown as a dashed line.

Filter Samples %
Samples after initial filter 418,045 100.0
Sample call rate 5,537 1.3
Mean DP 0 0.0
Mean GQ 0 0.0
Samples after sample QC filters 412,508 98.7

Following this step, we export genotyped variants on the X chromosome (allele frequency between 0.05 to 0.95 with high call rate (> 0.98)) to plink format and prune to pseudo-independent SNPs using --indep 50 5 2. This pruned set of SNPs feeds into the sex imputation step.


PCA to assign broad scale ancestry based on 1000 genomes labels

We next perform a number of principal component analysis (PCA) steps to label with broad scale ancestry.

We first run PCA on the 1000 genomes samples (minus the small subset of related individuals within 1000 genomes). We then project in the UK Biobank samples, ensuring that we correctly account for shrinkage bias in the projection.

We then train a classifier to assign 1000G population labellings. To do this, we train a random forest on the super populations labels of 1000 genomes and predict the super population for each of the UK Biobank samples. We denote strictly defined European subset as those with probability \(>\) 0.99 of being European according to the classifier. UK Biobank samples are coloured by their assignment or unsure if none of the classifier probabilities exceeded 0.99 in the following plots.

Here is the loose classifier, with no probability restriction on classification:

Following the Strict Cutoff

And here are the resultant classifications after imposing a probability cutoff of 0.99.

Subsequent filters are applied stratified by 1000G population labelling.

Sex imputation

We impute the sexes of the individuals with this pruned set of variants on the X chromosome, and create list of samples with incorrect or unknown sex as defined by:

  • Sex is unknown in the phenotype files
  • F-statistic \(>\) 0.8 and the sex is female in the phenotype file
  • F-statistic \(<\) 0.2 and the sex is male in the phenotype file
  • F-statistic \(>\) 0.8 and number of calls on the Y is \(<\) 100.
  • F-statistic lies in the interval (0.2, 0.8).

Here we show the distribution of the F-statistic, with the 0.2 and 0.8 thresholds defining our sex impututation shown as a dashed lines.

AFR

AMR

EAS

EUR

SAS

Label Before Removed After %
AFR 6,613 12 6,601 99.8
AMR 496 7 489 98.6
EAS 1,651 1 1,650 99.9
EUR 396,901 219 396,682 99.9
SAS 6,432 30 6,402 99.5
Total 412,508 269 412,239 99.9


Final variant filtering

For our final variant filtering step, we first restrict to samples with strictly defined 1000G labellings, filter out samples with incorrectly defined sex or unknown sex, and run variant QC. Stratifying by 1000G labelling, we then evaluate a collection of variant metrics and remove variants that satisfy at least one of:

  • Invariant site in cleaned sample subset
  • Call rate \(<\) 0.97
  • P-HWE \(<\) 10-6

The following plots shows the 0.97 threshold for call rate.

AFR

Filter Variants %
Variants after initial filter 18,992,636 100.0
Invariant sites after sample filters 16,364,165 86.2
Overall variant call rate 281,328 1.5
Variants failing HWE filter 6,614 0.0
Variants after filters 2,345,760 12.4

AMR

Filter Variants %
Variants after initial filter 18,992,636 100.0
Invariant sites after sample filters 18,336,374 96.5
Overall variant call rate 78,151 0.4
Variants failing HWE filter 1,100 0.0
Variants after filters 576,716 3.0

EAS

Filter Variants %
Variants after initial filter 18,992,636 100.0
Invariant sites after sample filters 18,001,198 94.8
Overall variant call rate 107,126 0.6
Variants failing HWE filter 2,391 0.0
Variants after filters 883,080 4.6

EUR

Filter Variants %
Variants after initial filter 18,992,636 100.0
Invariant sites after sample filters 3,216,967 16.9
Overall variant call rate 1,485,220 7.8
Variants failing HWE filter 23,574 0.1
Variants after filters 14,284,267 75.2

SAS

Filter Variants %
Variants after initial filter 18,992,636 100.0
Invariant sites after sample filters 16,671,965 87.8
Overall variant call rate 232,199 1.2
Variants failing HWE filter 4,939 0.0
Variants after filters 2,087,425 11.0

After these steps we plot the resulting changes in metrics across the samples in our data set. Each of this first set plots splits the data by 1000G label. The first collection of subplots in each figure shows the variant metrics before sample removal, with the lower collection of subplots showing the resultant change after our QC steps.

Number of Singletons

Ratio of Heterozygous to Homozygous Variants


Final sample filtering

In this step we remove sample outliers after the variant cleaning in the previous step. Samples are removed if at least one of the following lies more than 4 standard deviations away from the mean (corresponding to a MAD threshold of 5.9304):

  • Ratio of transitions to transversions
  • Ratio of heterozygous to homozygous variant
  • Ratio of insertions to deletions
Filter Samples %
Samples after population filters 6,601 100
Within batch Ti/Tv ratio outside 5.9304 standard deviations 0 0
Within batch Het/HomVar ratio outside 5.9304 standard deviations 0 0
Within batch Insertion/Deletion ratio outside 5.9304 standard deviations 0 0
n singletons > 20 median absolute deviations 0 0
Samples after final sample filters 6,601 100

AMR

Filter Samples %
Samples after population filters 489 100
Within batch Ti/Tv ratio outside 5.9304 standard deviations 0 0
Within batch Het/HomVar ratio outside 5.9304 standard deviations 0 0
Within batch Insertion/Deletion ratio outside 5.9304 standard deviations 0 0
n singletons > 20 median absolute deviations 0 0
Samples after final sample filters 489 100

EAS

Filter Samples %
Samples after population filters 1,650 100
Within batch Ti/Tv ratio outside 5.9304 standard deviations 0 0
Within batch Het/HomVar ratio outside 5.9304 standard deviations 0 0
Within batch Insertion/Deletion ratio outside 5.9304 standard deviations 0 0
n singletons > 20 median absolute deviations 0 0
Samples after final sample filters 1,650 100

EUR

Filter Samples %
Samples after population filters 396,682 100.0
Within batch Ti/Tv ratio outside 5.9304 standard deviations 0 0.0
Within batch Het/HomVar ratio outside 5.9304 standard deviations 0 0.0
Within batch Insertion/Deletion ratio outside 5.9304 standard deviations 0 0.0
n singletons > 20 median absolute deviations 277 0.1
Samples after final sample filters 396,405 99.9

SAS

Filter Samples %
Samples after population filters 6,402 100
Within batch Ti/Tv ratio outside 5.9304 standard deviations 0 0
Within batch Het/HomVar ratio outside 5.9304 standard deviations 0 0
Within batch Insertion/Deletion ratio outside 5.9304 standard deviations 0 0
n singletons > 20 median absolute deviations 0 0
Samples after final sample filters 6,402 100


Final Sample Counts

After all of this data cleaning, we save the resultant data for downstream analyses.

The composition of the samples according to 1000G labelling was as follows:


Label Count
AFR 6,601
AMR 489
EAS 1,650
EUR 396,405
SAS 6,402
Total 411,547