Research
Synthetic datasets for enabling genetics-based precision medicine
Unlocking the power of AI for precision medicine requires large and diverse biomedical datasets. As part of the Europe-wide INTERVENE consortium, collaborators and I developed a new approach to create synthetic datasets for genotypes and phenotypes.
The INTERVENE consortium is harmonizing data on more than 1.7 million genomes to develop AI-based integrative risk scores for the next generation of predictive and personalised medicine. The growing availability of larger and more diverse genomics datasets creates a need for new computational methods to interpret them. However, sharing sensitive patient data across national biobanks raises privacy concerns, and synthetic datasets are needed to develop and test federated computing infrastructure.
Synthetic data provides a mechanism to generate large and diverse genomics datasets that preserve key statistical properties of real data without being traceable back to real individuals. However, existing methods for simulating genetics datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for genetic risk scoring were lacking.
In a recent Bioinformatics publication, we present HAPNEST, a new approach for efficiently generating and evaluating diverse, individual-level synthetic data for genotypes and phenotypes. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to create a new synthetic data resource of 6.8 million common variants and nine phenotypes for over 1 million individuals. This resource was generated from publicly available reference datasets, and therefore is also publicly available to download and utilise in your work.
To create high-fidelity synthetic datasets that preserve key statistical properties of real genotypes, a resampling approach constructs synthetic genotypes as a series of segments imperfectly copied from a reference panel of real genotypes. To limit overfitting to the reference, we introduce variability by assigning each segment a coalescent age, which determines the genetic distance between cross-over events and the rate of age-based mutations. Likelihood-free inference techniques are used to model the unknown parameters in the synthetic data model for different population groups. Phenotypes are simulated using a statistical model of complex and polygenic diseases, where users can customise the heritability, polygenicity, population-specific effects, and the genetic correlation and pleiotropy models for multi-trait simulation.
This is implemented as a multi-threaded software program for computational efficiency when generating large synthetic datasets. To evaluate the synthetic data quality, HAPNEST provides an evaluation pipeline for fidelity and generalisability metrics.
Publication details and other resources
See below for links to the full publication, data and software resources. The HAPNEST open-source software program is available at https://github.com/intervene-EU-H2020/synthetic_data. A synthetic data resource of 6.8 million common genetic variants and 9 phenotypes for over 1 million individuals is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The final paper was published in the journal Bioinformatics, and an earlier version of the work was also presented at the NeurIPS Workshop on Synthetic Data for Empowering ML Research.
Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke J O’Connor, Samuel Kaski, Pekka Marttinen, Pier Francesco Palamara, Christoph Lippert, Andrea Ganna, HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes, Bioinformatics, Volume 39, Issue 9, September 2023, https://doi.org/10.1093/bioinformatics/btad535
Tutorial
This tutorial covers how to use the HAPNEST software to generate and evaluate synthetic datasets. If you are more interested in downloading a pre-simulated dataset, please check out our synthetic data resource.
Setup
Prerequisites
To use the HAPNEST software, you will first need either Docker or Singularity installed. Containers provide a simple and portable way to use the HAPNEST software across different platforms, by including all libraries needed to run the application. Docker is the most widely used container engine, but Singularity does not require root privileges and is therefore commonly used on HPC clusters.
Downloading the container
If using Docker:
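For example, a minimal sketch of the pull step (the image name below is a placeholder; the exact registry location is given in the GitHub repository):

```bash
# Pull the HAPNEST container image (image name is illustrative, not the official location)
docker pull <registry>/hapnest:latest
```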
If using Singularity:
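Similarly, with Singularity you can build a local .sif image from the same container (again, the image name is a placeholder):

```bash
# Build a local Singularity image file from the container registry (image name is illustrative)
singularity pull containers/hapnest.sif docker://<registry>/hapnest:latest
```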
Initialising software dependencies and downloading reference data
To initialise the software dependencies and download the reference dataset, run the init command followed by the fetch command. The data inputs and outputs are bound to the data directory (or a data volume, if using Docker).
If using Docker:
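As a sketch, assuming the placeholder image name from above, a data volume mounted at /data, and that the container accepts the command name as its argument:

```bash
# Create a named volume for data inputs/outputs, then run init and fetch once
docker volume create hapnest-data
docker run -v hapnest-data:/data <registry>/hapnest:latest init
docker run -v hapnest-data:/data <registry>/hapnest:latest fetch
```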
If using Singularity:
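The Singularity equivalent, binding the local data directory into the container (the mount point and argument style are assumptions):

```bash
# Bind the local data directory and run init, then fetch
singularity run --bind data/:/data/ containers/hapnest.sif init
singularity run --bind data/:/data/ containers/hapnest.sif fetch
```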
First, we recommend setting up the directory structure shown below, with a containers directory and a data directory. The commands can then be run from the root directory.
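Something like the following layout (file names are illustrative):

```
.
├── containers/    # container image(s), e.g. hapnest.sif
└── data/          # reference data, configuration files, and generated outputs
```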
Please note that the data download may take a while. The init and fetch commands only need to be run once, the first time you use the HAPNEST software.
Generating synthetic data
HAPNEST consists of a modular pipeline with separate commands for each of the main functionalities. Given a YAML configuration file, genotypes can be generated using the generate_geno command (optionally, the optimise command can be run first to select optimal hyperparameters). After the genotypes are generated, the corresponding phenotypes can be generated using the generate_pheno command. Finally, data quality evaluation can be carried out using the validate command (however, please note that this operation can be slow for large synthetic datasets).
Example workflow 1 - Generate genotype and phenotype data, and then evaluate the data quality
This workflow takes a config.yaml file as input, generates genotype and phenotype data, and then evaluates the data quality. More details on the config.yaml file are given in the next section.
If using Docker:
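Following the same conventions as before (placeholder image name, data volume mounted at /data, and passing the config path as an argument, which is an assumption):

```bash
# Generate genotypes, then phenotypes, then run the evaluation pipeline
docker run -v hapnest-data:/data <registry>/hapnest:latest generate_geno /data/config.yaml
docker run -v hapnest-data:/data <registry>/hapnest:latest generate_pheno /data/config.yaml
docker run -v hapnest-data:/data <registry>/hapnest:latest validate /data/config.yaml
```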
If using Singularity:
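And the Singularity equivalent, under the same assumptions:

```bash
# Generate genotypes, then phenotypes, then run the evaluation pipeline
singularity run --bind data/:/data/ containers/hapnest.sif generate_geno /data/config.yaml
singularity run --bind data/:/data/ containers/hapnest.sif generate_pheno /data/config.yaml
singularity run --bind data/:/data/ containers/hapnest.sif validate /data/config.yaml
```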
Example workflow 2 - Optimise parameters for genotype generation and then create a genotype dataset
Alternatively, you can first run the optimisation workflow to determine optimal hyperparameter values, and then use those parameters in the config.yaml file to generate the genotypes.
If using Docker:
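A sketch with the same placeholder conventions as above:

```bash
# Estimate rho and Ne for the reference panel, then generate genotypes
docker run -v hapnest-data:/data <registry>/hapnest:latest optimise /data/config.yaml
# ...copy the optimised rho and Ne values into config.yaml, then:
docker run -v hapnest-data:/data <registry>/hapnest:latest generate_geno /data/config.yaml
```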
If using Singularity:
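And the Singularity equivalent, under the same assumptions:

```bash
# Estimate rho and Ne for the reference panel, then generate genotypes
singularity run --bind data/:/data/ containers/hapnest.sif optimise /data/config.yaml
# ...copy the optimised rho and Ne values into config.yaml, then:
singularity run --bind data/:/data/ containers/hapnest.sif generate_geno /data/config.yaml
```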
Creating a configuration file
The configuration file is a YAML file that allows you to configure global parameters, filepaths, genotype data parameters, and phenotype data parameters, as well as options for the evaluation pipeline and optimisation procedure.
Configuring global parameters and filepaths
- `chromosome`: Setting this to `chromosome: all` generates data for each chromosome sequentially. It is instead recommended to use specific numbers (from 1 to 22) and a distributed computing setup to generate data more efficiently.
- `superpopulation`: Local settings override the global settings, so if you want to customise the population settings further you can do this in the genotype data settings. Otherwise, use the global setting to select one superpopulation group, or specify `superpopulation: none` to generate all 6 groups in equal ratios.
- `filepaths`: Most filepaths can be left as the default values if you are using the provided reference panels. Set an output path `output_dir` and prefix `output_prefix` for the generated dataset.
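Putting these together, a minimal sketch of the corresponding configuration might look like the following (the exact key nesting is an assumption; the parameter names are those listed above):

```yaml
global_parameters:
  chromosome: 21          # or "all" to generate every chromosome sequentially
  superpopulation: EUR    # or "none" to generate all 6 groups in equal ratios
filepaths:
  output_dir: data/outputs
  output_prefix: my_synthetic_dataset
```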
Configuring genotype data settings
- If `use_default: true`, the default population groups are determined by the `superpopulation` setting in the global parameters. Otherwise, you can add `custom` groups (see below for examples).
- Specify `rho` (recombination rate) and `Ne` (effective population size) for each of the 6 groups. The `optimise` command can be used to determine these values automatically for the reference panel.
- When setting `nsamples`, keep in mind that `superpopulation: none` generates ~`nsamples`/6 samples for each of the 6 groups, whereas if a specific group is set (e.g., `superpopulation: AFR`) then all `nsamples` are generated for that group.
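A sketch of these settings in the configuration file (the key layout and numeric values are illustrative, not defaults):

```yaml
genotype_data:
  nsamples: 600        # with superpopulation: none this gives ~100 samples per group
  use_default: true    # population groups follow the global superpopulation setting
  rho: 0.00008         # recombination rate; in practice specified for each of the 6 groups
  Ne: 18000            # effective population size; in practice specified for each of the 6 groups
```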
Custom populations - example 1: 100 genotypes with EUR ancestry and 200 genotypes with AFR ancestry
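A sketch of how this could be expressed (the custom group syntax is an assumption based on the description above):

```yaml
genotype_data:
  use_default: false
  custom:
    - superpopulation: EUR
      nsamples: 100
    - superpopulation: AFR
      nsamples: 200
```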
Custom populations - example 2: 100 genotypes, where each genotype has 50% of segments sampled from EUR reference samples and 50% of segments sampled from AFR reference samples. Please note that these are not true admixed genotypes, because the algorithm does not account for the timing of the admixture events.
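A corresponding sketch (again, the exact syntax for mixing reference groups is an assumption):

```yaml
genotype_data:
  use_default: false
  custom:
    - superpopulation: [EUR, AFR]   # segments sampled from both reference groups
      proportions: [0.5, 0.5]       # hypothetical key for the 50/50 segment split
      nsamples: 100
```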
Configuring optimisation settings
The optional optimisation procedure for determining values for `rho` and `Ne` uses likelihood-free inference.

Simulation-based approach:

- Define `prior` distributions for the unknown parameters (upper and lower bounds of uniform distributions).
- Specify the simulation parameters in `simulation_rejection_ABC` and the acceptance threshold `threshold`.
- Choose one or both `summary_statistics` (setting both statistics to `true` will jointly optimise for both objectives).
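As a sketch, the simulation-based settings might look like this (the key layout, summary statistic names and numeric values are illustrative):

```yaml
optimisation:
  prior:
    rho: [0.00001, 0.001]     # lower and upper bounds of a uniform prior
    Ne: [1000, 50000]
  simulation_rejection_ABC:
    n_simulations: 500        # hypothetical key for the number of simulations
    threshold: 0.1            # acceptance threshold
  summary_statistics:
    ld_decay: true            # placeholder metric names; set one or both to true
    maf_distribution: true
```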
An alternative is the emulation-based approach, which learns a surrogate model from `n_design_points` parameter settings to estimate the summary statistic values for different parameter values. This is useful for computationally expensive simulations of large synthetic datasets.
Emulation-based approach:
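A minimal sketch of the corresponding settings (only `n_design_points` comes from the text above; the layout is an assumption):

```yaml
optimisation:
  emulation:
    n_design_points: 50   # number of parameter settings used to train the surrogate model
```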
To help with interpreting the results from either approach, we provide a Jupyter notebook in the GitHub repository. The image below shows example results from simulation-based rejection sampling for 500 iterations with an acceptance threshold of 0.1. You can use the means of the marginal distributions for `rho` and `Ne` as the values in the `config.yaml` file.
Configuring phenotype data settings
- `nPopulation` specifies the number of ancestry groups (6 by default).
- `nTrait` is the number of phenotypes you want to simulate.
- `Prevalence` is the disease prevalence in each of the `nPopulation` populations.
- `a`, `b` and `c` are parameters reflecting the strength of negative selection on each aspect of the genetic effect model (MAF, LD, functional annotation).
- `nComponent` is the number of Gaussian mixture components and `CompWeight` are the weights for each component. Setting `nComponent > 1` allows for a more general model where causal SNP effects are stratified into multiple levels, such as small, medium and large effects.
- `ProportionGeno` is the observed causal SNP heritability in each population and trait.
- For causal variants, either supply a list of causal variants in `filepaths`, or set `UseCausalList: false` and set the `Polygenicity` and `Pleiotropy` values (if `nTrait > 1`).
- There are additional settings for:
  - `ProportionCovar`: the observed proportion of variance contributed by the covariate for each population and trait.
  - `TraitCorr`: the trait correlation matrix (`nTrait x nTrait` dimensions, must be symmetric positive definite).
  - `PopulationCorr`: the population genetic correlation matrix (`nPopulation x nPopulation` dimensions, must be symmetric positive definite).
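A sketch of a two-trait phenotype section combining these parameters (the key layout and all numeric values are illustrative, not recommended defaults):

```yaml
phenotype_data:
  nPopulation: 6
  nTrait: 2
  Prevalence: [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]   # one value per population
  a: 0.4
  b: 0.6
  c: 0.5
  nComponent: 1
  CompWeight: [1.0]
  ProportionGeno: 0.4        # causal SNP heritability (per population and trait in practice)
  ProportionCovar: 0.1
  UseCausalList: false
  Polygenicity: 0.01
  Pleiotropy: 0.9            # only relevant when nTrait > 1
  TraitCorr: [[1.0, 0.3], [0.3, 1.0]]   # nTrait x nTrait, symmetric positive definite
  # PopulationCorr: an nPopulation x nPopulation matrix, omitted here for brevity
```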
Configuring evaluation settings
Specify which metrics you want to compute, by setting the value to true
or false
. Please note that some metrics are computationally expensive for large datasets, so you may want to set these to false
if you don't need them (e.g., kinship
metrics can be slow because they need to compute relatedness between all genotypes).
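For example (apart from `kinship`, the metric names below are placeholders for the flags listed in the default configuration file):

```yaml
evaluation:
  summary_statistics: true   # placeholder flag
  ld_structure: true         # placeholder flag
  kinship: false             # skip the slow all-pairs relatedness computation
```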
Large-scale synthetic data generation (HPC cluster)
We used a distributed setup to generate data for 6.8 million common variants and nine phenotypes for over 1 million individuals in less than 12 hours. Our HPC cluster uses the Slurm scheduler, and the main steps are as follows:
1. In the `config.yaml` file, use a wildcard for the chromosome parameter: `chromosome: ${chr}`.
2. Use an sbatch script to distribute the genotype generation over 22 compute nodes for the 22 chromosomes (see below).
3. Use a bash script to run phenotype generation on 1 compute node (because it needs to sum genetic effects over all chromosomes).
Genotype generation script:
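A sketch of such a script, assuming the Singularity setup from earlier and that the ${chr} wildcard is filled in with envsubst; partition names, resource requests and paths will depend on your cluster:

```bash
#!/bin/bash
#SBATCH --job-name=hapnest_geno
#SBATCH --array=1-22          # one array task per chromosome
#SBATCH --time=12:00:00       # illustrative resource requests
#SBATCH --mem=64G

# Fill the ${chr} wildcard in the config with this task's chromosome number
export chr=${SLURM_ARRAY_TASK_ID}
envsubst '${chr}' < data/config.yaml > data/config_chr${chr}.yaml

# Generate the genotypes for this chromosome
singularity run --bind data/:/data/ containers/hapnest.sif generate_geno /data/config_chr${chr}.yaml
```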
Phenotype generation script:
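And a sketch of the phenotype step, submitted as a single job once all genotype jobs have finished (same assumptions as above):

```bash
#!/bin/bash
#SBATCH --job-name=hapnest_pheno
#SBATCH --time=12:00:00    # illustrative resource requests
#SBATCH --mem=64G

# Phenotype generation sums genetic effects across all chromosomes, so it runs on a single node
singularity run --bind data/:/data/ containers/hapnest.sif generate_pheno /data/config.yaml
```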