Research

Synthetic datasets for enabling genetics-based precision medicine

Unlocking the power of AI for precision medicine requires large and diverse biomedical datasets. As part of the Europe-wide INTERVENE consortium, collaborators and I developed a new approach to create synthetic datasets for genotypes and phenotypes.

The INTERVENE consortium is harmonising data on more than 1.7 million genomes to develop AI-based integrative risk scores for the next generation of predictive and personalised medicine. The growing availability of larger and more diverse genomics datasets creates a need for new computational methods to interpret these data. However, sharing sensitive patient data across national biobanks raises privacy concerns, and synthetic datasets are needed to develop and test federated computing infrastructure.

Synthetic data provides a mechanism to generate large and diverse genomics datasets that preserve key statistical properties of real data without being traceable back to real individuals. However, existing methods for simulating genetics datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, the field lacked both a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for genetic risk scoring.

In a recent Bioinformatics publication, we present HAPNEST, a new approach for efficiently generating and evaluating diverse, individual-level synthetic data for genotypes and phenotypes. Compared to alternative methods, HAPNEST is computationally faster and shows a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These properties enabled us to create a new synthetic data resource of 6.8 million common variants and nine phenotypes for over 1 million individuals. Because this resource was generated from publicly available reference datasets, it is also publicly available to download and use in your own work.

To create high-fidelity synthetic datasets that preserve key statistical properties of real genotypes, a resampling approach constructs synthetic genotypes as a series of segments imperfectly copied from a reference panel of real genotypes. To limit overfitting to the reference, we introduce variability in which each segment's coalescent age determines the genetic distances between cross-over events and the age-based mutations. Likelihood-free inference techniques are used to model the unknown parameters of the synthetic data model for different population groups. Phenotypes are simulated using a statistical model of complex and polygenic diseases, where users can customise the heritability, polygenicity, population-specific effects, and the genetic correlation and pleiotropy models for multi-trait simulation.

HAPNEST is implemented as a multi-threaded software program for computational efficiency when generating large synthetic datasets. To evaluate synthetic data quality, HAPNEST also provides an evaluation pipeline with fidelity and generalisability metrics.

Publication details and other resources

See below for links to the full publication, data and software resources. The HAPNEST open-source software program is available at https://github.com/intervene-EU-H2020/synthetic_data. A synthetic data resource of 6.8 million common genetic variants and 9 phenotypes for over 1 million individuals is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The final paper was published in the journal Bioinformatics, and an earlier version of the work was also presented at the NeurIPS Workshop on Synthetic Data for Empowering ML Research.

Journal/conference publication

Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke J O’Connor, Samuel Kaski, Pekka Marttinen, Pier Francesco Palamara, Christoph Lippert, Andrea Ganna, HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes, Bioinformatics, Volume 39, Issue 9, September 2023, https://doi.org/10.1093/bioinformatics/btad535



Tutorial

This tutorial covers how to use the HAPNEST software to generate and evaluate synthetic datasets. If you are more interested in downloading a pre-simulated dataset, please check out our synthetic data resource.

Setup

Prerequisites

To use the HAPNEST software, you will first need either Docker or Singularity installed. Containers provide a simple and portable way to use the HAPNEST software across different platforms, by including all libraries needed to run the application. Docker is the most widely used container engine, but Singularity does not require root privileges and is therefore commonly used on HPC clusters.

Downloading the container

If using Docker:
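Pull the container image (assuming it is published on Docker Hub under the same name used in the commands later in this tutorial; check the GitHub repository if the image name differs):

docker pull sophiewharrie/intervene-synthetic-data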

If using Singularity:
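Pull the image and build a .sif file in the containers directory (again assuming the Docker Hub image name; the output filename matches the one used in the commands later in this tutorial):

mkdir -p containers
singularity pull containers/intervene-synthetic-data_latest.sif docker://sophiewharrie/intervene-synthetic-data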

Initialising software dependencies and downloading reference data

To initialise the software dependencies and download the reference dataset, run the init command followed by the fetch command. The data inputs and outputs are bound to the data directory (or data volume, if using Docker).

If using Docker:

docker run -v data:/data/ -it sophiewharrie/intervene-synthetic-data init
docker run -v data:/data/ -it sophiewharrie/intervene-synthetic-data fetch

If using Singularity:

First, we recommend setting up the directory structure shown below, with a containers directory and a data directory. The commands can then be run from the root directory.
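A minimal layout consistent with the commands used in this tutorial (config.yaml is placed in the data directory, matching the paths used below):

.
├── containers/
│   └── intervene-synthetic-data_latest.sif
└── data/
    └── config.yaml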

singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif init
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif fetch

Please note that the data download may take a while. The init and fetch commands only need to be run once, the first time you use the HAPNEST software.

Generating synthetic data

HAPNEST consists of a modular pipeline with separate commands for each of the main functionalities. Given a YAML configuration file, genotypes can be generated using the generate_geno command (optionally, the optimise command can be run first to select optimal hyperparameters). After the genotypes are generated, the corresponding phenotypes can be generated using the generate_pheno command. Finally, data quality evaluation can be carried out using the validate command (however, please note that this operation can be slow for large synthetic datasets).

Example workflow 1 - Generate genotype and phenotype data, and then evaluate the data quality

This workflow takes a config.yaml file as input, generates genotype and phenotype data, and then evaluates the data quality. More details on the config.yaml file are given in the next section.

If using Docker:

docker run -v data:/data/ -it sophiewharrie/intervene-synthetic-data generate_geno 1 config.yaml
docker run -v data:/data/ -it sophiewharrie/intervene-synthetic-data generate_pheno config.yaml
docker run -v data:/data/ -it sophiewharrie/intervene-synthetic-data validate config.yaml

If using Singularity:

singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_geno 1 data/config.yaml
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_pheno data/config.yaml
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif validate data/config.yaml

Example workflow 2 - Optimise parameters for genotype generation and then create a genotype dataset

Alternatively, you can first run the optimisation workflow to determine optimal hyperparameter values, and then use those parameters in the config.yaml file to generate the genotypes.

If using Docker:

docker run -v data:/data/ -it sophiewharrie/intervene-synthetic-data optimise config.yaml
# before running the next command, check the results and update the config.yaml file
docker run -v data:/data/ -it sophiewharrie/intervene-synthetic-data generate_geno 1 config.yaml

If using Singularity:

singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif optimise data/config.yaml
# before running the next command, check the results and update the config.yaml file
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_geno 1 data/config.yaml

Creating a configuration file

The configuration file is a YAML file that allows you to configure global parameters, filepaths, genotype data parameters, and phenotype data parameters, as well as options for the evaluation pipeline and optimisation procedure.

Configuring global parameters and filepaths

  • chromosome: Setting this to chromosome: all generates data for each chromosome sequentially. It is instead recommended to use specific numbers (from 1 to 22) and a distributed computing setup to generate data more efficiently

  • superpopulation: Local settings override the global settings, so if you want to customise the population settings further then you can do this in the genotype data settings. Otherwise, you can use the global setting to select one superpopulation group or specify superpopulation: none to generate all 6 groups in equal ratio

  • filepaths: Most filepaths can be left as the default values if you are using the provided reference panels. Set an output path output_dir and prefix output_prefix for the generated dataset. A configuration sketch follows this list
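A hypothetical sketch of these global settings (the key names come from the list above, but the section names, nesting, and values are assumptions; refer to the example config.yaml in the GitHub repository):

global_parameters:            # assumed section name
  chromosome: 1               # 1 to 22, or all
  superpopulation: none       # none, or a single group such as EUR or AFR
filepaths:                    # assumed section name
  output_dir: /data/outputs   # example output path
  output_prefix: synthetic    # example prefix for the generated files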



Configuring genotype data settings

  • If use_default: true, the default population groups are determined by the superpopulation setting in global parameters. Otherwise, you can add custom groups (see below for examples)

  • Specify rho (recombination rate) and Ne (effective population size) for each of the 6 groups. The optimise command can be used to determine these values automatically for the reference panel

  • When setting nsamples, keep in mind that superpopulation: none generates ~nsamples/6 samples for each of the 6 groups, whereas a specific group (e.g., superpopulation: AFR) generates all nsamples for that group. A configuration sketch follows this list
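A hypothetical sketch of the genotype settings (key names from the list above; the section name, nesting, per-group structure, and values are assumptions and placeholders):

genotype_data:        # assumed section name
  use_default: true   # use the default population groups from the global superpopulation setting
  nsamples: 600       # with superpopulation: none this gives ~100 samples per group
  rho: 0.75           # recombination rate (in practice specified for each of the 6 groups)
  Ne: 18000           # effective population size (in practice specified for each of the 6 groups)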



Custom populations - example 1: 100 genotypes with EUR ancestry and 200 genotypes with AFR ancestry
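A hypothetical sketch of what this custom group specification could look like (the schema is an assumption made for illustration; check the repository's example configs for the real format):

custom_populations:      # assumed key
  - nsamples: 100
    EUR: 1.0             # all segments sampled from EUR reference samples
  - nsamples: 200
    AFR: 1.0             # all segments sampled from AFR reference samples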



Custom populations - example 2: 100 genotypes, where each genotype has 50% of segments sampled from EUR reference samples and 50% of segments sampled from AFR reference samples. Please note that these are not true admixed genotypes, because the algorithm does not account for the timing of the admixture events
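Using the same hypothetical schema, a 50/50 EUR-AFR specification could look like:

custom_populations:      # assumed key
  - nsamples: 100
    EUR: 0.5             # half of each genotype's segments sampled from EUR references
    AFR: 0.5             # half sampled from AFR references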



Configuring optimisation settings

The optional optimisation procedure for determining values for rho and Ne uses likelihood-free inference.

Simulation-based approach:

  1. Define prior distributions for the unknown parameters (upper and lower bounds of uniform distributions)

  2. Specify the simulation parameters in simulation_rejection_ABC and acceptance threshold threshold

  3. Choose one or both summary_statistics (setting both statistics to true will jointly optimise for both objectives); a configuration sketch follows this list
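A hypothetical sketch of the simulation-based optimisation settings (simulation_rejection_ABC, threshold, and summary_statistics appear in the list above; the prior bounds, statistic names, nesting, and values are placeholders):

optimisation:                        # assumed section name
  priors:                            # uniform priors for the unknown parameters
    rho: {lower: 0.1, upper: 2.0}    # placeholder bounds
    Ne: {lower: 5000, upper: 50000}  # placeholder bounds
  simulation_rejection_ABC:
    n_simulations: 500               # placeholder number of simulations
    threshold: 0.1                   # acceptance threshold
  summary_statistics:                # set both to true to optimise both objectives jointly
    statistic_1: true                # placeholder names
    statistic_2: true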



An alternative is the emulation-based approach, which learns a surrogate model from n_design_points design points to estimate the summary statistic values for different parameter values. This is useful for computationally expensive simulations of large synthetic datasets.

Emulation-based approach:
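A correspondingly hypothetical sketch for the emulation-based approach, which replaces the rejection-ABC block with a surrogate-model setting (n_design_points is the only key taken from the text above; everything else is a placeholder):

optimisation:                  # assumed section name
  method: emulation            # placeholder key for selecting the approach
  n_design_points: 100         # design points used to fit the surrogate model (placeholder value)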



To help with interpreting the results from either approach, we provide a Jupyter notebook in the GitHub repository. As an example, for simulation-based rejection sampling run for 500 iterations with an acceptance threshold of 0.1, the means of the resulting marginal distributions for rho and Ne can be used as the values in the config.yaml file.

Configuring phenotype data settings

  • nPopulation specifies the number of ancestry groups (6 by default)

  • nTrait is the number of phenotypes you want to simulate

  • Prevalence is the disease prevalence in each of the nPopulation populations

  • a, b and c are parameters reflecting the strength of negative selection on each aspect of the genetic effect model (MAF, LD, functional annotation)

  • nComponent is the number of Gaussian mixture components and CompWeight are the weights for each component. Setting nComponent>1 allows for a more general model where causal SNP effects are stratified into multiple levels, such as small, medium and large effects

  • ProportionGeno is the observed causal SNP heritability in each population and trait

  • For causal variants, either supply a list of causal variants in filepaths or set UseCausalList: false and set the Polygenicity and Pleiotropy values (if nTrait>1)

  • There are additional settings for

    • ProportionCovar: the observed proportion of variance contributed by the covariate for each population and trait

    • TraitCorr: correlation matrix for the correlation between traits (nTrait x nTrait dimension, must be symmetric positive definite)

    • PopulationCorr: correlation matrix for the genetic correlation between populations (nPopulation x nPopulation dimension, must be symmetric positive definite). A combined configuration sketch follows this list
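Putting the phenotype settings above together, a hypothetical sketch for two traits and the default six populations (key names from the list above; the section name, nesting, value shapes, and all numbers are placeholders):

phenotype_data:                    # assumed section name
  nPopulation: 6
  nTrait: 2
  Prevalence: [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]        # placeholder prevalence per population
  a: 0.4                                            # placeholder negative-selection parameters
  b: 0.0
  c: 0.0
  nComponent: 1
  CompWeight: [1.0]
  UseCausalList: false
  Polygenicity: 0.01                                # placeholder proportion of causal variants
  Pleiotropy: 0.9                                   # placeholder, used when nTrait > 1
  ProportionGeno: [[0.4, 0.4, 0.4, 0.4, 0.4, 0.4],  # placeholder heritability per population, one row per trait
                   [0.3, 0.3, 0.3, 0.3, 0.3, 0.3]]
  ProportionCovar: [[0.1, 0.1, 0.1, 0.1, 0.1, 0.1], # placeholder covariate variance per population, one row per trait
                    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]]
  TraitCorr: [[1.0, 0.5],                           # nTrait x nTrait, symmetric positive definite
              [0.5, 1.0]]
  # PopulationCorr: an analogous nPopulation x nPopulation symmetric positive definite matrix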



Configuring evaluation settings

Specify which metrics you want to compute by setting the value to true or false. Please note that some metrics are computationally expensive for large datasets, so you may want to set these to false if you don't need them (e.g., kinship metrics can be slow because they compute relatedness between all pairs of genotypes).
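A hypothetical sketch of the evaluation settings (only the kinship metric is named in this tutorial; the other metric names below are placeholders standing in for the fidelity and generalisability metrics listed in the repository's example config):

evaluation:                  # assumed section name
  kinship: false             # relatedness between all pairs of genotypes; slow for large datasets
  allele_frequencies: true   # placeholder fidelity metric
  ld_structure: true         # placeholder fidelity metric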



Large-scale synthetic data generation (HPC cluster)

We used a distributed setup to generate data for 6.8 million common variants and nine phenotypes for over 1 million individuals in less than 12 hours. Our HPC cluster uses the Slurm scheduler system, where the main steps are as follows:

  1. In the config.yaml file, use a wildcard for the chromosome parameter: chromosome: ${chr}

  2. Use an sbatch script to distribute the genotype generation over 22 compute nodes for the 22 chromosomes (see below)

  3. Use a bash script to run phenotype generation on 1 compute node (because it needs to sum genetic effects over all chromosomes)

Genotype generation script:

#!/bin/bash

#SBATCH --array=1-22
#SBATCH --mem-per-cpu=32G
#SBATCH --cpus-per-task=8
#SBATCH --time 24:00:00

n=$SLURM_ARRAY_TASK_ID

CONFIG=data/config # prefix for config file

# generate a config for each chromosome
cp ${CONFIG}.yaml ${CONFIG}$n.yaml
sed -i "s/\${chr}/$n/g" ${CONFIG}$n.yaml

# generate data for each chromosome
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_geno 8 ${CONFIG}$n.yaml

Phenotype generation script:

#!/bin/bash

CONFIG=data/config # prefix for config file

# replace the chromosome wildcard with all
cp ${CONFIG}.yaml ${CONFIG}_pheno.yaml
sed -i "s/\${chr}/all/g" ${CONFIG}_pheno.yaml

# generate phenotype data
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_pheno ${CONFIG}_pheno.yaml
