Synthetic data for human genomics

Author

Benjamin Wingfield

Published

November 14, 2024

Modified

December 17, 2024

Why are synthetic genomes useful?

Human genetic data is extremely sensitive and is classed as special category data under the General Data Protection Regulation (GDPR). Open access human genomes are available from projects like:

  • 1000 Genomes Project (1000G)

  • Human Genome Diversity Project (HGDP)

However, these datasets are small (low thousands of individuals) and often lack linked trait information for privacy and ethical reasons.

When people develop polygenic scores (PGS), they need data from hundreds of thousands of individuals with linked healthcare records. This type of data, held by biobanks like UK Biobank or FinnGen, is not publicly available: it can only be accessed by vetted researchers who have prepared a research plan and submitted a formal application to the relevant authority.

Realistic synthetic genomes can be useful for people looking to get started developing data science models or methods that use human genetic variation.

What is a genetic variant?

EMBL-EBI host a free online training course describing the basics of human genetic variation.

Synthetic human genotype and phenotype data from HAPNEST

HAPNEST is a program for simulating large-scale, diverse, and realistic datasets for genotypes and phenotypes. A manuscript describing HAPNEST in detail is available at:

Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke J O’Connor, Samuel Kaski, Pekka Marttinen, Pier Francesco Palamara, Christoph Lippert, Andrea Ganna, HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes, Bioinformatics, Volume 39, Issue 9, September 2023, btad535, https://doi.org/10.1093/bioinformatics/btad535

The creators of HAPNEST made a large synthetic dataset freely available for others to reuse and experiment with.

How can I access the synthetic data?

The HAPNEST synthetic data is available from BioStudies (accession: S-BSST936).

Synthetic dataset summary
Dataset  Genetic variants  Individuals  Genetic ancestry groups  Phenotypes  Genome build
Large    6,800,000         1,008,000    6                        9           GRCh38
Small    6,800,000         600          6                        9           GRCh38

The HAPNEST data includes 6.8 million genetic variants and 9 continuous phenotypic traits. Individuals in the dataset cover 6 genetic ancestry groups (AFR, AMR, CSA, EAS, EUR, MID). The creators of the dataset defined genetic ancestry groups using super-population labels from the largest open access globally representative reference panels: 1000 Genomes (1000G) and the Human Genome Diversity Project (HGDP). The full dataset, which includes 1,008,000 individuals, is several terabytes in size, so a smaller dataset of 600 individuals is available for testing. Data from BioStudies can be browsed and downloaded using a web browser, but better alternatives exist for the largest files.

How can I download very large files from BioStudies?

BioStudies offers multiple ways to download large volumes of data, including:

  • Globus

  • Aspera

  • FTP

  • HTTPS using CLI tools like wget or curl (see the example below)
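
As a sketch, a download over HTTPS with wget might look like the following. The URL below is a placeholder, not the real path: copy the link for each file from the study's page on BioStudies.

$ # Placeholder URL: substitute the real file link from the BioStudies listing
$ # --continue resumes partial downloads, which helps with very large files
$ wget --continue \
  "https://ftp.ebi.ac.uk/biostudies/<path-to-S-BSST936>/Files/synthetic_small_v1_chr-1.bed"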

Are there restrictions on data use?

The data are licensed under CC0. There are no restrictions on data use or redistribution. If you use the synthetic data in academic work, it’s polite to cite the original publication.

Partitioning data (train / test split)

In many data science tasks it’s common to split data to assess how well models can generalise to new data.

Here’s a simple approach to splitting PLINK data:

$ wc -l < synthetic_small_v1_chr-1.fam
600

Each fam file in the synthetic data contains the same information: a list of sample IDs. Assuming a standard 80/20 train/test split, grab 480 sample IDs (80% of 600) from a fam file:

$ export N_TRAINING_SAMPLES=480
$ shuf --random-source <(yes 42) synthetic_small_v1_chr-1.fam | \
  head -n $N_TRAINING_SAMPLES | \
  cut -f 1,2 -d ' ' > train.txt
Warning

In different datasets (e.g. the full HAPNEST release) you will need to calculate the number of samples you want in the training set and adjust the command above, as sketched below.
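
For instance, a minimal sketch that derives an 80% training count from any fam file using shell arithmetic (shown here with the small dataset's fam file):

$ # Count the samples in the fam file and take 80% (integer division)
$ export N_TRAINING_SAMPLES=$(( $(wc -l < synthetic_small_v1_chr-1.fam) * 80 / 100 ))
$ echo $N_TRAINING_SAMPLES
480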

Tip
  • --random-source <(yes 42) supplies a fixed stream of bytes to shuf, acting as a seed
  • A seed makes the command produce the same random order each time it is run
  • Seeds are an important part of reproducible data science (see the demonstration below)
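
To see the seed at work, run the shuffle twice and compare checksums; both runs should print the same hash:

$ # Identical checksums confirm the seeded shuffle is deterministic
$ shuf --random-source <(yes 42) synthetic_small_v1_chr-1.fam | md5sum
$ shuf --random-source <(yes 42) synthetic_small_v1_chr-1.fam | md5sum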

Now split the data into train and test splits:

$ # Keep only the training sample IDs
$ plink2 --bfile synthetic_small_v1_chr-1 \
  --keep train.txt \
  --make-bed \
  --out train
$ # Exclude the training samples, leaving the test set
$ plink2 --bfile synthetic_small_v1_chr-1 \
  --remove train.txt \
  --make-bed \
  --out test

Six new files should be created:

  • train.bed, train.bim, and train.fam
  • test.bed, test.bim, and test.fam

Double-check they have the correct number of samples:

$ wc -l < train.fam
480
$ wc -l < test.fam
120
Tip

You could also use data science libraries such as sklearn or caret to do the split for you, combined with a library that can read PLINK 1 data, like pyplink or genio.

Next steps

Section 6 of the HAPNEST paper describes how different polygenic scoring methods were applied to the synthetic data.