# Synthetic data for human genomics
## Why are synthetic genomes useful?
Human genetic data is extremely sensitive and is special category data under the General Data Protection Regulation (GDPR). Human genomes are openly available from projects like the 1000 Genomes Project (1000G) and the Human Genome Diversity Project (HGDP).
However, these datasets are small (low thousands of individuals) and often lack linked trait information for privacy and ethical reasons.
When people develop polygenic scores (PGS), they need data from hundreds of thousands of individuals with linked healthcare records. This type of data, held by resources like UK Biobank or FinnGen, is not publicly available. It can only be accessed by vetted researchers who have written a research plan and submitted a formal application to the relevant authority.
Realistic synthetic genomes can be useful for people looking to get started developing data science models or methods that use human genetic variation.
## What is a genetic variant?
EMBL-EBI host a free online training course describing the basics of human genetic variation.
## Synthetic human genotype and phenotype data from HAPNEST
HAPNEST is a program for simulating large-scale, diverse, and realistic datasets for genotypes and phenotypes. A manuscript describing HAPNEST in detail is available at:
> Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke J O’Connor, Samuel Kaski, Pekka Marttinen, Pier Francesco Palamara, Christoph Lippert, Andrea Ganna, HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes, Bioinformatics, Volume 39, Issue 9, September 2023, btad535, https://doi.org/10.1093/bioinformatics/btad535
The creators of HAPNEST made a large synthetic dataset freely available for others to reuse and experiment with.
## How can I access the synthetic data?
The HAPNEST synthetic data is available from BioStudies (accession: S-BSST936).
| Dataset | Genetic variants | Individuals | Genetic ancestry groups | Phenotypes | Genome build |
|---|---|---|---|---|---|
| Large | 6,800,000 | 1,008,000 | 6 | 9 | GRCh38 |
| Small | 6,800,000 | 600 | 6 | 9 | GRCh38 |
The HAPNEST data includes 6.8 million genetic variants and 9 continuous phenotypic traits. Individuals in the dataset cover 6 genetic ancestry groups (AFR, AMR, CSA, EAS, EUR, MID). The creators of the dataset defined genetic ancestry groups using super-population labels from the largest open access globally representative reference panels: 1000 Genomes (1000G) and Human Genome Diversity Project (HGDP). The full dataset including 1,008,000 individuals is several terabytes, so a smaller dataset including 600 individuals is available for testing. Data from BioStudies can be browsed and downloaded using a web browser but better alternatives exist for the largest files.
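As a sanity check on the "several terabytes" figure, here's a back-of-envelope size estimate in Python. It assumes the PLINK 1 `.bed` layout described later on this page (two bits per genotype, packed four genotypes per byte, one block per variant) and counts only the `.bed` files; `.bim` and `.fam` files add more on top.

```python
import math

n_individuals = 1_008_000
n_variants = 6_800_000

# PLINK 1 .bed files pack genotypes 2 bits each, 4 per byte,
# one ceil(N/4)-byte block per variant, plus a 3-byte header.
bytes_per_variant = math.ceil(n_individuals / 4)
bed_bytes = 3 + n_variants * bytes_per_variant

print(f"{bed_bytes / 1e12:.2f} TB")  # 1.71 TB for the .bed files alone
```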
## How can I download very large files from BioStudies?
BioStudies offers multiple ways to download lots of data, including:

- Globus
- Aspera
- FTP
- HTTPS, using CLI tools like `wget` or `curl`
## Are there restrictions on data use?
The data are licensed under CC0. There are no restrictions on data use or redistribution. If you use the synthetic data in academic work, it’s polite to cite the original publication.
## What are PLINK 1 (bed / bim / fam) files?
The synthetic datasets are in PLINK 1 binary format. PLINK is a whole-genome association analysis toolset, sometimes described as the “swiss army knife of human genomics”. A PLINK bfile is composed of three files:
- A binary biallelic genotype table (`.bed`)
- An extended MAP file, which contains variant information (`.bim`)
- A sample information file, which contains phenotypes / traits (`.fam`)
The PLINK 1 format can only represent hard-called biallelic variants. Real human genetic variation can be much more complex than this, but it’s a helpful abstraction to start with.
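To make "hard-called biallelic" concrete, here's a minimal Python sketch of how a `.bed` file packs genotypes, two bits per sample. The byte below is hand-crafted for illustration; for real files, use a dedicated library.

```python
# PLINK 1 .bed files start with three magic bytes, then pack genotypes
# 2 bits per sample: 00 = hom A1, 01 = missing, 10 = het, 11 = hom A2.
MAGIC = bytes([0x6C, 0x1B, 0x01])  # magic number + variant-major mode

def decode_variant(block: bytes, n_samples: int) -> list:
    """Unpack 2-bit genotype codes; the first sample sits in the lowest bits."""
    codes = []
    for byte in block:
        for shift in (0, 2, 4, 6):
            codes.append((byte >> shift) & 0b11)
    return codes[:n_samples]

# Toy file: header plus one variant for 4 samples with codes 00, 10, 11, 01,
# packed low-to-high into a single byte: 0b01_11_10_00
toy_bed = MAGIC + bytes([0b01111000])
assert toy_bed[:3] == MAGIC
print(decode_variant(toy_bed[3:], n_samples=4))  # [0, 2, 3, 1]
```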
If you want to go further, the PLINK 2 file format can represent more complex types of genetic variation.
- The PLINK 1 format doesn’t understand the concept of reference and alternate alleles, only A1/A2 (minor/major) alleles
- This means that alleles will sometimes swap if the reference allele is less common than the alternate allele
- If preserving allele order is important to you, remember to set `--ref-allele` when using PLINK 2 (or `--keep-allele-order` in PLINK 1.9)
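To see why A1/A2 labels can swap between datasets, here's a toy Python sketch (the cohorts are invented for illustration) that assigns A1 as the minor allele based on observed frequency, in the PLINK 1 style:

```python
from collections import Counter

def a1_a2(alleles):
    """Assign A1 (minor) and A2 (major) from alleles observed in a cohort."""
    counts = Counter(alleles)
    (major, _), (minor, _) = counts.most_common(2)
    return minor, major

# The same variant in two cohorts with different allele frequencies:
cohort_1 = ["G"] * 70 + ["T"] * 30   # T is rarer here, so A1 = T
cohort_2 = ["G"] * 30 + ["T"] * 70   # G is rarer here, so A1 = G
print(a1_a2(cohort_1))  # ('T', 'G')
print(a1_a2(cohort_2))  # ('G', 'T')
```

The A1/A2 labels depend only on the cohort at hand, not on a reference genome, which is why the same variant can appear with swapped alleles in different datasets.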
## Reading and writing PLINK data programmatically
Many excellent software packages are able to read PLINK 1 data from your favourite programming language (for example, `pandas-plink` and `bed-reader` in Python, or `snpStats` in R).
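If you'd rather not pull in a dependency, the text-based `.fam` file is easy to parse by hand: it has six whitespace-separated columns per sample. A minimal sketch (the sample values below are made up, not taken from the HAPNEST files):

```python
import io

# A .fam file has six whitespace-separated columns per sample:
# family ID, within-family ID, father ID, mother ID, sex, phenotype.
# A phenotype of -9 means "missing" in PLINK.
fam_text = """\
FAM1 ID1 0 0 1 -9
FAM1 ID2 0 0 2 -9
"""

def read_fam(handle):
    samples = []
    for line in handle:
        fid, iid, father, mother, sex, phenotype = line.split()
        samples.append({"fid": fid, "iid": iid,
                        "sex": int(sex), "phenotype": phenotype})
    return samples

samples = read_fam(io.StringIO(fam_text))
print([s["iid"] for s in samples])  # ['ID1', 'ID2']
```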
## Partitioning data (train / test split)
In many data science tasks it’s common to split data to assess how well models can generalise to new data.
Here’s a simple approach to splitting PLINK data. First, count the samples:
```shell
$ wc -l < synthetic_small_v1_chr-1.fam
600
```
Each `.fam` file in the synthetic data contains the same information: a list of sample IDs. Assuming a standard 80/20 train/test split, grab 480 sample IDs from a `.fam` file:
```shell
$ export N_TRAINING_SAMPLES=480
$ shuf --random-source <(yes 42) synthetic_small_v1_chr-1.fam | \
    head -n $N_TRAINING_SAMPLES | \
    cut -f 1,2 -d ' ' > train.txt
```
For other datasets (e.g. the full HAPNEST release) you will need to calculate the number of samples you want in the training set and adjust the command above.

- `--random-source <(yes 42)` acts as a seed
- A seed helps to produce the same random order each time the command is run
- Seeds are an important part of reproducible data science
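As an aside, the same seeded shuffle-and-split logic can be sketched in pure Python. The sample IDs below are made up; in practice you'd read them from the `.fam` file:

```python
import random

sample_ids = [f"ID{i}" for i in range(600)]  # stand-ins for the .fam IDs

# Seeding the generator plays the same role as --random-source <(yes 42):
# the shuffle order, and therefore the split, is reproducible.
random.seed(42)
shuffled = random.sample(sample_ids, k=len(sample_ids))

n_train = int(len(shuffled) * 0.8)   # 80/20 split -> 480 training samples
train_ids, test_ids = shuffled[:n_train], shuffled[n_train:]
print(len(train_ids), len(test_ids))  # 480 120
```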
Now split the data into train and test splits:
```shell
$ plink2 --bfile synthetic_small_v1_chr-1 \
    --keep train.txt \
    --make-bed \
    --out train
$ plink2 --bfile synthetic_small_v1_chr-1 \
    --remove train.txt \
    --make-bed \
    --out test
```
Six new files should be created:

- `train.bed`, `train.bim`, and `train.fam`
- `test.bed`, `test.bim`, and `test.fam`
Double check they have the correct number of samples:
```shell
$ wc -l < train.fam
480
$ wc -l < test.fam
120
```
## Next steps
Section 6 of the HAPNEST paper describes how different polygenic scoring methods were applied to the synthetic data.