Decoding Diversity with gnomAD

The interpretation of rare genetic variants remains one of the central challenges in human genomics. Any two individuals differ at roughly three to five million positions across their 3-billion-base-pair genomes. Pinpointing which of these differences are functionally or clinically relevant in an individual with a disease is a formidable task that requires nuanced interpretation.
Thanks to recent advances in technologies for processing vast troves of data, population-scale genome sequencing offers a path forward. It provides the context needed for rare disease gene discovery and for broader efforts, such as variant classification, to understand the genetic contributors to human disease.
Genetic Variation – The Clues Within
“There’s just no capacity for any geneticist to interpret all the variants one by one,” said Professor Heidi Rehm, Chief Genomics Officer at Massachusetts General Hospital and Chief Medical Officer of Broad Clinical Labs.
To narrow down the search for causative variants, the first step is to eliminate common genetic variations. If a variant occurs in the general population at a frequency higher than the prevalence of the disease, it is unlikely to be the cause.
“The more populations we study, the more variants that are unlikely disease-causing can be ruled out, thus allowing for a clinical lab or the research team to focus on a much smaller set of variants,” said Prof Rehm, who is also a principal investigator on major initiatives such as the Clinical Genome Resource and the All of Us research programme.
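The filtering principle described above can be illustrated with a short Python sketch. It is only a minimal illustration, not gnomAD's actual method: the variant identifiers, frequencies and disease prevalence are made-up values, and using the disease prevalence directly as the cut-off is a simplifying assumption (real pipelines also account for factors such as inheritance mode and penetrance).

```python
from dataclasses import dataclass


@dataclass
class Variant:
    variant_id: str          # hypothetical identifier, for illustration only
    allele_frequency: float  # frequency in the reference population (0 to 1)


def filter_candidates(variants, disease_prevalence):
    """Keep variants that are not more common than the disease itself."""
    return [v for v in variants if v.allele_frequency <= disease_prevalence]


# Made-up example: three variants found in a patient with a rare disease.
variants = [
    Variant("var_A", 0.12),     # common in the population
    Variant("var_B", 0.00002),  # very rare
    Variant("var_C", 0.004),    # uncommon, but more frequent than the disease
]

# Assumed disease prevalence of 1 in 10,000.
candidates = filter_candidates(variants, disease_prevalence=1 / 10_000)
print([v.variant_id for v in candidates])
```

In this toy example only the very rare variant remains a candidate, which is the sense in which broader population data shrinks the set of variants that need expert review.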
Conversely, if a variant is found in a genomic region that generally exhibits little variation across diverse populations, it is more likely to be associated with disease. This principle is the driving force behind the Genome Aggregation Database (gnomAD), a resource that collates genetic data from global populations. By providing allele frequency estimates at scale across a broad spectrum of populations and ancestries, gnomAD enables researchers and clinicians to assess variant rarity, refine disease gene discovery and improve the clinical interpretation of genetic data across a wide range of applications.
While the majority of gnomAD data has been aggregated into a centralised repository from a handful of large-scale genetic diversity initiatives and publicly available legacy datasets, emerging national genomic initiatives offer a unique opportunity to broaden population representation, especially for underrepresented groups.
However, centralising data is not feasible for all datasets, as some must remain within their local environments due to security, policy or legal constraints. To address this, the federated gnomAD project was set up to incorporate these datasets in aggregate form. Participating sites are guided to process their data locally, so that individual-level data remains at the source, while the aggregate results are combined into a single user interface and made downloadable for use by the community. A federated approach thus enables secure, decentralised analysis that supports greater population diversity, data security, privacy and scalability.
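As a rough illustration of how aggregate results from separate sites might be combined, the Python sketch below merges per-site allele counts into a single frequency estimate. This is an assumption-laden sketch rather than the federated gnomAD implementation: the function name, site names and numbers are hypothetical, and each site is assumed to share only an allele count (AC) and allele number (AN) per variant, never individual genotypes.

```python
from collections import defaultdict


def combine_site_counts(site_summaries):
    """Merge per-site (allele count, allele number) pairs for each variant.

    Each summary maps a variant identifier to (AC, AN), where AC is the
    number of observed alternate alleles and AN is the number of alleles
    successfully genotyped at that position.
    """
    totals = defaultdict(lambda: [0, 0])
    for summary in site_summaries:
        for variant_id, (ac, an) in summary.items():
            totals[variant_id][0] += ac
            totals[variant_id][1] += an
    # Combined allele frequency is total AC divided by total AN.
    return {
        vid: (ac, an, ac / an if an else 0.0)
        for vid, (ac, an) in totals.items()
    }


# Hypothetical aggregates from two sites; no individual-level data is shared.
# Note: a site that never observed a variant should still report AC = 0 with
# its AN, otherwise the combined frequency would be overestimated.
site_a = {"var_A": (3, 20_000), "var_B": (150, 20_000)}
site_b = {"var_A": (1, 180_000), "var_B": (50, 180_000)}

combined = combine_site_counts([site_a, site_b])
for vid, (ac, an, af) in sorted(combined.items()):
    print(f"{vid}: AC={ac} AN={an} AF={af:.6f}")
```

Summing counts rather than averaging per-site frequencies means that larger cohorts contribute proportionally more to the combined estimate.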
Singapore’s Unique Contribution—and Opportunity
Singapore is a leading scientific and innovation hub in Asia, and its participation in the federated gnomAD project helps enrich the diversity of genomic data on a global scale while raising the nation’s visibility in the international genomics and precision medicine community.
Singapore’s multi-ethnic population, comprising Chinese, Indian and Malay communities, positions it as a uniquely valuable contributor to federated gnomAD. These groups together represent at least 80% of the genetic variation found across Asia, a region home to more than 4.8 billion people, making Singapore a critical node in efforts to build more inclusive and representative genomic reference resources.
The rich data resource from Singapore not only enhances the breadth of Asian genomic data but also ensures that Southeast Asian populations are represented in global precision medicine initiatives, complementing existing global datasets, which are predominantly of European ancestry.
Meet two researchers in Singapore, Dr Nicolas Bertin and Dr Maxime Hebrard, who are part of the federated gnomAD network.
“What is unique is that we have a sizeable Malay population, which is not well-studied, and markedly underrepresented in global datasets,” said Dr Nicolas Bertin, Group Leader of the Genome Research Informatics and Data Science Platform (GRIDS) at the Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR). “Including these rare populations adds tremendous value to the database itself but also benefits the global research and clinical communities.” Greater ancestral diversity enhances the resolution of variant interpretation worldwide, enabling more accurate clinical diagnosis and equitable genomic research across populations.
At the same time, Singaporeans also gain from being able to compare their genetic variations with those from individuals of similar ethnic backgrounds in the global gnomAD dataset. “There are about five million people in Singapore, and we have the DNA (whole genome) sequences of 100,000 individuals through the PRECISE-SG100K study,” Dr Bertin explained. “However, globally, there are far more individuals of Chinese or Indian descent, which will provide much greater depth to the data than would be possible with Singapore’s data alone. This translates into more precise allele frequency estimates and deeper insights into rare variants in our populations.”
These allele frequency estimates help researchers and clinicians improve genetic analysis in Asian populations and the management of patients with rare genetic conditions. For example, the SG10K_Health study found that variants associated with familial hypercholesterolemia and with hereditary breast and ovarian cancer were highly prevalent among Singaporean Chinese, Malay and Indian individuals, at 1 in 140 and 1 in 150 respectively.
Data Federation: Maintaining the Highest Standards of Data Integrity and Privacy
Traditionally, researchers had to centralise sensitive data in a single repository, often facing logistical, ethical and cross-jurisdiction legal hurdles. Federation sidesteps these barriers by allowing data to remain securely within national boundaries while still enabling collaborative analyses, paving the way for broader global participation in genomic research.
While sharing genetic data provides substantial scientific and medical benefits, it needs to be conducted in a way that safeguards privacy. “There is a lot of sensitivity about sharing genetic data or health information, which is necessary,” said Dr Bertin. “Consolidating genetic information by shared ancestry and geography provides extremely useful information to the scientific community all the while providing privacy-preserving data anonymisation guardrails. Each person’s data is anonymised as one of hundreds or thousands of datapoints and there is no way to re-identify an individual,” he added. “Establishing federated access to such datasets lays the foundation for future integrations of more granular health information, where robust privacy-preserving safeguards are under development.”
Harmonising Data Standards and Expanding to Reach More Groups
“The idea of establishing the federated gnomAD project is that the gnomAD team shares the method of analysing the data so that each group can analyse its dataset in its own country, under its own legislation,” explained Dr Maxime Hebrard, Associate Researcher at GIS GRIDS, A*STAR. “Hence, each group can benefit from the gnomAD team experience and ensure uniformity in processing of the data that is aggregated and shared.”
All participating sites, including Singapore, adhere to federated gnomAD’s best practices for data processing and quality control. Maintaining a standardised approach to data analysis is crucial for ensuring both the accuracy of the results and the compatibility of information shared across the nodes of the federation. The joint-calling method used by the research groups participating in the federated gnomAD project ensures that each dataset is processed using the same variant-calling algorithm, resulting in consistent and high-quality data.
However, as the gnomAD initiative expands to include more research groups worldwide, and as variant-calling algorithms continue to evolve, maintaining standardisation will become more challenging. “Furthermore, as datasets grow larger, the cost of reprocessing all the data becomes prohibitive,” noted Prof Rehm, who also serves as the Co-Director of gnomAD.
“About 85% of the genome can be called relatively accurately,” noted Prof Rehm. “If we separately focus on the remaining 15% and potentially reprocess just those regions, this may allow us to improve the quality of these more challenging regions without incurring the cost of reprocessing the entire genome.” Following this line of thinking, researchers can explore artificial intelligence-based algorithms that identify genomic regions which can be consistently analysed, regardless of the specific variant-calling algorithm used.
A New Value Equation and its Impact on Population Health
As genome sequencing becomes more affordable, the cost-benefit equation is shifting. “There are now circumstances where the value of that data is high enough that sequencing could perhaps one day be free to the individual or even free to the healthcare system,” Prof Rehm said. “Other places are also generating large datasets because the value of the genomic and associated health data can often offset the cost of DNA sequencing.”
Beyond research, this shift has profound implications for healthcare. “This is more than just a scientific exercise; this enables preventive medicine: identifying genetic risk factors early to guide timely interventions and reduce disease burden,” said Dr Bertin. “This not only benefits patients but also the healthcare system, whether public, private, or a mix of both, by providing more cost-effectiveness and improving long-term patient outcomes overall.”
Ultimately, gnomAD demonstrates that while each person’s genetic data may seem like a small piece of the puzzle, when combined with millions of others, it becomes a powerful tool that benefits individuals and populations worldwide. “With the development of the federated gnomAD project, we can take advantage of the population data from every corner of the earth. Not only does this enhance the process of interpreting variants for us as clinical geneticists, but more importantly, it contributes to a global scientific effort,” Prof Rehm concluded.