Imputation: The Secret Sauce of Low-Coverage Whole-Genome Sequencing

The price of affordability

For over a decade after the first human genome was sequenced in 2003, the price of whole-genome sequencing was simply too high for most people to afford. At the turn of the decade, sequencing a genome cost nearly $100,000. Today the cost is around $1,000, which is still too expensive for many people.

At Nebula Genomics, our mission is to make personal genome sequencing affordable for almost everyone. To this end, we launched low-coverage whole-genome sequencing. It costs less than $100 and sequences more than a billion positions across the human genome. However, while it generates one thousand times more data than other similarly priced products (e.g. the 23andMe genetic test), it still leaves many genomic regions unsequenced (Figure 1). How can these gaps be filled? The answer is imputation, a statistical method for inferring missing data. 

Figure 1. View of a low-coverage (0.4x) BAM file in the Integrative Genomics Viewer (IGV). The sequence has significant gaps. However, imputation enables us to take such BAM files and produce VCF files that contain information on most of the common genetic variants.


Think of the sentence “I drove a red ca_ down the road.” You may have noticed that there is a missing letter in the sentence, but your brain likely made a quick judgment to determine that filling in the blank space with an “r” made the most sense! How, though, did your brain settle on that letter when there are 26 letters in the alphabet? By using context clues from the rest of the sentence, such as words like “road” and “drove”, your brain decided that “cap” or “cat” wouldn’t make much sense. On the other hand, inserting “car” created a meaningful sentence.

Imputation of DNA sequences works quite similarly. In this case, the context clues are nearby stretches of DNA that have been sequenced. This works because letters in a genomic sequence are not inherited independently. Instead, they are inherited together as a genomic region that constitutes an (almost) unchanging unit. This means that if one knows some of the letters at a particular stretch of DNA, one can quite reliably predict (impute) all the other letters in the sequence. This enables us to sequence genomes at low coverage and to fill the gaps with high accuracy.
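To make the idea concrete, here is a deliberately simplified sketch in Python. It is not our production pipeline: real imputation software relies on far more sophisticated statistical models and on large reference panels. The reference haplotypes, the `impute` function, and the gapped read below are all made up for illustration. The sketch simply picks the reference sequence that agrees best with the letters we did observe and copies its letters into the gaps.

```python
# Toy illustration only. The reference panel and the read are hypothetical,
# and real imputation tools use much more elaborate statistical models.

# Hypothetical reference panel: fully sequenced haplotypes over the same 10 positions.
REFERENCE_PANEL = [
    "ACGTTGCAAC",
    "ACGATGCTAC",
    "TCGATGGTAA",
]


def impute(observed, panel=REFERENCE_PANEL):
    """Fill '_' gaps in `observed` using the best-matching reference haplotype."""

    def matches(reference):
        # Count positions that were sequenced and agree with this reference.
        return sum(
            1 for seen, ref in zip(observed, reference)
            if seen != "_" and seen == ref
        )

    best = max(panel, key=matches)
    # Keep every observed letter; borrow from the best reference only in gaps.
    return "".join(
        ref if seen == "_" else seen for seen, ref in zip(observed, best)
    )


low_coverage_read = "AC_A__CT_C"  # '_' marks an unsequenced position
print(impute(low_coverage_read))  # ACGATGCTAC
```

The six sequenced letters all agree with the second reference haplotype, so its letters fill the four gaps. This is the genomic equivalent of choosing “car” because of “drove” and “road”.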

This works quite well for genomic regions with a well-known sequence, but it often fails to identify rare genetic variants and novel mutations that have not been seen before. Going back to the earlier example, perhaps a small fraction of readers earn their living as taxi drivers. For them, “cab” might have made more sense to fill in the missing part of the sentence. If we always guess that the reader chose “car”, our accuracy will be less than 100%, and we will be wrong for everyone who inserted “cab” instead. This is why imputation works quite well for common genetic variants but usually misses rare ones. Figure 2 shows a significant drop-off in accuracy below a variant allele frequency of 0.05, that is, for variants that account for fewer than 5% of the copies of that genomic position in the population. For this reason, our current reporting only includes common variant alleles that have a population frequency of over 5%.

Figure 2. Imputation of unsequenced genetic variants is very accurate for common alleles but the accuracy decreases significantly for rare variants (< 5% frequency in the population). 
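The reporting cutoff described above can be pictured as a simple filter. The sketch below is only an illustration: the variant names, the frequencies, and the `filter_common_variants` function are hypothetical, and the only value taken from this post is the 5% threshold.

```python
# Illustrative sketch: variant names and frequencies are invented.
MIN_ALLELE_FREQUENCY = 0.05  # the 5% cutoff mentioned in the text

# Hypothetical imputed variants as (variant name, population allele frequency).
imputed_variants = [
    ("variant_A", 0.32),
    ("variant_B", 0.04),
    ("variant_C", 0.051),
    ("variant_D", 0.002),
]


def filter_common_variants(variants, min_frequency=MIN_ALLELE_FREQUENCY):
    """Keep only variants whose allele frequency is strictly above the cutoff."""
    return [
        (name, frequency)
        for name, frequency in variants
        if frequency > min_frequency
    ]


for name, frequency in filter_common_variants(imputed_variants):
    print(name, frequency)
```

In this made-up list, only variant_A and variant_C pass the filter. The two rarer variants are left out because their imputed values would be less trustworthy.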

Use case: Polygenic scores

A particularly good use case for imputed genomic data is the calculation of polygenic scores, a feature that we introduced recently. The reason for that is simple. Because polygenic scores are calculated from many genetic variants, imputing a few variants incorrectly won’t significantly change the calculated score.
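A small numerical sketch shows why. A polygenic score is commonly computed as a weighted sum: for each variant, the number of copies of the allele a person carries (0, 1 or 2) is multiplied by that variant's effect size, and the products are added up. The numbers below are invented for illustration (1,000 variants with identical effect sizes, five of them imputed incorrectly) and do not come from any real study.

```python
# Illustrative sketch: effect sizes and genotypes are made up.

def polygenic_score(allele_counts, effect_sizes):
    """Weighted sum of allele counts (0, 1 or 2 copies) across variants."""
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("need exactly one effect size per variant")
    return sum(count * weight for count, weight in zip(allele_counts, effect_sizes))


NUM_VARIANTS = 1000
effect_sizes = [0.01] * NUM_VARIANTS   # hypothetical, identical for simplicity
true_counts = [1] * NUM_VARIANTS       # hypothetical true genotypes

# Pretend imputation got 5 of the 1,000 variants wrong (0 copies instead of 1).
imputed_counts = list(true_counts)
for index in range(5):
    imputed_counts[index] = 0

true_score = polygenic_score(true_counts, effect_sizes)
imputed_score = polygenic_score(imputed_counts, effect_sizes)
relative_error = abs(true_score - imputed_score) / true_score

print(f"true={true_score:.2f} imputed={imputed_score:.2f} error={relative_error:.1%}")
```

In this toy setup, five wrong variants out of a thousand shift the score by about half a percent. With real data the effect sizes differ from variant to variant, so the impact of an individual error depends on how strongly that variant is weighted.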

The Nebula Library is updated weekly and today contains over one hundred research studies. For most of these studies, we calculated polygenic scores for all of our users. To learn how the latest advances in human genetics research might relate to you, order our low-coverage whole-genome sequencing and see the power of imputation for yourself! Alternatively, you can upload your existing 23andMe or AncestryDNA files. We will impute your data and give you FREE two-week access to the entire Nebula Library!

