How to Start Exploring your Raw Genomic Data

Note: Nebula Genomics no longer supports downloading BAM files. Instead, users can download CRAM files, which store the same alignment data as BAM files in a more compact form.

Next-Generation Sequencing

DNA is a molecule that encodes the blueprint of every living organism. DNA is a chain-like molecule of variable length made of four building blocks, commonly called letters. The four letters of DNA are adenine (A), thymine (T), cytosine (C), and guanine (G). Methods that determine the letter sequence of DNA molecules are called sequencing. Next-generation sequencing (NGS) is a high-throughput DNA sequencing technology that enables the reading of billions of DNA molecules in parallel. This generates billions of short sequencing reads (~ 150 letters) that are stored in text files in the FASTQ format.
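To make the FASTQ format concrete, here is a minimal Python sketch that reads records from FASTQ text. It assumes the common four-line layout (header, sequence, `+` separator, quality string) with Phred+33 quality encoding; the names `FastqRecord` and `read_fastq` and the example read are invented for illustration.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # Phred quality score for each letter


def read_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\r\n")
        handle.readline()  # the '+' separator line carries no data here
        quality_line = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or len(sequence) != len(quality_line):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        qualities = [ord(char) - phred_offset for char in quality_line]
        yield FastqRecord(header[1:], sequence, qualities)


# A made-up seven-letter read; real reads are ~150 letters long.
example_text = "@read1\nGATTACA\n+\nIIIIII#\n"
records = list(read_fastq(io.StringIO(example_text)))
for record in records:
    print(record.name, record.sequence, record.qualities)
```

Each quality character encodes how confident the sequencer was in the corresponding letter, which is why the sequence and quality lines must have the same length.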

We launched Basic WGS to create an affordable entry to personal whole genome sequencing. Basic WGS is shallow whole-genome sequencing at an average coverage of 0.4x per base, which yields ~1.3 billion sequenced bases. For scale, the human genome contains ~6.4 billion bases when both copies of each chromosome are counted. In comparison, most other personal genomics companies, including 23andMe and AncestryDNA, use microarray-based genotyping that reads the human genome at only ~500,000 positions.
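The coverage figures can be checked with a little arithmetic. The sketch below assumes that coverage is measured against one copy of the genome (~3.2 billion bases, half of the ~6.4 billion figure) and that reads are ~150 letters long. The estimate of how much of the genome is touched by at least one read uses a simple Poisson model in which reads land uniformly at random, which is an idealization and not a description of the actual pipeline.

```python
import math

# Assumption: coverage is defined relative to one (haploid) copy of the genome.
HAPLOID_GENOME_BASES = 3.2e9
READ_LENGTH = 150      # approximate read length in letters
MEAN_COVERAGE = 0.4    # Basic WGS average coverage per base


def sequenced_bases(coverage: float, genome_bases: float) -> float:
    """Total number of bases read by the sequencer."""
    return coverage * genome_bases


def reads_needed(coverage: float, genome_bases: float, read_length: int) -> float:
    """Approximate number of reads required to reach the given coverage."""
    return sequenced_bases(coverage, genome_bases) / read_length


def fraction_covered(coverage: float) -> float:
    """Expected fraction of positions hit at least once (Poisson model)."""
    return 1.0 - math.exp(-coverage)


total_bases = sequenced_bases(MEAN_COVERAGE, HAPLOID_GENOME_BASES)
total_reads = reads_needed(MEAN_COVERAGE, HAPLOID_GENOME_BASES, READ_LENGTH)
covered = fraction_covered(MEAN_COVERAGE)

print(f"Sequenced bases: {total_bases:.3g}")
print(f"Reads needed:    {total_reads:.3g}")
print(f"Covered at 0.4x: {covered:.1%}")
print(f"Covered at 30x:  {fraction_covered(30):.1%}")
```

Under these assumptions roughly a third of positions are read directly at 0.4x, which is why the remaining positions have to be inferred statistically.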

Sequencing Data Processing

The continuous DNA sequence of a human genome can be computationally reconstructed by using overlaps between short sequencing reads. The reconstruction is easier if a reference genome is available to which the sequencing reads can be aligned. Using a reference genome is possible because members of a species are genetically highly similar. For instance, any two human genome sequences are almost identical.

For Basic WGS we use the human reference genome GRCh37 (hg19). For Deep and Ultra Deep WGS we use GRCh38 (hg38).
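The idea of placing a read on a reference genome can be illustrated with a toy sketch. The code below slides each read along a short, made-up reference and keeps the position with the fewest mismatches. This brute-force scan is only an illustration: production aligners use indexed data structures and also handle insertions and deletions, and nothing here reflects the specific aligner used for Nebula data.

```python
from typing import List, Tuple


def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_alignment(read: str, reference: str) -> Tuple[int, int]:
    """Return (0-based position, mismatches) of the best ungapped placement."""
    best_position, best_mismatches = -1, len(read) + 1
    for position in range(len(reference) - len(read) + 1):
        window = reference[position:position + len(read)]
        mismatches = count_mismatches(read, window)
        if mismatches < best_mismatches:
            best_position, best_mismatches = position, mismatches
    return best_position, best_mismatches


# Invented toy data: a 22-letter "reference" and three 8-letter "reads".
reference = "ACGTTAGCCGATAGGCTTAACG"
reads: List[str] = ["TAGCCGAT", "GATAGGCT", "GGCTTTAC"]

for read in reads:
    position, mismatches = best_alignment(read, reference)
    print(f"{read} -> position {position}, {mismatches} mismatch(es)")
```

The third read differs from the reference at one letter. Such differences between aligned reads and the reference are the raw material for identifying variants.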

A sequence alignment tool maps the short reads stored in a FASTQ file to the GRCh37 or GRCh38 reference genome (Figure 1). This generates a Binary Alignment Map (BAM) file and an associated BAI (Binary Alignment Index) file. FASTQ files are typically discarded after generating BAM files since no information is lost during the alignment process. BAM files can be easily transformed back into FASTQ files, for example using samtools:

samtools fastq input.bam > output.fastq
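A BAM file is the binary form of the text-based SAM format, and `samtools view` prints its records as tab-separated SAM lines. The following Python sketch parses one such line to show what an alignment record contains. The example record is invented, and the helper names (`parse_sam_line`, `reference_span`, and so on) are illustrative rather than part of any library.

```python
import re
from typing import List, NamedTuple, Tuple


class Alignment(NamedTuple):
    read_name: str
    flag: int                      # bitwise flags describing the alignment
    contig: str                    # chromosome or contig name
    position: int                  # 1-based leftmost aligned position
    mapping_quality: int
    cigar: List[Tuple[int, str]]   # (length, operation) pairs
    sequence: str


CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> Alignment:
    """Parse the mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\r\n").split("\t")
    cigar = [(int(n), op) for n, op in CIGAR_OPERATION.findall(fields[5])]
    return Alignment(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), cigar, fields[9])


def reference_span(alignment: Alignment) -> int:
    """Number of reference bases covered (operations that consume reference)."""
    return sum(n for n, op in alignment.cigar if op in "MDN=X")


def is_reverse_strand(alignment: Alignment) -> bool:
    return bool(alignment.flag & 0x10)


def is_unmapped(alignment: Alignment) -> bool:
    return bool(alignment.flag & 0x4)


# Invented record: 5 soft-clipped bases, 140 aligned, 1 deleted, 5 aligned.
example_line = "\t".join([
    "read1", "16", "1", "10468", "60", "5S140M1D5M", "*", "0", "0",
    "A" * 150, "I" * 150,
])
alignment = parse_sam_line(example_line)
print(alignment.contig, alignment.position, reference_span(alignment),
      "reverse" if is_reverse_strand(alignment) else "forward")
```

The sequence and quality fields at the end of each record are what make it possible to regenerate FASTQ data from a BAM file.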

Figure 1. Reconstruction of a genome by aligning short reads to a reference genome.

DNA Variant Calling

After sequencing reads are aligned to a reference genome, the differences between the sequenced genome and the reference genome can be identified. This process is called “variant calling” and produces files in the Variant Call Format (VCF). For Basic WGS, we impute the unsequenced portion of the genome using a set of reference genomes that was generated by the 1000 Genomes Project. This yields an average accuracy of ~99% per base across the whole genome, which is sufficiently high for predicting ancestry and traits. For users who want to gain insight into disease risks, carrier status, and pharmacogenomics, we will soon launch our clinical-grade whole genome sequencing, which achieves higher accuracy by sequencing each position in the genome 30 times on average.
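To show what a VCF file records, here is a small Python sketch that parses VCF data lines for a single sample and translates the genotype field (GT) into letters. The two example lines, including the positions and the `rs0000001`-style identifiers, are invented. A real file also carries header lines starting with `#` and may contain additional per-sample fields.

```python
from typing import Dict, List, Optional


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Parse one single-sample VCF data line into a dictionary."""
    fields = line.rstrip("\r\n").split("\t")
    chrom, pos, variant_id, ref, alt = fields[:5]
    alleles = [ref] + alt.split(",")  # index 0 is always the reference allele
    # FORMAT (column 9) names the values stored in the sample column (10).
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    genotype = sample["GT"]
    indices = genotype.replace("|", "/").split("/")
    called: List[Optional[str]] = [
        None if index == "." else alleles[int(index)] for index in indices
    ]
    return {
        "chrom": chrom, "pos": int(pos), "id": variant_id, "ref": ref,
        "genotype": genotype, "alleles": called, "phased": "|" in genotype,
    }


def differs_from_reference(record: Dict[str, object]) -> bool:
    """True if at least one called allele is not the reference allele."""
    return any(a is not None and a != record["ref"] for a in record["alleles"])


# Invented example content, not real variant data.
vcf_text = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n"
    "1\t1000000\trs0000001\tG\tA\t.\tPASS\t.\tGT\t0|1\n"
    "1\t1000500\trs0000002\tC\tT\t.\tPASS\t.\tGT\t0|0\n"
)
vcf_records = [
    parse_vcf_line(line) for line in vcf_text.splitlines()
    if line and not line.startswith("#")
]
for record in vcf_records:
    print(record["id"], record["alleles"], differs_from_reference(record))
```

A genotype of `0|1` means one copy matches the reference and the other carries the alternative letter, while `0|0` means both copies match the reference.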

Exploring Genomic Data

The first iteration of Basic WGS reporting includes prediction of ancestry and 27 different traits. However, it is important to understand that personal genome sequencing is the beginning of a journey that will continuously yield more insight, especially as science advances and new discoveries are made. Thus we will be regularly adding new traits to our reports as well as continuously increasing the granularity of our ancestry predictions.

We also give our users access to their genomic data (BAM, BAI and VCF files) and invite them to explore their data themselves. Because uploading personal genomic data to third-party websites poses privacy risks, we want to introduce a few tools that can be used locally on personal computers.

Viewing BAM files with a genome browser

Genome browsers are used for browsing through reads that are aligned to a reference genome sequence and stored in a BAM file. You can try out the Integrative Genomics Viewer (IGV).

  1. Download IGV for your operating system and install it.
  2. Download your BAM and BAI files through your Nebula Genomics account.
  3. Open IGV and set the reference genome to hg19 (dropdown in the top left) and download it for better performance (Figure 2). To do this, go to the menu bar and select “Genomes” → “Load Genome from Server …” → “Human hg19” and check the box for “Download Sequence”.
  4. Drag and drop your BAM file into IGV. Your BAI file must be in the same folder as your BAM file.
  5. View your sequencing reads aligned to the reference genome by selecting a chromosome (1) or searching by gene name (2) and then zooming into the sequence (3).
Figure 2. Integrative Genomics Viewer.

Determining mtDNA haplogroup

Mitochondria are cell organelles that generate most of the cell’s supply of chemical energy. Mitochondria also have their own genome that is passed on by mothers to their children. Human mitochondrial DNA (mtDNA) haplogroups represent the major branch points in the evolutionary path of the female lineage. They make it possible to trace modern humans back to their origins in Africa and to follow their subsequent spread around the globe (Figure 3).

Figure 3. mtDNA haplogroups around the globe. Adapted from FamilyTreeDNA.

You can determine your haplogroup by analyzing mtDNA reads in your BAM file. For this, you can use the BAM Analysis Kit.

  1. Download and launch the BAM Analysis Kit. This tool is available for Windows PCs only. (Windows troubleshoot)
  2. Choose “M” for mtDNA (1) as shown in Figure 4. Uncheck all other boxes.
  3. Click “Browse” (2) and select your BAM file.
  4. Click Start Analysis. The processing can take up to an hour.
  5. Open the MtDNA_Haplogroup.txt file to find your mtDNA haplogroup.
Figure 4. Determining mtDNA haplogroup with BAM Analysis Kit.
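Before starting an analysis that can take up to an hour, it may be worth checking how many reads in your BAM file map to the mitochondrial genome. One way to do this is to run `samtools idxstats input.bam`, which prints one tab-separated line per reference sequence (name, length, mapped reads, unmapped reads). The Python sketch below summarizes such output. The read counts are invented, the ~150-letter read length is an assumption, and the mitochondrial contig name varies between references (for example `MT` or `chrM`).

```python
from typing import Tuple

# Assumption: names commonly used for the mitochondrial contig.
MITOCHONDRIAL_NAMES = {"MT", "chrM", "chrMT", "M"}
READ_LENGTH = 150  # assumed read length in letters


def summarize_mtdna(idxstats_text: str) -> Tuple[int, int, int]:
    """Return (mtDNA mapped reads, total mapped reads, mtDNA contig length)."""
    mito_reads = total_reads = mito_length = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, _unmapped = line.split("\t")
        total_reads += int(mapped)
        if name in MITOCHONDRIAL_NAMES:
            mito_reads += int(mapped)
            mito_length = int(length)
    return mito_reads, total_reads, mito_length


# Invented counts in the layout printed by `samtools idxstats`.
example_idxstats = (
    "1\t249250621\t500000\t0\n"
    "MT\t16569\t2000\t0\n"
    "*\t0\t0\t1234\n"
)
mito_reads, total_reads, mito_length = summarize_mtdna(example_idxstats)
mean_depth = mito_reads * READ_LENGTH / mito_length
print(f"{mito_reads} of {total_reads} mapped reads are mitochondrial")
print(f"Approximate mean mtDNA depth: {mean_depth:.1f}x")
```

Because each cell typically contains many copies of the mitochondrial genome, mtDNA depth is usually much higher than the genome-wide average.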

Converting WGS Files to 23andMe Files

The 23andMe file format is currently the most popular format for personal genomic data, so most consumer-focused tools take files in this format as input. To use these tools, you can convert your WGS data into the 23andMe format. Note that WGS files contain much more information than 23andMe files. Converting into the 23andMe format discards a lot of information for the sake of compatibility with commonly used tools.

Deep and Ultra Deep WGS VCF Files

Note: This method is for high-pass (Deep and Ultra Deep WGS) data only.

Download WGSExtract. A 23andMe file can be generated from our Deep/Ultra Deep WGS CRAM file following these instructions.

Basic WGS VCF files

Note: The Python scripts below are for low-pass (Basic WGS) data only.

1. Download VCF-to-23andMe. The two scripts in this directory require Python 3.

2. First, run the data_to_db.py script using your VCF file as input. This generates the genome.db file:

> python3 data_to_db.py input.vcf.gz vcf genome.db

3. Then run the db_to_23.py script using the genome.db file as input. This produces a file in the 23andMe format:

> python3 db_to_23.py genome.db blank_v3.txt 23andMe.txt
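After the conversion, a quick sanity check of the output can be helpful. A 23andMe raw data file is plain text with comment lines starting with `#`, followed by tab-separated rows of rsid, chromosome, position, and genotype, where `--` marks a position without a call. The Python sketch below reads such content and counts called positions. The example rows and identifiers are invented, and the function names are illustrative only.

```python
import io
from typing import Iterator, TextIO, Tuple

Row = Tuple[str, str, int, str]  # (rsid, chromosome, position, genotype)


def read_23andme(handle: TextIO) -> Iterator[Row]:
    """Yield data rows from 23andMe-format text, skipping comment lines."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        yield rsid, chromosome, int(position), genotype


def count_calls(handle: TextIO) -> Tuple[int, int]:
    """Return (total rows, rows with a missing genotype)."""
    total = missing = 0
    for _rsid, _chromosome, _position, genotype in read_23andme(handle):
        total += 1
        if "-" in genotype:
            missing += 1
    return total, missing


# Invented example content, not real genotype data.
example_file = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000000\tAG\n"
    "rs0000002\t1\t1000500\t--\n"
    "rs0000003\tMT\t100\tC\n"
)
total_rows, missing_rows = count_calls(io.StringIO(example_file))
print(f"{total_rows} positions, {missing_rows} without a call")
```

To check your own file, replace the `io.StringIO(...)` argument with `open("23andMe.txt")`.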

Calculating Neanderthal DNA Percentage

Neanderthals are an extinct species of humans who lived in Eurasia until about 40,000 years ago. Because Neanderthals interbred with modern humans, most people have some Neanderthal DNA in their genome. You can use the Ancient Calculator to find out how much of your genome is shared with Neanderthals and other ancient human relatives.

  1. Download and launch Ancient Calculator (Figure 5). This tool is available for Windows PCs only.
  2. Select an ancient DNA sample that you want to match your genetic data against (1). For example, select “Altai Neanderthal”.
  3. Click “BROWSE” and select your genomic data in the 23andMe format that you have generated from your VCF file. The calculation takes just a few seconds.
Figure 5. Ancient Calculator.

