Home Optimizing Genomics Research: Reducing the High Computational Cost of DNA Sequence Alignment
Author: Pat Lasserre
Introduction
DNA sequence alignment is an important tool used in many genomic computation applications like understanding genetic diseases and evolutionary biology. The computational cost of aligning DNA sequences to reference genomes, however, is high. With the increasing volume of genetic data, it’s important to find solutions that minimize this computational cost.
This post explains why DNA sequence alignment is computationally expensive and shows how effective filtering of candidates can reduce this computational load. We will explore how accelerating the Myers’ bit-parallel edit-distance algorithm with GSI Technology’s Gemini-I APU, a compute-in-memory chip, can optimize the candidate filtering step.
By the end of this post, you’ll have a better understanding of the challenges in DNA sequence alignment, solutions to address them, and how these advancements will help biological and medical research.
DNA Sequence Alignment
DNA sequence alignment is a key tool used in computational genomics for biological and medical research and discovery.
Sequence alignment helps researchers with tasks such as better understanding diseases and studying evolution. By comparing the DNA sequences of different species, scientists can understand their evolutionary relationships to see how genes have evolved over time. Sequence alignment can also help identify genetic mutations that may cause diseases by comparing the DNA sequences of patients with certain diseases to those without.
Sequence alignment works by aligning DNA sequences of bases adenine (A), cytosine ©, guanine (G) and thymine (T) to reference genomes. This allows scientists to identify regions of similarity and difference. Figure 1 below shows a query sequence being compared to a target sequence.
The Computational Cost of DNA Sequence Alignment
Reference-guided genome assembly requires multiple steps to align a collection of read sequences against a long reference sequence.
The first step is to find candidate locations for each read (query) in the reference genome. Then, each query is aligned to its set of candidates.
Since DNA sequences can be very long, aligning them can take a lot of computation. The alignment step is the most computationally expensive step in the reference-guided genome assembly pipeline.
To help reduce this computation, filtering can be used to reduce the number of candidates for alignment. It also allows the researcher to focus on candidates that are most likely to add meaningful information.
One approach to filtering is to calculate the edit distance between each query and candidate pair to estimate their similarity. The edit distance algorithm calculates the minimum number of edits needed to make a query and candidate align exactly, with an edit being an insertion, deletion, or substitution (Figure 2). Candidates can be filtered out based on a predefined edit distance threshold.
Researchers from Cornell accelerated the filtering step of DNA sequence alignment by using the Myers’ bit-parallel algorithm to measure the edit distance between two DNA sequences.
The Myer’s Bit-Parallel Edit Distance Algorithm
The Myers’ algorithm aligns DNA sequences by calculating edit distances through bitwise operations such as AND, OR, XOR, and NOT. Bitwise operations work on the individual bits of a binary value as seen in Figure 3 below.
Accelerating The Myers’ Algorithm
To accelerate the Myers’ bit-parallel edit distance, Cornell used GSI Technology’s Gemini-I, which is a compute-in-memory (CIM) chip that enables bit processing.
The Gemini-I uses bit-slicing, where each data bit is processed independently. For example, for a 16-bit data item, each of the 16 bits can be processed individually.
The Gemini-I’s bit-slicing architecture stores the corresponding bits of multiple data elements together. For example, the first bits of a set of 16-bit numbers are stored together, all the second bits together, etc. This can be seen in Figure 4, where Bit Slice 0 holds all the 0 bits for each 16-bit data item, Bit Slice 1 holds all the 1 bits, etc.
As mentioned previously, the Myers’ bit-parallel algorithm uses bitwise operations such as AND, OR, XOR, and NOT to calculate edit distances. Bit-slicing is well-suited for these operations because each bit is processed independently, and it can execute them in parallel across all bits of the data elements. This makes Gemini-I’s bit-slicing architecture ideal for the Myers’ algorithm.
The Cornell researchers compared the performance of the APU to an Intel Xeon Gold 6230R CPU @ 2.1GHz with DDR4 when running the Myers’ bit-parallel edit distance algorithm. In their paper, they only used 1 of the 4 cores available on a Gemini-I APU chip for their performance comparison, but they “expect a roughly linear multicore speedup on the APU (similar to what was seen with the CPU), leading to an estimated total speedup of 5–6x for a 4-core APU over a 16-core CPU.”
Conclusion
DNA sequence alignment plays a key role in computational genomics applications such as disease research and understanding evolutionary relationships. It requires a lot of computation, but filtering candidates can greatly reduce that computational load.
Researchers from Cornell used the Myers’ bit-parallel edit-distance algorithm to filter candidates. They found that GSI Technology’s Gemini-I could accelerate the Myers’ algorithm 5–6x when compared to traditional CPUs.
The Gemini-I is a compute-in-memory chip that uses bit-slicing, where each data bit is processed independently — ideal for accelerating the bitwise operations of the Myers’ algorithm.
Click here to access the full Cornell paper, where you will find the details, including microcode, of how the Gemini-I can accelerate the filtering step in DNA sequence alignment.