High-Performance, Billion-Scale Similarity Search

Author: Pat Lasserre

In Part 1 of this series, we introduced the concept of embedding vectors. In Part 2, we discussed how embedding vectors can be used in similarity search applications to provide personalized customer recommendations, which in turn create a better customer experience.

In this third and final part of this blog series, we will discuss the approximate nearest neighbor (ANN) algorithm. We will present how ANN can be used to address the billion-scale similarity search problem that many large e-commerce and social media companies are faced with. Those companies have massive databases that are growing by the day.

We will also introduce a new technology that efficiently handles the billion-scale similarity search problem and does so with low latency, which is critical for many online applications

Scaling k-NN

As mentioned in Part 2, the conventional way of finding nearest neighbors is to do an exhaustive search of the database, where every point in the dataset is compared to every other point. Exhaustive search does not scale well, however, and it becomes impractical with billion-scale datasets because it would be too slow.

A better approach would be to first do a quick filter to find a subset of the candidates that have a high probability of being the nearest neighbors. That is what the approximate nearest neighbor (ANN) algorithm does, and in doing so it trades off a slight amount of quality of results for a large performance gain and also provides a way to scale nearest neighbor search.

Narrowing the Search Problem with Clusters

One important tool that the ANN algorithm uses is clustering, where similar items are grouped into clusters. Figure 1 below shows an example of how a dataset could be grouped into different clusters of similar items.

Figure 1 Example of dataset grouped into clusters of similar items.

Once a dataset has been grouped into clusters of similar items, a quick filter can be performed by comparing the query vector to the centroids of each cluster.

The centroid of a cluster is the mean of the values of the data in that particular cluster. Only the data items from the closet clusters are then sent on to the next stage of the similarity search pipeline. This significantly reduces the number of items that need to be compared against.

This technique is called an approximate nearest neighbor search since only a subset of the original dataset is being sent to the next stage of the search, and thus it is not 100 percent accurate. The results are still very accurate, however, because there is a high probability that the nearest neighbors will be among the filtered set since they came from the clusters closet to the query.


One way to gauge the accuracy of an ANN search is to measure recall. There are a few different ways to measure recall, one of which is recall@R, which is defined as the fraction of queries where the actual nearest neighbor is ranked in the first R results. For example, recall@100 checks to see if the actual nearest neighbor is within the first 100 results returned.


Latency is one of the key performance metrics for ANN solutions. Some ANN algorithms present latency numbers based off of batched requests. While batching is a good way to queue up requests if the overall system latency requirement is less critical, for many real-world online applications, having the lowest latency possible is critical. In those cases, requests are processed one-by-one and the latency numbers are presented on a query-by-query basis.

Gemini APU — Millisecond Query-by-Query Latency

Providing low latency with high recall for query-by-query searches on billion-scale datasets is a difficult task. That said, GSI Technology recently published a paper that presented strong results for the DEEP1B dataset.

In that paper, we introduce the Gemini® Associative Processing Unit (APU) and present the Gemini APU’s role in an Approximate Nearest Neighbor (ANN) similarity search pipeline. We also present latency and recall numbers for query-by-query ANN searches on the DEEP1B dataset. DEEP1B is a public dataset consisting of 1 billion, 96-dimensional vectors with each dimension being a 32-bit floating point (FP32) number.

The DEEP 1B dataset test with 4 Gemini APU boards running at 400 MHz delivered query-by-query latency performance results of 1.25 milliseconds with recall@160 of 92.5% in a 1U form factor. To our knowledge, this is the first published record of near 1 millisecond latency in such a low form factor with this level of accuracy on such a large dataset for query-by-query search.


In this three part series, we showed examples of companies leveraging word2vec-like models to learn embeddings. They use those embeddings as part of a similarity search pipeline in order to provide personalized recommendations for their customers, ultimately gaining customer loyalty.

We discussed the approximate nearest neighbor (ANN) algorithm and how it can be used to address the billion-scale similarity search problem that many large e-commerce and social media companies are faced with.

Finally, we presented the Gemini APU which serves as the cornerstone of an ANN similarity search solution that addresses the billion-scale similarity search problem and does so with query-by-query latency of around one millisecond. This means that the Gemini APU is well-equipped for the many online applications where low latency is critical.

©2024 GSI Technology, Inc. All Rights Reserved