ML in Visual Search: Part I
Author: George Williams
Online retailers piece together different kinds of technology to reduce friction and keep us coming back as customers. From face recognition for payment to voice recognition for taking orders, AI is powering more and more of the ways we consumers search, discover, order, and ultimately get what we want.
Visual search is another example. With visual search, you can take a pic of something you like and it’s instantly matched with visually similar products in an online store’s inventory. Visual search is a feature in several mobile shopping apps including ones from Google, Facebook, Microsoft, Wayfair, Pinterest, and many others.
Visual search would not be possible without machine learning. In this blog post, one in a series, we’ll dive into the machine learning that lies at the heart of visual search. We’ll deconstruct eBay’s visual search and showcase how they leverage ML within their massive product inventory.
How can we get computers to compare images for similarity? It would certainly help if computers could see the world as we see it or, at the very least, perceive visual similarity as we humans do. This longstanding challenge in computer vision is as old as computer science itself. Recent advances in machine learning, specifically neural networks, have finally made this possible.
The basic component of a neural network is the neuron. A neuron is just a simple mathematical function that transforms several input signals into one output.
For example, the inputs could correspond to the pixels of an image. The output could indicate the presence of an object in the image: say, 1 if present and 0 otherwise.
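To make that concrete, here is a minimal sketch of a single neuron in Python. It assumes the common formulation of a weighted sum followed by a step function. The four-pixel "image", the weights, and the bias are illustrative values we made up, not learned ones.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs, followed by a step activation (0 or 1)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


# Hypothetical 4-pixel "image" with brightness values between 0 and 1.
pixels = [0.9, 0.8, 0.1, 0.0]

# Illustrative weights: this neuron fires when the first two pixels
# are bright and the last two are dark.
weights = [1.0, 1.0, -1.0, -1.0]
bias = -0.5

print(neuron(pixels, weights, bias))  # 1
```

Several input signals go in and one output comes out. Everything interesting about the neuron lives in the values of its weights and bias.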
Like its biological cousin, the mathematical neuron needs training and practice in order to learn. That training involves presenting it with lots of data: labeled examples of images, where the object is present in some images and absent in others. This is called supervised learning. The goal is that, after many iterations, the neuron learns to generalize and detect the object in any image, not just the training ones.
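Here is a minimal sketch of what that training loop can look like, using the classic perceptron update rule as one simple example of supervised learning. The toy dataset (two binary features, labeled 1 only when both are on) stands in for real labeled images and is purely our own illustration.

```python
def predict(weights, bias, x):
    """Single neuron: weighted sum plus bias, then a step function."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=20, lr=1.0):
    """Nudge the weights toward the correct label after every mistake."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in examples:
            error = label - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Toy labeled data: (features, label). Label is 1 only when both features are on.
examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

weights, bias = train_perceptron(examples)
print([predict(weights, bias, x) for x, _ in examples])  # [0, 0, 0, 1]
```

Each pass over the data is one iteration. The neuron starts out knowing nothing (all weights zero) and only changes when it gets an example wrong.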
It turns out that a single mathematical neuron isn’t powerful enough to solve the binary classification problem just described. A single neuron can only draw a straight-line (linear) boundary between the two classes, and no amount of training changes that.
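A classic way to see the limits of a single neuron is the XOR pattern, which is far simpler than an image but already out of reach. The sketch below is our own illustration: it brute-forces a small grid of integer weights and biases, and checks whether any setting reproduces a given truth table. The grid range is an arbitrary choice for the demo.

```python
from itertools import product


def neuron(x, w1, w2, b):
    """Single neuron with two inputs and a step activation."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0


def solvable(truth_table, grid=range(-5, 6)):
    """True if some (w1, w2, b) on the grid reproduces the whole table."""
    return any(
        all(neuron(x, w1, w2, b) == y for x, y in truth_table)
        for w1, w2, b in product(grid, repeat=3)
    )


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(solvable(AND))  # True
print(solvable(XOR))  # False
```

The search fails for XOR not because the grid is too small, but because no single straight line separates the two XOR classes.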
The “Not Hot Dog” app uses machine learning.
But, as in the biological brain, by connecting many neurons together in just the right way, we can create powerful algorithms that learn to detect not just one, but thousands of different kinds of objects in images (that’s called a multinomial, or multi-class, classification task).
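In a multi-class setting, the network typically produces one score per class, and a softmax function turns those scores into probabilities. The sketch below shows just that final step. The class labels and raw scores are hypothetical values chosen for illustration; in a real network they would come from the layers of neurons before it.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["hot dog", "pizza", "sneaker"]  # hypothetical classes
scores = [2.0, 0.5, -1.0]                 # hypothetical network outputs

probs = softmax(scores)
best = labels[probs.index(max(probs))]
print(best)  # hot dog
```

The predicted class is simply the one with the highest probability.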
Neural networks are now so powerful that they have become the preferred algorithm for tackling difficult problems not just in computer vision, but also in natural language and speech recognition.
Different problems require different kinds of neural networks. For image and video data, convolutional neural networks are used. For natural language and speech, convolutional and recurrent neural networks are used. There are also variational autoencoders, generative adversarial networks and other kinds of architectures.
Neural networks are a very active area of research in machine learning. But that hasn’t always been the case.
The computational power of neural networks is directly related to their size. Unfortunately, in the many decades that followed the neural network’s conception in the 1950s, researchers had difficulty increasing that size. As a result, research in neural networks languished and researchers flocked to other machine learning techniques.
In 2012, that all changed. A few clever algorithmic tricks enabled the training of much larger networks, into the millions of neurons. The availability of better training data and faster processing on GPUs were also critical factors.
Almost overnight, neural networks became state-of-the-art in the field of machine learning. Since then, neural networks have been successfully rebranded as deep learning. The current-day revolution (and hype) in AI has been largely fueled by deep learning. But fundamentally it’s a modern-day variant of a very old concept, the neural network.
Deep learning research continues at a brisk pace. That said, neural networks have also stepped out of the lab and into the real world. They can be found in mobile devices, laptops, and data center servers, conquering many hard problems that were once thought too challenging for machines to tackle. This includes visual search on massive image datasets.
eBay has a product catalog of nearly a billion items for sale at any one time. Each item belongs to one or more of thousands of product categories. With this data, eBay trains its neural network. The trained network is used to assess visual similarity in images and for automatic product categorization. It enables several important features at eBay:
Query Visual Search: In eBay’s ShopBot Facebook Messenger app, users can take a snapshot of any object and instantly receive visually similar matches from the eBay store.
Anchor Visual Search: eBay automatically matches existing items in the active inventory against one another. As users browse, they see the top visually similar matches for the item they are currently viewing.
Seller Categorization: When users upload a pic of an item to sell, the neural network predicts its product category. Using that information, eBay can place the product into its proper area in the store.
When a shopper uploads a query image for visual search, eBay’s neural network performs two important tasks:
Category Recognition: First, the neural network predicts the product category to which the image belongs.
Similarity: The neural network also generates a semantic hash for the image. Compared to the image itself, the hash is small (only 4096 bits vs. hundreds of kilobytes). eBay uses this hash to compute a similarity score against the hashes of all products in the predicted category.
eBay returns the products with the highest similarity scores as the result of the query.
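The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not eBay’s implementation. We assume the similarity score is based on Hamming distance (the number of differing bits), which is a common choice for binary hashes. The tiny random catalog, the item ids, and the function names are all hypothetical, and the predicted category is passed in as a plain string rather than coming from a network.

```python
import random

HASH_BITS = 4096  # size of each semantic hash, as described above


def hamming_distance(a, b):
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def similarity(a, b):
    """Assumed score: 1.0 for identical hashes, 0.0 when every bit differs."""
    return 1.0 - hamming_distance(a, b) / HASH_BITS


def top_matches(query_hash, catalog, category, k=3):
    """Rank only the items in the predicted category by similarity."""
    candidates = catalog.get(category, {})
    ranked = sorted(
        candidates.items(),
        key=lambda item: similarity(query_hash, item[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked[:k]]


# Hypothetical catalog: category -> {item id -> hash stored as a Python int}.
rng = random.Random(0)
catalog = {
    "shoes": {f"shoe-{i}": rng.getrandbits(HASH_BITS) for i in range(5)},
    "lamps": {f"lamp-{i}": rng.getrandbits(HASH_BITS) for i in range(5)},
}

# Simulate a query image that looks almost like "shoe-2": flip 10 bits.
query_hash = catalog["shoes"]["shoe-2"]
for bit in range(10):
    query_hash ^= 1 << bit

print(top_matches(query_hash, catalog, "shoes")[0])  # shoe-2
```

Restricting the comparison to one category keeps the candidate set small. Comparing bit strings with XOR and a bit count is also far cheaper than comparing the images themselves.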
eBay also implements “aspect” prediction, which attempts to detect brand, color, and other product attributes. Certain detected aspects can be used to re-rank the final matches returned to the user. See the paper for more details on this.
eBay does a few things to ensure low latency in the response to a visual search query:
eBay’s semantic hash method is just one type of distance metric learning. We’ll explore this concept further in the future blog post “A Picture Is Worth a Thousand…Numbers.”
eBay trains its neural network on the existing product images in the active inventory, using the existing products’ categories as training labels. eBay modified a standard neural network architecture (ResNet) to also learn the semantic hash function (see image above).
The eBay inventory is highly volatile. Items are bought and sold all the time. This means eBay needs to retrain on a regular basis. That’s a very time-consuming process for a billion-scale product catalog. For this reason, eBay trains on only a representative sample of each category, not on the entire catalog.
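We don’t know exactly how eBay draws its representative sample. As one plausible sketch, the snippet below takes a fixed-size random sample from each category (often called stratified sampling). The item ids, category names, and sample size are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_category(items, per_category, seed=0):
    """Pick up to `per_category` random items from every category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for item_id, category in items:
        by_category[category].append(item_id)

    sample = {}
    for category, ids in by_category.items():
        k = min(per_category, len(ids))  # small categories are kept whole
        sample[category] = rng.sample(ids, k)
    return sample


# Hypothetical inventory: 10 lamps and 20 shoes.
items = [(f"item-{i}", "shoes" if i % 3 else "lamps") for i in range(30)]

training_set = sample_per_category(items, per_category=5)
print({category: len(ids) for category, ids in training_set.items()})
# {'lamps': 5, 'shoes': 5}
```

Sampling per category, rather than from the catalog as a whole, keeps rare categories from being drowned out by popular ones.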
Once the neural network is trained, eBay can then recompute the semantic hash for every item in its catalog. Even though each hash is small (4096 bits), there are roughly a billion of them, one per catalog item. For this reason, eBay distributes them across several clustered machines.
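A quick back-of-the-envelope calculation shows why the hashes don’t comfortably fit on one machine. The sketch below also shows one simple way to assign items to machines, by hashing the item id. That sharding scheme, the item id, and the shard count are our own assumptions for illustration; eBay’s actual partitioning strategy isn’t described here.

```python
import zlib

HASH_BITS = 4096
BYTES_PER_HASH = HASH_BITS // 8  # 4096 bits = 512 bytes


def total_storage_gb(num_items):
    """Raw storage for the hashes alone, in decimal gigabytes."""
    return num_items * BYTES_PER_HASH / 1e9


def shard_for(item_id, num_shards):
    """Assumed scheme: a stable checksum of the item id picks the machine."""
    return zlib.crc32(item_id.encode("utf-8")) % num_shards


print(BYTES_PER_HASH)                   # 512
print(total_storage_gb(1_000_000_000))  # 512.0

# Hypothetical item id and shard count.
print(shard_for("item-123", num_shards=8))
```

For a catalog of about a billion items, that comes to roughly half a terabyte of raw hashes, before any indexing overhead or replication.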
In a future blog post, “Embedded Embeddings: New Hardware For Visual Search,” we’ll look at how new types of dedicated hardware can simplify both the storage and fast computational search of semantic hashes.
In our next blog post, we’ll continue with the second part of “ML in Visual Search.” We’ll showcase Pinterest’s visual search and recommender engine that powers the Lens Your Look app. Users of this app tend to contribute a lot of data about themselves to Pinterest over the course of time. Thus, Pinterest users expect to see visual search results that align with their sense of personal style. This kind of deep personalization adds a lot of value to visual search, but also some additional complexity.