A survey of recent learn-to-hash research

Spread the love

Many ML algorithms, e.g. NLP and information retrieval ML algorithms, output high dimensional feature vectors for their respective objects of interests. Information sources are then transformed into a vector database. When an object is to be searched, the ML algorithm would similarly infer a feature vector, a.k.a a query point for the object. Then nearest neighbors, which are the nearest to the query point in terms of a certain distance measure from the information source database, are returned as the search results. As the popularity and maturity of ML increase in the field of information retrieval, we expect more and more usage of the aforementioned mechanism.

Recommendation algorithms also make heavy use of high dimensional feature vectors to represent users of interest and items to be recommended. Those items, whose feature vectors are nearest to the feature vector of a target user, are often recommended to the target user. Recommendation systems are vital components of today’s commerce and web portals.

However, computing distances between two high dimensional vectors can be computationally expensive. In the era of big data, we often need to deal with databases of billions of records. A linear scan of the database records, followed by expensive distance computations, is simply impractical. This is where hashing algorithms, or more accurately, learn-to-hash algorithms come to rescue. The simple idea behind these classes of hashing algorithms is that, for each high dimensional vector, we can derive a short binary hashing code, with the characteristics of preserving similarities in the original vector space using a simple Hamming distance in the hash code space. This way computation of distances among data points in the code space is trivially fast. Combined with other techniques, we can deliver sub-linear and fast search implementation.

Combining powerful ML modeling algorithms which produce feature vectors of high quality, and powerful learn-to-hash algorithms that produce short and efficient binary codes, we can design highly accurate and yet highly scalable information retrieval systems. And that is why I surveyed the field of learning to hash, and the result is the following PPT:


Leave a Reply