
A survey of recent learn-to-hash research

Many ML algorithms, for example those used in NLP and information retrieval, output high-dimensional feature vectors for their respective objects of interest. The information sources are thus transformed into a vector database. When an object is to be searched, the ML algorithm similarly infers a feature vector, a.k.a. a query point, for that object. The nearest neighbors, i.e. the records in the database that are closest to the query point under a chosen distance measure, are then returned as the search results. As the popularity and maturity of ML in information retrieval increase, we can expect this mechanism to be used more and more.
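To make the mechanism concrete, here is a minimal sketch of such a nearest-neighbor lookup with NumPy. The random database and query arrays simply stand in for feature vectors produced by whatever ML model is in use, and brute-force Euclidean distance is used only for illustration:

import numpy as np

rng = np.random.default_rng(0)
database = rng.standard_normal((10_000, 128))  # feature vectors inferred for the information sources
query = rng.standard_normal(128)               # query point inferred for the searched object

# Brute-force nearest-neighbor search: compute the distance from the query
# to every record, then keep the indices of the closest ones.
distances = np.linalg.norm(database - query, axis=1)
top_k = np.argsort(distances)[:5]              # the 5 nearest records, returned as search results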

Recommendation algorithms also make heavy use of high-dimensional feature vectors to represent users of interest and items to be recommended. Items whose feature vectors are nearest to a target user's feature vector are the ones typically recommended to that user. Recommendation systems are vital components of today's e-commerce sites and web portals.

However, computing distances between two high-dimensional vectors can be computationally expensive. In the era of big data, we often need to deal with databases of billions of records. A linear scan of the database records, followed by expensive distance computations, is simply impractical. This is where hashing algorithms, or more accurately, learn-to-hash algorithms come to the rescue. The simple idea behind this class of hashing algorithms is that, for each high-dimensional vector, we can derive a short binary hash code such that similarities in the original vector space are preserved by the simple Hamming distance in the hash code space. This way, computing distances among data points in the code space is trivially fast. Combined with other techniques, this lets us deliver fast, sub-linear search implementations.
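As a minimal illustration of the idea, the sketch below generates binary codes from the sign of random projections (classic locality-sensitive hashing; a learn-to-hash method would instead learn the projection matrix from data) and ranks the database by Hamming distance, which reduces to a bitwise XOR followed by bit counting:

import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 128, 64
database = rng.standard_normal((100_000, dim))   # high-dimensional feature vectors
query = rng.standard_normal(dim)

# Random hyperplanes; a learn-to-hash algorithm would learn this matrix from data.
hyperplanes = rng.standard_normal((dim, n_bits))

def encode(vectors):
    # Each vector becomes a short {0, 1} code: one bit per hyperplane side.
    return (vectors @ hyperplanes > 0).astype(np.uint8)

codes = np.packbits(encode(database), axis=1)            # 64 bits -> 8 bytes per record
query_code = np.packbits(encode(query[None, :]), axis=1)

# Hamming distance in the code space: XOR the packed codes and count the set bits.
hamming = np.unpackbits(np.bitwise_xor(codes, query_code), axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:10]                    # closest codes, e.g. for re-ranking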

By combining powerful ML models that produce high-quality feature vectors with powerful learn-to-hash algorithms that produce short, efficient binary codes, we can design information retrieval systems that are both highly accurate and highly scalable. That is why I surveyed the field of learning to hash and summarized the results in a PPT.


A Fast Numerical Method for the Optimal Data Fusion in the Presence of Unknown Correlations

I recently published a conference paper, accessible on IEEE Xplore.

Published in: 2018 21st International Conference on Information Fusion (FUSION)

Abstract:

In the presence of unknown correlations, optimal data fusion, in the sense of Minimum Mean Square Error, can be formulated as the problem of minimizing a nondifferentiable but convex function. The popular projected subgradient methods are known to converge slowly. The single-projection optimal subgradient method, OSGA-V, is known to be numerically more efficient. This paper presents the formulations and methods necessary to apply the OSGA-V algorithm to the optimal data fusion problem, achieving a much faster convergence rate than the projected subgradient method. We expect this method to significantly reduce the computational cost and time needed to achieve optimal data fusion in the presence of unknown correlations.
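For readers unfamiliar with the baseline, here is a generic sketch of a projected subgradient method, the slowly converging approach the paper improves upon. It minimizes a toy nondifferentiable convex function (a pointwise maximum of linear functions) over the probability simplex; the objective, the simplex constraint, and the step sizes are illustrative assumptions, not the paper's actual fusion formulation:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # toy problem: f(w) = max_i (A @ w)_i is convex but nondifferentiable
w = np.full(3, 1.0 / 3.0)                # start at the center of the probability simplex

def project_to_simplex(v):
    # Euclidean projection onto {w : w >= 0, sum(w) = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0.0)

for k in range(1, 201):
    g = A[np.argmax(A @ w)]              # a subgradient of f at w: the gradient of the active linear piece
    w = project_to_simplex(w - g / k)    # diminishing step size 1/k, then project back onto the simplex

print("approximate minimizer:", w, "objective:", np.max(A @ w))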

Conference PPT:

XiaohaiFusion2018