Poster
Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling
Ping Li · Xiaoyun Li · Cun-Hui Zhang

Tue Dec 10th 05:30 -- 07:30 PM @ East Exhibition Hall B + C #34

Jaccard similarity is widely used as a distance measure in many machine learning and search applications. Typically, hashing methods are essential for the use of Jaccard similarity to be practical in large-scale settings. For hashing binary (0/1) data, the idea of one permutation hashing (OPH) with densification significantly accelerates traditional minwise hashing algorithms while providing unbiased and accurate estimates. In this paper, we propose a strategy named “re-randomization” in the process of densification that could achieve the smallest variance among all densification schemes. The success of this idea naturally inspires us to generalize one permutation hashing to weighted (non-binary) data, which results in the socalled “bin-wise consistent weighted sampling (BCWS)” algorithm. We analyze the behavior of BCWS and compare it with a recent alternative. Extensive experiments on various datasets illustrates the effectiveness of our proposed methods.

Author Information

Ping Li (Baidu Research USA)
Xiaoyun Li (Rutgers University)
Cun-Hui Zhang (Rutgers)

More from the Same Authors