Near Neighbor Join
An increasing number of Web applications such as friends recommendation depend on the ability to join objects at scale. The traditional approach taken is nearest neighbor join (also called similarity join), whose goal is to find, based on a given join function, the closest set of objects or all the objects within a distance threshold to each object in the input. The scalability of techniques utilizing this approach often depends on the characteristics of the objects and the join function. However, many real-world join functions are intricately engineered and constantly evolving, which makes the design of white-box methods that rely on understanding the join function impractical. Finding a technique that can join extremely large number of objects with complex join functions has always been a tough challenge. In this paper, we propose a practical alternative approach called near neighbor join that, although does not find the closest neighbors, finds close neighbors, and can do so at extremely large scale when the join functions are complex. In particular, we design and implement a super-scalable system we name SAJ that is capable of best-effort joining of billions of objects for complex functions. Extensive experimental analysis over real-world large datasets shows that SAJ is scalable and generates good results.