Large-scale graph partitioning with Apache Giraph
Facebook’s architecture relies on various services that answer queries about people and their friends. Because of the size of the dataset, number of queries per second, and latency requirements, many of these systems cannot run on a single machine. Instead, people and their metadata are sharded across several machines. In such a distributed environment, answering queries might require communication among all these servers. Imagine that one of our services requires fetching information about all the friends of a specific person. Such a query would be first directed to the shard that contains the person's data (including the list of friends). Then, for each friend, a query would be issued to the respective shard, asking for the required information. All the responses would then be aggregated to form the final answer. If people are distributed randomly across shards (for example, by hashing their user IDs), such a query may hit almost all the machines in the system and require a lot of network traffic, which could result in high latency.
We have solved this problem by using graph partitioning, which can be formalized as follows: Given an undirected graph G = (V, E) and a natural number k, we want to partition the vertex set V into k equal-sized subsets, so as to maximize the number of edges that have both endpoints in the same partition (we call those "local edges"). The balance constraint here is important: If too many people are allocated to the same machine, it will become a bottleneck, trumping savings in network I/O. Further, the machine may not be able to store all the data necessary (this is true for in-memory systems like Giraph). This requirement is what makes it prohibitive to compute an optimal solution for large graphs – even a simplified version with two partitions (the "minimum bisection problem") can be proven to be NP-hard. Approximate algorithms exist, but they haven’t been proven on massive, real-world graphs yet. Hence we were looking for a heuristic that worked at our scale while still providing high edge locality.