H2RDF: Adaptive, Join-Scalable Querying of RDF Data in the Cloud
Speaker: Nikolaos Papailiou

Date: 06/04/2012
University: Computing Systems Laboratory, National Technical University of Athens (cslab.ntua)
Room: A56
Time: 4:00pm (coffee: 3:30)
Slides:
Abstract:

The proliferation of data in RDF format calls for efficient and scalable solutions for their management. Recent approaches have suggested managing RDF data in a decentralized fashion over the cloud, thereby scaling to arbitrarily large numbers of triples. However, such systems fail to genuinely exploit the capabilities offered by cloud computing, as they do not take full advantage of its distributed features in order to execute complex queries efficiently. In effect, despite their capacity to handle large data sets, such systems do not scale well when faced with substantially complex queries.

In this work we present H2RDF, a fully distributed RDF store that combines the MapReduce processing framework with a NoSQL distributed data store. Materializing three index structures over the RDF data in HBase, H2RDF can answer SPARQL queries on a virtually unlimited number of RDF triples. Moreover, in contrast to existing approaches, H2RDF can process complex join queries in a highly scalable fashion, while making adaptive decisions over both the join order and execution, assisted by a query planner; thereby, joins are executed using either single-machine or distributed jobs according to the amount of processing needed. Our extensive evaluation demonstrates that H2RDF efficiently answers both simple joins and complex multivariate queries and easily scales to 14 billion triples using a cluster of 15 nodes; it outperforms state-of-the-art distributed solutions in multi-join queries and nonselective queries, has comparable performance to centralized solutions in selective queries, and gains in throughput by concurrent execution.
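
The adaptive decision described in the abstract (choosing a join order and then running each join either on a single machine or as a distributed job) can be illustrated with a minimal sketch. This is not H2RDF's actual planner: the `TriplePattern` class, the `estimated_matches` statistic, the greedy ordering, and the threshold value are all hypothetical, chosen only to show the shape of such a decision.

```python
from dataclasses import dataclass

# Hypothetical cutoff: estimated join input (in triples) above which the
# join is handed to a distributed job. The value is an assumption made
# for illustration, not a figure from the H2RDF work.
DISTRIBUTED_THRESHOLD = 1_000_000


@dataclass
class TriplePattern:
    name: str
    estimated_matches: int  # assumed to come from index statistics


def order_by_selectivity(patterns):
    """Greedy join order: patterns with the fewest estimated matches first."""
    return sorted(patterns, key=lambda p: p.estimated_matches)


def choose_execution(patterns, threshold=DISTRIBUTED_THRESHOLD):
    """Pick 'centralized' or 'distributed' from the estimated input size."""
    total_input = sum(p.estimated_matches for p in patterns)
    return "distributed" if total_input > threshold else "centralized"


# Illustrative patterns with made-up cardinalities.
patterns = [
    TriplePattern("?x rdf:type ub:Student", 5_000_000),
    TriplePattern("?x ub:advisor <Prof1>", 40),
    TriplePattern("?x ub:takesCourse ?c", 800_000),
]

plan = order_by_selectivity(patterns)
mode_selective = choose_execution(plan[:1])  # only the most selective pattern
mode_full = choose_execution(plan)           # all three patterns together
```

Under these assumed numbers, a query touching only the highly selective pattern stays on a single machine, while the full three-pattern join exceeds the threshold and is routed to a distributed job.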
