Challenges for Efficient Query Processing in the Semantic Web
University: Universidad Simón Bolívar, Caracas, Venezuela
Time: 4:00pm (coffee: 3:30)
In the context of the Semantic Web, a large number of huge RDF linked datasets have become available, and this number keeps growing. At the same time, scalable RDF engines that follow the traditional optimize-then-execute paradigm have been developed to access RDF data locally, and SPARQL endpoints have been implemented for remote query processing. Although queries against locally stored data can be executed efficiently, remote query executions frequently fail, for two reasons. First, the most efficient RDF engines base their query processing algorithms on physical access and storage structures that are stored locally; however, because of the size of existing linked datasets, loading the data and their links is not always feasible. Second, remote linked data query processing can be extremely costly because of the lack of query planning; moreover, current techniques do not adapt to unpredictable data transfers or data availability, so executions can fail.
In this talk, I will describe both optimize-then-execute techniques and adaptive query processing strategies that have been developed to access RDF data; linked RDF datasets will be used to illustrate the performance of the proposed approaches. In the first part of the talk, I will describe query optimization and execution techniques for accessing locally stored RDF data. These techniques rewrite complex queries into queries composed of small star-shaped sub-queries; the optimized queries not only reduce execution time but can also benefit from caching data during query execution. The resulting plans can speed up execution by up to three orders of magnitude relative to the original queries, which may perform poorly. In the second part, I will describe ANAPSID, an adaptive query engine for SPARQL endpoints that adapts query execution schedulers to data availability and run-time conditions when data is accessed remotely. ANAPSID provides physical SPARQL operators that detect when a source becomes blocked or data traffic is bursty, and that opportunistically produce results as soon as data arrives from the endpoints. ANAPSID's performance will be compared with that of RDF stores and endpoints; experimental results will show that ANAPSID can reduce execution time, in some cases by more than one order of magnitude.
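To make the idea of star-shaped sub-queries concrete, here is a minimal Python sketch. It assumes that a star is simply the set of triple patterns sharing the same subject; the function name star_groups and the ex: predicates are hypothetical, and this is not the rewriting algorithm presented in the talk, which also has to choose star sizes and a join order.

```python
from collections import defaultdict


def star_groups(triple_patterns):
    """Group triple patterns by subject.

    Assumption: each group is treated as one subject-centered
    star-shaped sub-query. Real optimizers also bound the size of
    each star and decide how the stars are joined.
    """
    groups = defaultdict(list)
    for subject, predicate, obj in triple_patterns:
        groups[subject].append((subject, predicate, obj))
    return dict(groups)


# Hypothetical basic graph pattern; the predicates are made up
# purely for illustration.
bgp = [
    ("?drug", "ex:name", "?name"),
    ("?drug", "ex:treats", "?disease"),
    ("?disease", "ex:label", "?label"),
    ("?drug", "ex:interactsWith", "?other"),
]

stars = star_groups(bgp)
for subject, patterns in stars.items():
    print(subject, len(patterns))
```

In this sketch the four patterns fall into two stars, one centered on ?drug and one on ?disease, which would then be joined on the shared variable ?disease.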