The New Jersey Data Reduction Report
Authors: 
Daniel Barbara
Authors: 
William DuMouchel
Authors: 
Christos Faloutsos
Authors: 
Peter J. Haas
Authors: 
Joseph M. Hellerstein
Authors: 
Yannis Ioannidis
Authors: 
H.V. Jagadish
Authors: 
Theodore Johnson
Authors: 
Raymond Ng
Authors: 
Viswanath Poosala
Authors: 
Kenneth A. Ross
Authors: 
Kenneth C. Sevcik
Date published: 
1997
Published In: 
IEEE Data Engineering, Vol. 20, No. 4, Dec. 1997, pp. 3-46
Type: 
Journal Article
Abstract: 

There is often a need to get quick approximate answers from large databases. This leads to a need for data reduction. There are many different approaches to this problem, some of them not traditionally posed as solutions to a data reduction problem. In this paper we describe and evaluate several popular techniques for data reduction. Historically, the primary need for data reduction has been internal to a database system, in a cost-based query optimizer. The need is for the query optimizer to estimate the cost of alternative query plans cheaply – clearly the effort required to do so must be much smaller than the effort of actually executing the query, and yet the cost of executing any query plan depends strongly upon the numerosity of specified attribute values and the selectivities of specified predicates. To address these query optimizer needs, many databases keep summary statistics. Sampling techniques have also been proposed. More recently, there has been an explosion of interest in the analysis of data in warehouses. Data warehouses can be extremely large, yet obtaining answers quickly is important. Often, it is quite acceptable to sacrifice the accuracy of the answer for speed. Particularly in the early, more exploratory, stages of data analysis, interactive response times are critical, while tolerance for approximation errors is quite high. Data reduction, thus, becomes a pressing need. The query optimizer need for estimates was completely internal to the database, and the quality of the estimates used was observable by a user only very indirectly, in terms of the performance of the database system. On the other hand, the more recent data analysis needs for approximate answers directly expose the user to the estimates obtained. Therefore the nature and quality of these estimates becomes more salient. Moreover, to the extent that these estimates are being used as part of a data analysis task, there may often be “by-products” such as, say, a hierarchical clustering of data, that are of value to the analyst in and of themselves.

Related files: 

MaDgIK 2009-2018